dbpedia / fact-extractor

Fact Extraction from Wikipedia Text

If the same initial token appears in multiple annotations, training data only gets the first annotation #34

Closed marfox closed 9 years ago

marfox commented 9 years ago

We are trying to build training data in an n-gram fashion, replacing single tokens with the full annotated entity. See for instance the following sentence:

19  0   Ha  VER:pres    avere   Attività   O
19  1   giocato VER:pper    giocare Attività   B-LU
19  2   7   NUM @card@  Attività   O
19  3   partite NOM partita Attività   O
19  4   per PRE per Attività   O
19  5   la  DET:def il  Attività   O
19  6   Nazionale cipriota  ENT nazionale   Attività   B-Squadra_Attività
19  7   tra PRE tra Attività   O
19  8   il 2004 ENT il  Attività   B-Tempo_Attività
19  9   e   CON e   Attività   O
19  10  il 2004 ENT il  Attività   B-Tempo_Attività

See full sample output. The problem arises here.
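A minimal sketch of how this symptom could arise, assuming the n-gram replacement looks annotations up by their first token only. The function names, the trimmed sentence and the second annotation (`il 2006`) are hypothetical and not taken from the repo; the actual script may work differently.

```python
# Hypothetical, trimmed-down sentence: two annotations share the initial token "il".
TOKENS = ["tra", "il", "2004", "e", "il", "2006"]
ANNOTATIONS = [("il 2004", "Tempo_Attività"), ("il 2006", "Tempo_Attività")]


def merge_by_first_token(tokens, annotations):
    """Buggy variant: annotations are indexed by their first token only."""
    first_seen = {}
    for text, label in annotations:
        # A later annotation starting with the same token is silently dropped.
        first_seen.setdefault(text.split()[0], (text, label))
    out, i = [], 0
    while i < len(tokens):
        hit = first_seen.get(tokens[i])
        if hit:
            text, label = hit
            out.append((text, "B-" + label))
            i += len(text.split())  # skip the tokens covered by the n-gram
        else:
            out.append((tokens[i], "O"))
            i += 1
    return out


def merge_by_full_match(tokens, annotations):
    """Possible fix: an annotation applies only if all of its tokens match."""
    spans = [(text.split(), label) for text, label in annotations]
    out, i = [], 0
    while i < len(tokens):
        for words, label in spans:
            if tokens[i:i + len(words)] == words:
                out.append((" ".join(words), "B-" + label))
                i += len(words)
                break
        else:
            # No annotation starts at this position.
            out.append((tokens[i], "O"))
            i += 1
    return out


print(merge_by_first_token(TOKENS, ANNOTATIONS))  # last row repeats "il 2004"
print(merge_by_full_match(TOKENS, ANNOTATIONS))   # last row is "il 2006"
```

With the first function the second date is replaced by a copy of the first one, which is the pattern visible in rows 8 and 10 above.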

e-dorigatti commented 9 years ago

Hello Marco, I am trying to reproduce the issue, but the script expects the FE column headers to match the regex fe[0-9]{2}, whereas the headers in the sample CrowdFlower results don't. Moreover, that sentence is not included in the sample. Are you using a different sample from the one included in the resources folder? If so, could you please add it to the repo?

fsonntag commented 9 years ago

Hey Marco, @e-dorigatti is right, and the problem is not just the regex. Further on in the code, all row keys are expected to end with two digits, while the provided sample CrowdFlower results have just one (e.g. orig_fe..., fe...). I saw you changed the script in commit 5a45f56 to fit the new data. Can you please provide the new sample results data?
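To make the header mismatch described above concrete, here is a small sketch. The header names are hypothetical, and whether the script uses `search` or `match` is an assumption; only the pattern `fe[0-9]{2}` comes from the thread.

```python
import re

# Pattern quoted in the thread: "fe" followed by exactly two digits.
FE_TWO_DIGITS = re.compile(r"fe[0-9]{2}")
# Looser, purely illustrative alternative: one or more digits.
FE_ANY_DIGITS = re.compile(r"fe[0-9]+")

# Hypothetical CSV headers, mixing one-digit and two-digit suffixes.
HEADERS = ["sentence", "fe0", "fe1", "fe00", "fe01"]


def fe_columns(headers, pattern):
    """Return the headers in which the pattern is found."""
    return [h for h in headers if pattern.search(h)]


print(fe_columns(HEADERS, FE_TWO_DIGITS))  # ['fe00', 'fe01']
print(fe_columns(HEADERS, FE_ANY_DIGITS))  # ['fe0', 'fe1', 'fe00', 'fe01']
```

One-digit headers never match the two-digit pattern, so a script built around it would see no FE columns at all in the older sample.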

marfox commented 9 years ago

Whoops, I forgot to commit that file! Fixed

fsonntag commented 9 years ago

Hey Marco, there are also some output samples from TreeTagger missing. Can you provide them, too? Or should we just build them ourselves?

marfox commented 9 years ago

I've just updated them for your convenience

e-dorigatti commented 9 years ago

The tags seem to be wrong, for example:

70  5   campionati  ENT campionato  Attività   B-Competizione_Attività
70  6   islandese   ENT islandese   Attività   B-Competizione_Attività

Should I fix this too?

marfox commented 9 years ago

What do you mean? If you mean the 'ENT' tag should not exist, we are minting a specific tag for annotated n-grams, so that's correct. If you mean there shouldn't be 2 'ENT' tags in the example, this originates from the chunk combination, so it is not related to this issue.

e-dorigatti commented 9 years ago

No, I was talking about B-Competizione_Attività. Shouldn't islandese be tagged as I-Competizione_Attività?

marfox commented 9 years ago

No, we are changing strategy here. We want the entities to be n-grams. Your example is an Italian corner case, as we have 2 different championships referenced with a common word 'campionati' in plural form. So, this is a tricky one and I'm open to discussion. Ideally, we should resolve the co-reference, but that's another story. I would stick to 3 isolated entities for now. Any suggestions?
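For reference, a rough sketch of the difference between token-level B-/I- tagging and the n-gram strategy, where each entity becomes a single row with one B- tag. The function name and the token-level input rows are hypothetical; only the merged row `Nazionale cipriota` / `B-Squadra_Attività` appears in the sample above.

```python
def collapse_bio(rows):
    """Merge I- continuation tokens into the preceding B- row of the same label."""
    merged = []
    for token, tag in rows:
        continues_entity = (
            tag.startswith("I-")
            and merged
            and merged[-1][1] == "B-" + tag[2:]
        )
        if continues_entity:
            prev_text, prev_tag = merged[-1]
            merged[-1] = (prev_text + " " + token, prev_tag)
        else:
            # O tags, B- tags and stray I- tags are kept as separate rows.
            merged.append((token, tag))
    return merged


# Hypothetical token-level tagging of part of the sentence in the first comment.
token_rows = [
    ("la", "O"),
    ("Nazionale", "B-Squadra_Attività"),
    ("cipriota", "I-Squadra_Attività"),
    ("tra", "O"),
]
print(collapse_bio(token_rows))
# [('la', 'O'), ('Nazionale cipriota', 'B-Squadra_Attività'), ('tra', 'O')]
```

Under this scheme an I- tag never appears in the output, which is why two adjacent entities both carry a B- tag.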
