Closed: marfox closed this issue 9 years ago
Hello Marco, I am trying to reproduce the issue, but the script expects the FE column headers to match the regex fe[0-9]{2},
whereas the headers in the sample CrowdFlower results don't. Moreover, that sentence is not included in the sample. Are you using a different sample from the one included in the resources folder? If so, could you please add it to the repo?
Hey Marco, @e-dorigatti is right, and the problem is not just the regex. Further down in the code, all row keys are expected to end in two digits, while the keys in the provided sample CrowdFlower results end in just one (e.g. orig_fe..., fe...). I saw you changed the script in commit 5a45f56 to fit the new data. Can you please provide the new sample results data?
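To make the mismatch concrete, here is a minimal sketch of how a two-digit pattern rejects one-digit headers. It assumes Python 3 and assumes the check is a full match against the header; the actual script may apply the regex differently. The function name and the header names are hypothetical and are not taken from the sample file.

```python
import re

# Pattern quoted in this thread: "fe" followed by exactly two digits.
FE_HEADER = re.compile(r"fe[0-9]{2}")


def split_fe_headers(headers):
    """Partition headers by whether they fully match FE_HEADER."""
    matching = [h for h in headers if FE_HEADER.fullmatch(h)]
    rejected = [h for h in headers if not FE_HEADER.fullmatch(h)]
    return matching, rejected


# Hypothetical header names, for illustration only.
matching, rejected = split_fe_headers(["fe00", "fe01", "fe0", "fe1"])
print(matching)  # ['fe00', 'fe01']
print(rejected)  # ['fe0', 'fe1']
```

With one-digit headers every FE column ends up in the rejected list, which would explain why the issue cannot be reproduced with the old sample.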
Whoops, I forgot to commit that file! Fixed
(In reply to e-dorigatti's comment of 4/17/15, 9:23 AM: https://github.com/dbpedia/fact-extractor/issues/34#issuecomment-93933066)
Hey Marco, there are also some output samples from TreeTagger missing. Can you provide them, too? Or should we just build them ourselves?
I've just updated them for your convenience
(In reply to Felix Sonntag's comment of 4/17/15, 10:49 AM: https://github.com/dbpedia/fact-extractor/issues/34#issuecomment-93946082)
The tags seem to be wrong, for example:
70 5 campionati ENT campionato Attività B-Competizione_Attività
70 6 islandese ENT islandese Attività B-Competizione_Attività
Should I fix this too?
What do you mean? If you mean the 'ENT' tag should not exist, we are minting a specific tag for annotated n-grams, so that's correct. If you mean there shouldn't be 2 'ENT' tags in the example, this originates from the chunk combination, so it is not related to this issue.
(In reply to e-dorigatti's comment of 4/17/15, 11:02 AM: https://github.com/dbpedia/fact-extractor/issues/34#issuecomment-93948000)
No, I was talking about B-Competizione_Attività. Shouldn't islandese be tagged as I-Competizione_Attività?
No, we are changing strategy here. We want the entities to be n-grams. Your example is an Italian corner case, as we have 2 different championships referenced with a common word, 'campionati', in plural form. So this is a tricky one and I'm open to discussion. Ideally, we should resolve the co-reference, but that's another story. I would stick to 3 isolated entities for now. Any suggestions?
(In reply to e-dorigatti's comment of 4/17/15, 11:09 AM: https://github.com/dbpedia/fact-extractor/issues/34#issuecomment-93948937)
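For readers comparing the two readings discussed above, this is a small sketch contrasting conventional IOB labelling (B- on the first token of a span, I- on the following ones) with the "isolated entities" reading, where each token is its own entity and therefore gets B-. The helper names are hypothetical and this is not the project's code. The two tokens and the label are the ones quoted earlier in the thread.

```python
def iob_labels(n_tokens, label):
    """Conventional IOB: B- on the first token of a span, I- on the rest."""
    return ["B-" + label if i == 0 else "I-" + label for i in range(n_tokens)]


def isolated_labels(n_tokens, label):
    """Each token is treated as a separate entity, so every token gets B-."""
    return ["B-" + label] * n_tokens


label = "Competizione_Attività"
tokens = ["campionati", "islandese"]

# Reading assumed by the question: one two-token span.
print(list(zip(tokens, iob_labels(len(tokens), label))))
# Reading proposed in the reply: two separate one-token entities.
print(list(zip(tokens, isolated_labels(len(tokens), label))))
```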
We are trying to build training data in an n-gram fashion, replacing single tokens with the full annotated entity. See for instance the following sentence:
See full sample output. The problem arises here.
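As a rough illustration of the n-gram idea described above, the sketch below replaces each annotated token span with a single token holding the full entity. It assumes entity spans are given as (start, end) token indices with end exclusive, end greater than start, and no overlaps. The function name, the span format, and the toy sentence are all hypothetical and are not taken from the project's data or code.

```python
def merge_entities(tokens, entities):
    """Replace each annotated token span with one n-gram token.

    tokens:   list of str
    entities: list of (start, end) index pairs, end exclusive,
              assumed non-overlapping with end > start.
    """
    spans = {start: end for start, end in entities}
    merged = []
    i = 0
    while i < len(tokens):
        if i in spans:
            # Collapse the whole span into a single n-gram token.
            merged.append(" ".join(tokens[i:spans[i]]))
            i = spans[i]
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical toy input, for illustration only.
tokens = ["the", "national", "football", "team", "won"]
print(merge_entities(tokens, [(1, 4)]))
# ['the', 'national football team', 'won']
```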