Handle a new input format in the CrowdFlower-to-training data conversion script

marfox commented 9 years ago

Adapt crowdflower_results_into_training_data.py given this CrowdFlower results sample. Please make a pull request against the no-chunker branch.

e-dorigatti commented 9 years ago

How should different FEs with the same name be handled? For example, sentence 79 ("Dopo aver giocato nella Dinamo Kiev, nel 2008 si trasferisce all'Amkar Perm") has two distinct FEs both tagged as "Squadra", "Dinamo Kiev" and "Amkar Perm". Should I use some sort of progressive identifier? This would produce the following tags:

Dinamo    B-Squadra1_Attività
Kiev      I-Squadra1_Attività
Amkar     B-Squadra2_Attività
Perm      I-Squadra2_Attività

marfox commented 9 years ago

Nope, the FE should remain the same. But it would be interesting to check whether your suggestion has an impact on classifier performance, so can you keep it as an alternative behavior for later use?
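A minimal sketch of the two behaviors discussed above: plain FE labels by default, with progressive identifiers kept as an opt-in alternative. The function name `bio_tags`, the `numbered` flag, and the input shape (a list of `(fe_name, tokens)` spans) are hypothetical and not taken from crowdflower_results_into_training_data.py. The unnumbered label form `Squadra_Attività` is also an assumption, inferred by dropping the digit from the tags shown earlier.

```python
from collections import defaultdict


def bio_tags(spans, frame, numbered=False):
    """Return (token, tag) pairs for a list of (fe_name, tokens) spans.

    numbered=False: every span of the same FE shares one label.
    numbered=True: each span of an FE gets a progressive identifier.
    """
    counters = defaultdict(int)
    tagged = []
    for fe_name, tokens in spans:
        counters[fe_name] += 1
        suffix = str(counters[fe_name]) if numbered else ""
        label = f"{fe_name}{suffix}_{frame}"
        for i, token in enumerate(tokens):
            # First token of a span opens it (B), the rest continue it (I).
            prefix = "B" if i == 0 else "I"
            tagged.append((token, f"{prefix}-{label}"))
    return tagged


# Sentence 79: two distinct "Squadra" FEs.
spans = [("Squadra", ["Dinamo", "Kiev"]), ("Squadra", ["Amkar", "Perm"])]
plain = bio_tags(spans, "Attività")
numbered = bio_tags(spans, "Attività", numbered=True)
```

Keeping the numbering behind a flag lets both variants be produced from the same input, so their effect on the classifier can be compared later.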

e-dorigatti commented 9 years ago

Sure. How should punctuation be handled? At the moment I have to include it in the tagging, otherwise sequences of contiguous tokens cannot be recognized as belonging to the same entity. For example, sentence 53 ("Ha giocato nella massima serie dei campionati russo, turco, azero e finlandese.") has token "russo" at position 7, token "turco" at position 8 and token "azero" at position 9. It is impossible to tell whether they belong to different FEs (do they?), because the rule is that contiguous tokens of the same type (Competizione, in this case) belong to the same FE.

So what I have at the moment is this, which I assume is not the expected result:

53  0   Ha          VER:pres  avere       Attività  O
53  1   giocato     VER:pper  giocare     Attività  B-LU
53  2   nella       PRE:det   nel         Attività  O
53  3   massima     ADJ       massimo     Attività  B-Competizione1_Attività
53  4   serie       NOM       serie       Attività  I-Competizione1_Attività
53  5   dei         PRE:det   del         Attività  O
53  6   campionati  NOM       campionato  Attività  O
53  7   russo       ADJ       russo       Attività  B-Competizione2_Attività
53  8   ,           PON       ,           Attività  I-Competizione2_Attività
53  9   turco       ADJ       turco       Attività  I-Competizione2_Attività
53  10  ,           PON       ,           Attività  I-Competizione2_Attività
53  11  azero       ADJ       azero       Attività  I-Competizione2_Attività
53  12  e           CON       e           Attività  O
53  13  finlandese  ADJ       finlandese  Attività  B-Competizione3_Attività
53  14  .           SENT      .           Attività  O

The problem is that I am using column headers to find token positions, so "russo", "turco" and "azero" only appear to be contiguous. I can use TreeTagger's output to determine the tokens' positions, but what if there are identical words? That would mess up the indexing...
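One possible way around the identical-words concern, sketched under the assumption that the annotated words arrive in sentence order: look each word up in the tagger's token list starting just after the previous match, so a repeated word resolves to its next unused occurrence. The helper name `locate_tokens` and its inputs are hypothetical. The token list is copied from the table above.

```python
def locate_tokens(sentence_tokens, annotated_words):
    """Map each annotated word to its index in sentence_tokens.

    Assumes annotated_words follow sentence order. The search resumes
    after the previous match, so duplicates are not collapsed onto the
    first occurrence. Raises ValueError if a word cannot be found.
    """
    positions = []
    cursor = 0
    for word in annotated_words:
        index = sentence_tokens.index(word, cursor)
        positions.append(index)
        cursor = index + 1
    return positions


# Tokens of sentence 53 as they appear in the tagger output.
sentence_53 = [
    "Ha", "giocato", "nella", "massima", "serie", "dei", "campionati",
    "russo", ",", "turco", ",", "azero", "e", "finlandese", ".",
]
adjective_positions = locate_tokens(sentence_53, ["russo", "turco", "azero"])
comma_positions = locate_tokens(sentence_53, [",", ","])
```

With real positions, "russo", "turco" and "azero" are no longer adjacent, since the commas sit between them. The forward search is still ambiguous when an annotation skips an earlier identical word, so it is not a full solution.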

marfox commented 9 years ago

On 4/2/15 5:14 PM, e-dorigatti wrote:

> How should punctuation be handled? At the moment I have to include it into tagging, otherwise sequences of contiguous tokens cannot be recognized as belonging to the same entity.

I see, but punctuation should be tagged with O.

> The problem is that I am using column headers to find token position, so that "russo", "turco" and "azero" only appear to be contiguous. I will use the treetagger output to determine tokens' position.

e-dorigatti commented 9 years ago

So there will never be a situation in which punctuation lies between tokens belonging to the same entity? Is this a safe assumption?

marfox commented 9 years ago

The assumption is derived from the examples we found so far. It may be anecdotal, so we should validate it with more sentences. I think it will naturally emerge during training/testing.
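The convention discussed in this thread (punctuation tagged O, contiguous tokens of the same type forming one FE) could look roughly like the sketch below. It is an illustration, not the script's actual code: `tag_tokens`, the `(word, pos, fe_name)` input shape, and the unnumbered label form are hypothetical. Treating `PON` and `SENT` as the only punctuation POS tags is also an assumption, based on the sample table.

```python
# POS tags treated as punctuation (assumed from the sample output above).
PUNCTUATION_POS = {"PON", "SENT"}


def tag_tokens(tokens, frame):
    """Return one BIO tag per (word, pos, fe_name) token.

    fe_name is None for tokens outside any FE. Punctuation is always
    tagged O and therefore ends the current FE span.
    """
    tags = []
    previous_fe = None
    for word, pos, fe_name in tokens:
        if pos in PUNCTUATION_POS or fe_name is None:
            tags.append("O")
            previous_fe = None
            continue
        # Continue the span only if the previous token had the same FE.
        prefix = "I" if fe_name == previous_fe else "B"
        tags.append(f"{prefix}-{fe_name}_{frame}")
        previous_fe = fe_name
    return tags


# Part of sentence 53, with the commas initially marked as Competizione.
tokens = [
    ("massima", "ADJ", "Competizione"),
    ("serie", "NOM", "Competizione"),
    ("dei", "PRE:det", None),
    ("campionati", "NOM", None),
    ("russo", "ADJ", "Competizione"),
    (",", "PON", "Competizione"),
    ("turco", "ADJ", "Competizione"),
    (",", "PON", "Competizione"),
    ("azero", "ADJ", "Competizione"),
    ("e", "CON", None),
    ("finlandese", "ADJ", "Competizione"),
    (".", "SENT", None),
]
tags = tag_tokens(tokens, "Attività")
```

Under this rule "russo", "turco" and "azero" each open their own FE. If a later sentence turns out to have punctuation inside a single entity, the rule would split it, which is the case the assumption above still needs to be validated against.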

dbpedia / fact-extractor

Handle a new input format in the CrowdFlower-to-training data conversion script #32