ArneBinder / dialam-2024-shared-task

see http://dialam.arg.tech/

Preprocessing fixes #25

Closed tanikina closed 5 months ago

tanikina commented 5 months ago

This addresses the following issues:

  1. Issue https://github.com/ArneBinder/dialam-2024-shared-task/issues/24#issue-2259055552 with the following warnings:
    [WARNING] - doc.id=17945: Skipping invalid example, cannot get argument token slices for {LabeledSpan(start=2013, end=2051, label='L', score=1.0): 'Claire Cooper : I am a syke ol gist. \xa0', ...}

    Problematic node: {"nodeID":"513425","text":"Claire Cooper : I am a syke ol gist. \u00a0","type":"L","timestamp":"2020-05-28 20:37:40"}

    The whitespace at the end of the string causes a mismatch: the tokenizer consumes all extra whitespace (there is no mapping for it in char_to_token_mapper), so we get the wrong offsets. Note that extra spaces do not necessarily appear only at the end of a string. In the following example there are two spaces after the colon: "Andy Burnham :  in the last week he has sounded more like the Chancellor of the Exchequer than Health Secretary"

Note: the warning shows " \xa0" while the original node text shows " \u00a0". Both denote the same character (U+00A0, the no-break space); the difference comes from the JSON conversion:

>>> import json
>>> json.loads(json.dumps("\u00a0"))
'\xa0'

Proposed fix: https://github.com/ArneBinder/dialam-2024-shared-task/commit/b71b0f9de3def964a8880e7f32ec5ba2def9fb2e

  2. Issue https://github.com/ArneBinder/dialam-2024-shared-task/issues/20#issuecomment-2072503720
    We get multiple "-rev" suffixes when the same RA-node appears in multiple relations that need to be reverted. See the example from nodeset 18471 (image: nodeset-18471-rev-rev-ra-node). Here we revert the "Default Inference" relation twice, but the "-rev" suffix should be added only once.

    Proposed fix: https://github.com/ArneBinder/dialam-2024-shared-task/commit/e3c566a7a0ac5318879584c96c97fcbff5b3f744

  3. Loop warnings when processing the data, e.g. for nodeset_id=18321:
    Detected loop nodes: {'543218', '543222', '543226'}

    We can reduce the number of such warnings from 34 to 19 by removing obvious self-loops, which are related to the annotation problems discussed with the shared task organizers.

    Proposed fix: https://github.com/ArneBinder/dialam-2024-shared-task/commit/e91ca5c74031a4fdb105fa0821c17aae3426bd29
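
For illustration, the three fixes above could look roughly like the sketch below. This is not the code from the linked commits: all function names are hypothetical, and representing edges as (source, target) ID pairs is an assumption made only for this example.

```python
from typing import List, Tuple

REV_SUFFIX = "-rev"


def normalize_whitespace(text: str) -> str:
    """Collapse every whitespace run (including U+00A0) to a single space
    and drop leading/trailing whitespace."""
    return " ".join(text.split())


def trim_span(text: str, start: int, end: int) -> Tuple[int, int]:
    """Shrink the half-open span [start, end) so that it neither starts
    nor ends on a whitespace character."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


def add_rev_suffix_once(label: str) -> str:
    """Append "-rev" unless the label already ends with it."""
    return label if label.endswith(REV_SUFFIX) else label + REV_SUFFIX


def remove_self_loops(edges: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop edges whose source and target are the same node.
    Assumption: edges are given as (source_id, target_id) pairs."""
    return [(src, dst) for src, dst in edges if src != dst]


# Node text from the warning above, with a trailing space and U+00A0.
node_text = "Claire Cooper : I am a syke ol gist. \xa0"
start, end = trim_span(node_text, 0, len(node_text))
print(repr(node_text[start:end]))

# Reverting the same relation twice still yields a single suffix.
print(add_rev_suffix_once(add_rev_suffix_once("Default Inference")))

# The second edge is a self-loop and gets removed.
print(remove_self_loops([("543218", "543222"), ("543222", "543222")]))
```

Note that trim_span only handles whitespace at the span boundaries. Extra spaces inside a string, as in the "Andy Burnham" example, would need something like normalize_whitespace applied consistently to both the node text and the text it is matched against.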

ArneBinder commented 5 months ago

statistics after the fix

when training with experiment=dialam2024_merged_relations

train

| label | available | used |
| --- | --- | --- |
| s_nodes:Default Conflict | 749 | 749 |
| s_nodes:Default Inference | 1981 | 1981 |
| s_nodes:Default Inference-rev | 1801 | 1801 |
| s_nodes:Default Rephrase | 3673 | 3673 |
| s_nodes:NONE | 8052 | 8052 |
| ya_i2l_nodes:Agreeing | 10 | 10 |
| ya_i2l_nodes:Arguing | 4 | 4 |
| ya_i2l_nodes:Asserting | 15695 | 15695 |
| ya_i2l_nodes:Assertive Questioning | 198 | 198 |
| ya_i2l_nodes:Challenging | 22 | 22 |
| ya_i2l_nodes:Default Illocuting | 22 | 22 |
| ya_i2l_nodes:NONE | 288 | 288 |
| ya_i2l_nodes:Pure Questioning | 951 | 951 |
| ya_i2l_nodes:Restating | 5 | 5 |
| ya_i2l_nodes:Rhetorical Questioning | 189 | 189 |
| ya_s2ta_nodes:Agreeing | 18 | 18 |
| ya_s2ta_nodes:Arguing | 3654 | 3654 |
| ya_s2ta_nodes:Asserting | 15 | 15 |
| ya_s2ta_nodes:Challenging | 33 | 33 |
| ya_s2ta_nodes:Default Illocuting | 447 | 447 |
| ya_s2ta_nodes:Disagreeing | 704 | 704 |
| ya_s2ta_nodes:NONE | 9200 | 9200 |
| ya_s2ta_nodes:Pure Questioning | 5 | 5 |
| ya_s2ta_nodes:Restating | 3230 | 3230 |
| ya_s2ta_nodes:Rhetorical Questioning | 2 | 2 |

validation

| label | available | used |
| --- | --- | --- |
| s_nodes:Default Conflict | 83 | 83 |
| s_nodes:Default Inference | 214 | 214 |
| s_nodes:Default Inference-rev | 202 | 202 |
| s_nodes:Default Rephrase | 404 | 404 |
| s_nodes:NONE | 877 | 877 |
| ya_i2l_nodes:Agreeing | 1 | 1 |
| ya_i2l_nodes:Asserting | 1704 | 1704 |
| ya_i2l_nodes:Assertive Questioning | 19 | 19 |
| ya_i2l_nodes:Challenging | 2 | 2 |
| ya_i2l_nodes:Default Illocuting | 5 | 5 |
| ya_i2l_nodes:NONE | 39 | 39 |
| ya_i2l_nodes:Pure Questioning | 111 | 111 |
| ya_i2l_nodes:Rhetorical Questioning | 18 | 18 |
| ya_s2ta_nodes:Agreeing | 1 | 1 |
| ya_s2ta_nodes:Arguing | 408 | 408 |
| ya_s2ta_nodes:Challenging | 2 | 2 |
| ya_s2ta_nodes:Default Illocuting | 57 | 57 |
| ya_s2ta_nodes:Disagreeing | 80 | 80 |
| ya_s2ta_nodes:NONE | 1013 | 1013 |
| ya_s2ta_nodes:Pure Questioning | 2 | 2 |
| ya_s2ta_nodes:Restating | 348 | 348 |