ArneBinder / dialam-2024-shared-task

see http://dialam.arg.tech/

convert prediction to shared task format #31

Closed ArneBinder closed 2 months ago

ArneBinder commented 2 months ago

This PR implements the following methods:

The predicted output can be loaded via

from src.serializer import JsonSerializer
from src.document.types import (
    TextDocumentWithLabeledEntitiesAndNaryRelations,
)

docs = JsonSerializer.read(
    path="data/prediction",
    file_name="test_documents.jsonl",
    document_type=TextDocumentWithLabeledEntitiesAndNaryRelations,
)
# just process the first document for now
document_with_prediction = docs[0]

Then, the documents can be converted:

from dataset_builders.pie.dialam2024.dialam2024 import (
    convert_to_example,
    unmerge_relations,
) 

# convert to SimplifiedDialAM2024Document
unmerged_document = unmerge_relations(document_with_prediction)
# convert to shared task format
result = convert_to_example(unmerged_document, use_predictions=True)

And save to file:

import json

# get and remove the doc id, it should not be part of the file content
doc_id = result.pop("id")
with open(f"{doc_id}.json", "w") as f:
    json.dump(result, f, indent=2)
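
To write one file per document instead of only the first one, the save step can be wrapped in a small helper. This is a minimal sketch, not part of the PR: `save_example` and the output directory are hypothetical names, and it only assumes that each converted result is a JSON-serializable dict with an "id" entry, as in the snippet above.

```python
import json
from pathlib import Path
from typing import Any, Dict


def save_example(result: Dict[str, Any], output_dir: str) -> Path:
    """Write one converted example to <output_dir>/<id>.json (hypothetical helper)."""
    # work on a shallow copy so the caller's dict keeps its "id" entry
    content = dict(result)
    # the doc id becomes the file name and must not be part of the file content
    doc_id = content.pop("id")
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{doc_id}.json"
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(content, f, indent=2)
    return out_path


# Usage with the functions from this PR (the directory name is an assumption):
# for doc in docs:
#     result = convert_to_example(unmerge_relations(doc), use_predictions=True)
#     save_example(result, "data/prediction/nodesets")
```

Copying the dict before removing the id keeps the converted result intact in case it is inspected again after saving.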

Unfortunately, the conversion requires additional metadata (original nodes, edges, and locations) that was previously not added to the document correctly, so the predictions must be created with this PR branch for the conversion to work.

~NOTE: THIS IS NOT YET FULLY TESTED!~

tanikina commented 2 months ago

Thanks a lot for the update and the detailed instructions! I tested this code on the predictions from xlm-roberta-large (based on this model, seed 1), and the code above generates correct-looking nodesets. There were only two (and a half) issues:

  1. test_map2.json could not be processed because of the following error:

     ValueError: Expected all roles of n-ary relation s_nodes:Default Rephrase to be prefixed with s_nodes:, got ya_s2ta_nodes:source.

     I found the following annotation in nary_relations:

     {"arguments": [3935837313197648457, 488896451309244507], "roles": ["ya_s2ta_nodes:source", "ya_s2ta_nodes:target"], "label": "s_nodes:Default Rephrase", "score": 0.28063133358955383, "_id": 1239322064654169658}

  2. I also visualized the generated nodesets to check them and, in general, they look fine. However, in test_map0.json there is a YA-node (the one at the top) that does not connect the TA-node to any other node, and I am not sure whether it should be there. [image: nodeset0 gv]
  3. We also have some rev-relations in the output (e.g., test_map8.json). I suppose we should re-reverse them?

ArneBinder commented 2 months ago

Thanks for testing that out! Regarding your points:

  1. We should just discard these instances instead of throwing an error (and maybe log a warning). EDIT: fixed in https://github.com/ArneBinder/dialam-2024-shared-task/pull/31/commits/a36667dd93996da37c50038d055e9662c74d69a5
  2. I'm not really sure where this comes from. Do you have any clue what causes this, or where it gets lost? EDIT: This happens when we predict NONE for an S-node, but not for its anchor YA-node. Fixed in https://github.com/ArneBinder/dialam-2024-shared-task/pull/31/commits/3852b51b7efbece95c37769dc5ce2bbb45077b1a
  3. I totally forgot to implement the reversal of the relations labeled with -rev; this just needs to be done. EDIT: fixed in https://github.com/ArneBinder/dialam-2024-shared-task/pull/31/commits/054da1658aa88feadc2040e3facc14b0f6ef13a7
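
For illustration, points 1 and 3 could look roughly like the sketch below. It is not the code from the linked commits: it works on plain dicts shaped like the nary_relations annotation quoted above, the function names are hypothetical, and it assumes that reversed relations carry a "-rev" suffix on their label and that roles end in ":source" or ":target".

```python
import logging
from typing import Any, Dict, List

logger = logging.getLogger(__name__)

REV_SUFFIX = "-rev"  # assumption: reversed relations are marked by this label suffix


def roles_match_label(relation: Dict[str, Any]) -> bool:
    """Check that every role shares the layer prefix of the label, e.g. 's_nodes:'."""
    prefix = relation["label"].split(":", 1)[0] + ":"
    return all(role.startswith(prefix) for role in relation["roles"])


def drop_inconsistent(relations: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Discard relations with mismatching role prefixes and log a warning for each."""
    kept = []
    for relation in relations:
        if roles_match_label(relation):
            kept.append(relation)
        else:
            logger.warning(
                "discarding relation %r with roles %r",
                relation["label"],
                relation["roles"],
            )
    return kept


def _swap_role(role: str) -> str:
    """Turn '<prefix>:source' into '<prefix>:target' and vice versa."""
    if role.endswith(":source"):
        return role[: -len("source")] + "target"
    if role.endswith(":target"):
        return role[: -len("target")] + "source"
    return role


def undo_reversal(relation: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with the '-rev' suffix removed and source/target roles swapped."""
    label = relation["label"]
    if not label.endswith(REV_SUFFIX):
        return relation
    restored = dict(relation)
    restored["label"] = label[: -len(REV_SUFFIX)]
    restored["roles"] = [_swap_role(role) for role in relation["roles"]]
    return restored
```

Swapping the role names rather than reordering the arguments keeps the sketch valid for relations with more than two arguments.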

ArneBinder commented 2 months ago

Since our output was approved by the organizers for the nodesets from sample_test, this is finally ready.