Convert to Raganato not generating keys

getalp / UFSAC

UFSAC is a resource containing all WordNet Sense Annotated Corpora, and a Java library for manipulating them

MIT License

Convert to Raganato not generating keys #3

Closed (danlou closed this issue 5 years ago)

danlou commented 5 years ago

Hi,

I'm interested in using your scripts to convert MASC to the format used in Raganato's framework, but it seems there is an issue to be resolved.

I'm running the command: sh UFSAC/scripts/convert_to_raganato.sh --input masc.xml --output masc_converted.xml

This generates two files, as expected:

masc_converted.xml.data.xml
masc_converted.xml.gold.key.txt

But the key file is empty, and it doesn't look like the data file contains any key references.

Do you think this can be solved?

Your work in converting all these corpora into the same format, all mapped to WN3.0, is a much appreciated effort btw!

Thanks, Daniel

loic-vial commented 5 years ago

Hi,

Thank you for your interest in our work! You're right, there is a bug when converting a corpus that has no "id=" attributes on its words. I will try to fix it this afternoon by generating an id for documents, sentences and sense-annotated words during the conversion process :)

I'll let you know as soon as it's ready!

loic-vial commented 5 years ago

@danlou The bug is now fixed! I added a "target_X" id to every sense-annotated word during the conversion process (in the end I did not handle documents and sentences; I will only do so if there is a real need).

Please run "git pull", then "./java/compile.sh", and tell me if everything works for you!
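
For readers following along, the idea behind the fix can be sketched in a few lines of Python. This is not the actual UFSAC code (the library is written in Java); it is a minimal illustration that assumes the corpus uses `word` elements carrying a `wn30_key` attribute when they are sense annotated. The function name `add_target_ids`, the counter starting at 0, and the choice to leave existing ids untouched are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def add_target_ids(root, sense_attr="wn30_key"):
    """Give every sense-annotated word without an id a "target_N" id.

    Hypothetical helper: the element name "word", the attribute name
    "wn30_key" and the numbering scheme are assumptions for this sketch.
    Returns the number of ids that were generated.
    """
    counter = 0
    for word in root.iter("word"):
        # Only words that carry a sense annotation need a key entry.
        if word.get(sense_attr) and word.get("id") is None:
            word.set("id", "target_%d" % counter)
            counter += 1
    return counter


if __name__ == "__main__":
    sample = (
        '<corpus><document><paragraph><sentence>'
        '<word surface_form="bank" wn30_key="bank%1:14:00::"/>'
        '<word surface_form="the"/>'
        '</sentence></paragraph></document></corpus>'
    )
    tree_root = ET.fromstring(sample)
    print(add_target_ids(tree_root), "id(s) generated")
```

With an id on every annotated word, a converter has something to write into both the data file and the key file, which is what was missing when the key file came out empty.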

danlou commented 5 years ago

Thanks for solving this issue so fast! It worked.

I've now found a couple of escaping errors in the XML for some characters (e.g. &, <, >), but managed to fix those manually in a couple of seconds (with find/replace all).
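
For anyone hitting the same escaping errors, the manual find/replace can be approximated with a short Python script. This is only a sketch under stated assumptions: attribute values are double-quoted and never contain a literal double quote, and stray `<` or `>` characters occur only inside attribute values (bare `&` is handled anywhere on the line). The function name `repair_line` and the file name in the usage comment are illustrative, not part of UFSAC.

```python
import re

# An ampersand that does not already start an XML entity such as &amp; or &#38;.
BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)")
# A double-quoted attribute value (assumes the value holds no double quote).
ATTR_VALUE = re.compile(r'="([^"]*)"')


def _escape_value(match):
    """Escape &, < and > inside a single attribute value."""
    value = BARE_AMP.sub("&amp;", match.group(1))
    value = value.replace("<", "&lt;").replace(">", "&gt;")
    return '="' + value + '"'


def repair_line(line):
    """Escape special characters in attribute values, then any bare & left in text."""
    line = ATTR_VALUE.sub(_escape_value, line)
    return BARE_AMP.sub("&amp;", line)


if __name__ == "__main__":
    # Example usage on a converted file (path is illustrative):
    #   with open("masc_converted.xml.data.xml", encoding="utf-8") as f:
    #       fixed = [repair_line(l) for l in f]
    print(repair_line('<wf lemma="&" pos=".">&</wf>'))
```

Already escaped entities are left alone, so running the script twice does not double-escape anything.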