Problem with the '-' character between word and postag in oracle txt files

sb-b commented 6 years ago

Hi,

When using the command:

java -jar ParserOracleArcStdWithSwap.jar -t -1 -l 1 -c training.conll > trainingOracle.txt

ParserOracleArcStdWithSwap.jar puts a '-' character between each word and its POS tag in the trainingOracle.txt file. However, in the current version of the UD treebanks, some treebanks have xpos values that themselves contain multiple '-' characters. So, the oracle files look like this:

[][τὰ-DET_l-p---na-, γὰρ-ADV_d--------, πρὸ-ADP_r--------, αὐτῶν-PRON_p-p---ng-, καὶ-CCONJ_c--------, τὰ-DET_l-p---na-,..., ROOT-ROOT]

When these oracle files are parsed by the load_correct_actions and load_correct_actionsDev methods in the c2.h file, the words and their POS tags cannot be extracted correctly.
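To make the ambiguity concrete, here is a minimal Python sketch, not the actual code in c2.h (which is C++ and may split differently). The helper names `split_last` and `split_first` are hypothetical, and the `well-known-ADJ` token is a made-up example; only the first token comes from the oracle line above.

```python
# Illustration only: two naive ways to split an oracle token "word-postag".
# These helpers are hypothetical and are NOT taken from c2.h.

token = "τὰ-DET_l-p---na-"  # token copied from the oracle output above


def split_last(tok, sep="-"):
    """Split at the LAST separator (breaks when the tag contains the separator)."""
    word, _, tag = tok.rpartition(sep)
    return word, tag


def split_first(tok, sep="-"):
    """Split at the FIRST separator (breaks when the word contains the separator)."""
    word, _, tag = tok.partition(sep)
    return word, tag


print(split_last(token))    # ('τὰ-DET_l-p---na', '')  -> tag is lost
print(split_first(token))   # ('τὰ', 'DET_l-p---na-')  -> fine for this token

# Made-up token whose word itself contains a hyphen:
print(split_first("well-known-ADJ"))  # ('well', 'known-ADJ')  -> word is cut
```

Neither rule is safe as long as the same character can appear in the word, in the tag, and as the separator.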

Would it be possible to put another character, such as '#', between words and POS tags when creating the oracle txt files? I tried to replace the '-' character with '#' by decompiling the class files inside ParserOracleArcStdWithSwap.jar, but did not succeed.
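One possible workaround that leaves the jar untouched is to remove the '-' characters from the POS columns of the CoNLL file before generating the oracle. The sketch below is only that, a sketch under assumptions: it assumes tab-separated columns with the POS tags in the 4th and 5th columns (as in CoNLL-X and CoNLL-U), that '#' does not otherwise occur in the tags, and that the parser splits each token at the last '-'. The function name `sanitize_line` and the sample row (head and relation values) are made up for illustration.

```python
# Sketch of a pre-processing step: replace '-' inside the POS columns of a
# CoNLL file so that '-' in the oracle output only separates word and tag.
# ASSUMPTIONS (not confirmed by the lstm-parser sources):
#   - columns are tab-separated, POS tags are in columns 4 and 5 (1-based)
#   - '#' does not already occur in any tag

POS_COLUMNS = (3, 4)   # 0-based indices of the assumed POS columns
PLACEHOLDER = "#"      # character substituted for '-' inside tags


def sanitize_line(line, placeholder=PLACEHOLDER):
    """Return the line with '-' replaced in the POS columns only."""
    # Leave blank lines (sentence breaks) and comment lines unchanged.
    if not line.strip() or line.startswith("#"):
        return line
    cols = line.rstrip("\n").split("\t")
    for i in POS_COLUMNS:
        if i < len(cols):
            cols[i] = cols[i].replace("-", placeholder)
    return "\t".join(cols) + "\n"


# Made-up sample row using the xpos value from the oracle output above.
row = "1\tτὰ\t_\tDET\tl-p---na-\t_\t2\tdet\t_\t_\n"
print(sanitize_line(row), end="")
```

The same substitution would have to be applied to the development and test files as well, so that the tag strings stay consistent. The oracle command above can then be run on the sanitized file in place of training.conll.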

Thank you,

Betul

miguelballesteros commented 6 years ago

Yeah, put another character and the problem is solved.

sb-b commented 6 years ago

I was only asking you to change the '-' character because I thought you had the source code and could easily do this. I couldn't convert the class files to Java files, replace the character, and then build a jar file again without errors. Thank you anyway.

miguelballesteros commented 6 years ago

I don't have the time to do that. I'm sorry. M.

clab / lstm-parser, issue #29