Train a multitask parser

danielhers / tupa

Transition-based UCCA Parser

https://danielhers.github.io/tupa

GNU General Public License v3.0

Train a multitask parser #55

Closed: baozuyi closed this issue 5 years ago

baozuyi commented 5 years ago

Hi. Following the README, I can train the parser on the Wiki sentences, but how do I train the multitask parser proposed in the ACL 2018 paper? Thanks.

danielhers commented 5 years ago

Hi @baozuyi. To train a multitask model, simply pass directories or filenames in multiple formats to the --train argument. The supported formats are .conllu, .sdp and .amr. Let me know if it works or if you have more questions.
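As a rough illustration of the answer above, the sketch below assembles a single command line that hands several paths to --train. Only the --train argument and the three suffixes come from this thread; the `python -m tupa` invocation, the helper names and all paths are assumptions made for the example, so check the README for the exact command and any further options.

```python
import sys
from pathlib import Path

# Suffixes named in this thread as recognized auxiliary formats.
AUX_SUFFIXES = {".conllu", ".sdp", ".amr"}


def group_by_format(files):
    """Group file paths by suffix, keeping only the suffixes listed above."""
    groups = {}
    for p in map(Path, files):
        if p.suffix in AUX_SUFFIXES:
            groups.setdefault(p.suffix, []).append(str(p))
    return groups


def build_train_command(main_paths, aux_files):
    """Return an argv list that passes every path to one --train argument.

    Assumption: TUPA is started as `python -m tupa`; this helper is
    hypothetical and not part of TUPA itself.
    """
    aux = [f for found in group_by_format(aux_files).values() for f in found]
    return [sys.executable, "-m", "tupa", "--train", *main_paths, *aux]


if __name__ == "__main__":
    cmd = build_train_command(
        ["data/ucca-train"],                      # hypothetical directory
        ["data/ud/en.conllu", "data/notes.txt"],  # hypothetical files
    )
    # Print the command for inspection; run it yourself once it looks right.
    print(" ".join(cmd))
```

The helper drops `data/notes.txt` because its suffix is not one of the three listed formats, which mirrors the suffix-based detection described later in this thread.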

danielhers commented 5 years ago

Here is where you can get data for other formats:

.conllu: https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2895
.sdp: http://sdp.delph-in.net/osdp-12.tgz
.amr: https://amr.isi.edu/download.html

baozuyi commented 5 years ago

@danielhers Thanks, it works well.

BTW, the training seems to be time-consuming. How long will it take?

danielhers commented 5 years ago

It depends on your hardware, but with the full training data for all tasks it can take 1 to 3 weeks to reach the best dev score.

baozuyi commented 5 years ago

Thanks @danielhers

mcsuy commented 5 years ago

Hi. I have a question about the AMR files. Are the data files really meant to have the .amr extension? I ask because the Little Prince corpus I downloaded came as a .txt file.

danielhers commented 5 years ago

You can just rename it to .amr and it will be recognized as AMR format. Since the format is recognized by the suffix, there's no way to distinguish between an AMR file and a plain text file otherwise...
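A minimal sketch of that renaming step, assuming you would rather keep the downloaded file and work on a copy. The helper name and the file name are hypothetical, and the file contents in the demo are placeholders rather than real corpus data.

```python
import shutil
import tempfile
from pathlib import Path


def copy_as_amr(src):
    """Copy `src` next to itself with an .amr suffix and return the new path."""
    src = Path(src)
    dst = src.with_suffix(".amr")
    shutil.copyfile(src, dst)  # the original .txt file is left untouched
    return dst


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        txt = Path(tmp) / "little-prince.txt"  # hypothetical file name
        txt.write_text("# ::snt placeholder\n(p / placeholder)\n")
        print(copy_as_amr(txt).name)
```

Copying instead of renaming in place keeps the download intact in case the file is needed again under its original name.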