coli-saar / am-parser

Modular implementation of an AM dependency parser in AllenNLP.
Apache License 2.0

UCCA postprocessing: detokenization #41

Closed: namednil closed this issue 5 years ago

namednil commented 5 years ago

I can't run the UCCA postprocessing script because it cannot find the module nltk.tokenize.moses, even though I'm using the latest nltk version. Does it have to do with this: https://github.com/lyeoni/nlp-tutorial/issues/2?
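(For context, the Moses tokenizer/detokenizer was removed from newer NLTK releases and the same code now lives in the standalone sacremoses package. A minimal sketch of the replacement import, assuming sacremoses is installed, not the script's actual code:)

```python
# Hypothetical workaround: `pip install sacremoses` and import the detokenizer
# from there instead of the removed nltk.tokenize.moses module.
from sacremoses import MosesDetokenizer

detok = MosesDetokenizer(lang="en")
tokens = ["Hello", ",", "world", "!"]
print(detok.detokenize(tokens))  # reattaches punctuation: "Hello, world!"
```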

Also, do we really need "detokenization"? We carry around the input string.

mariomgmn commented 5 years ago

Possibly. But yes, you're right: after yesterday's modification we don't really need the detokenisation. I see in another issue that you already have output. Did you modify the script so it doesn't import the detokeniser?

namednil commented 5 years ago

I just installed the moses tokenization package from the source that was referenced in the issue I linked.

mariomgmn commented 5 years ago

Okay. I'll pull, modify the code to remove the detokeniser, and push that.

alexanderkoller commented 5 years ago

Hi, can someone elaborate on this? We should be very cautious about using NLTK in our code because it is not on the whitelist.

mariomgmn commented 5 years ago

Hi, yes. I imported an NLTK detokenizer in one of the scripts. However, we're not currently using it, or the code that depends on it, since we carry the input string around anyway. I'll remove the code that uses it to avoid any confusion.

alexanderkoller commented 5 years ago

Yes, please do and then close this issue.

Is this what causes #50?

namednil commented 5 years ago

I just wanted to mention that using nltk should be fine as long as we don't use anything that is pretrained or any nltk corpus: the whitelist only imposes "constraints on which third-party data or pre-trained models can be used in addition to the resources distributed by the task organizers."

mariomgmn commented 5 years ago

> Is this what causes #50?

No, we're not using the detokenizer any more. We pull the input string from the mrp files as-is, take the tokenization from the companion data, and use both in the Alto corpus.
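(A minimal sketch of what "pulling the input as-is" amounts to, assuming the shared-task format of MRP graphs as JSON Lines with "id" and "input" fields; this is not the repository's actual code:)

```python
# Hypothetical reader: recover the raw, untokenized sentence directly from the
# .mrp file, so no detokenization step is needed downstream.
import json

def read_raw_inputs(mrp_path):
    """Return a dict mapping graph id -> untokenized input string."""
    inputs = {}
    with open(mrp_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                graph = json.loads(line)
                inputs[graph["id"]] = graph["input"]
    return inputs
```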