NLP-CISUC / NLPyPort


file instead of text #2

Open tattoodobem opened 5 years ago

tattoodobem commented 5 years ago

Another question: why am I forced to give a file instead of a text? Imagine I get the text from a web service; I want to feed it to the pipeline without first having to save it to a file. Also, it seems I can't tokenize by sentence and then tokenize each sentence separately. Do you think the NLTK sentence tokenizer works well for Portuguese?

jdportugal commented 5 years ago

Hi, thanks for your question. I added parts of a more recent version that is still being tested, which includes the methods needed for you to feed a string (or text) or a list of tokens to the pipeline. You just have to comment out line 133 and uncomment lines 140 and 141 in FullPipeline.py.

I don't quite understand what you mean by:

Also, it seems I can't tokenize by sentence and then tokenize each sentence separately.

Can you explain better?

Regarding the results of the NLTK tokenizer for Portuguese: in the article mentioned in the README (http://drops.dagstuhl.de/opus/volltexte/2019/10885/), we found that base NLTK, compared against a manually annotated text, reached about 83% per-token accuracy, and after adding the additional NLPyPort modules it reached 90%. Currently, due to changes made after that article, it is at about 92%, so we believe the NLPyPort tokenizer is good for tokenization.

tattoodobem commented 5 years ago

First of all, thank you for your hard work and for taking the time to answer. What I mean with

Also, it seems I can't tokenize by sentence and then tokenize each sentence separately.

is that NLTK has both a word and a sentence tokenizer. It would be nice to be able to tokenize by sentence when one needs it, and I'd welcome any opinion on how to implement that. I used the one from their documentation:

NLTK's data collection includes a trained model for Portuguese sentence segmentation, which can be loaded as follows. It is faster to load a trained model than to retrain it.

sent_tokenizer = nltk.data.load('tokenizers/punkt/portuguese.pickle')
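
For example, combining that Punkt model with NLTK's word tokenizer gives the sentence-then-word tokenization I was describing. This is only a minimal sketch (the sample text is made up):

```python
import nltk

nltk.download('punkt')  # only needed once, fetches the Punkt models

# Pre-trained Punkt model for Portuguese sentence segmentation
sent_tokenizer = nltk.data.load('tokenizers/punkt/portuguese.pickle')

text = "O João comprou um carro novo. Depois foi à praia com a Maria."

# First split the text into sentences, then tokenize each sentence into words
for sentence in sent_tokenizer.tokenize(text):
    tokens = nltk.word_tokenize(sentence, language='portuguese')
    print(tokens)
```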

And if you allow me: do you think you could obtain better results with a better corpus? Are there better corpora?

jdportugal commented 5 years ago

Hi, in the tokenisation phase no other corpora were tested, so there may well be better ones. Still, the results obtained on Bosque, the manually annotated corpus used for testing, reach 94% if you rejoin entities and feed them back to the tokenizer, so the results are already quite good.

However, if you run tests with other corpora and get better results, please share them and I'll change the pipeline.

igorkf commented 4 years ago

First, thank you very much for this project! I'm having trouble doing lemmatization in Portuguese.

I changed the lines as you said (I forked the repo and changed FullPipeline.py), but now I get this error: NameError: name 'Text' is not defined

So where is Text() created? I can't figure it out.

jdportugal commented 4 years ago

Hi, a lot of changes have been made since this post, and there were many updates to the pipeline; that is why you are getting that error. I would recommend installing the pipeline via PyPI (https://pypi.org/project/NLPyPort/), since that is the most recent version and its documentation covers how to fix most of the errors people have been running into. Give it a go, and if you still have problems, say something!
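
For reference, the PyPI route is the usual pip install plus an import. The snippet below is only a rough sketch: the entry-point name new_full_pipe is taken from the PyPI page as remembered, and "exemplo.txt" is just a placeholder path, so check the PyPI documentation for the exact API and options.

```python
# pip install NLPyPort
#
# Rough sketch only: new_full_pipe is the entry point as remembered from the
# PyPI documentation, and "exemplo.txt" is just a placeholder input path.
from NLPyPort.FullPipeline import new_full_pipe

result = new_full_pipe("exemplo.txt")  # runs the full annotation pipeline on the file
print(result)
```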