How can you do the tokenization?

CAMeL-Lab / camel_tools

A suite of Arabic natural language processing tools developed by the CAMeL Lab at New York University Abu Dhabi.

MIT License


How can you do the tokenization? #10

Closed abdulrahimq closed 4 years ago

abdulrahimq commented 4 years ago

I have been reading the code and I'm not sure where to go from here.

    db = camel_tools.calima_star.database.CalimaStarDB.builtin_db()
    analyzer = camel_tools.calima_star.analyzer.CalimaStarAnalyzer(db)
    disambiguator = camel_tools.disambig.mle.MLEDisambiguator(analyzer)
    morph = MorphologicalTokenizer(disambiguator)
    morph.tokenize('ذهب الولد')

This gives me:

    ['ذ', 'ه', 'بِ', ' ', 'أَ', 'لِ', 'وَ', 'لِ', 'د', ' ']

I'm not sure what I am doing wrong and would appreciate some guidance.

Also, the documentation says "The tokenization scheme to use. Defaults to 'atbtok'.", but when I go to this page https://camel-tools.readthedocs.io/en/v0.3.dev0/reference/calima_star_features.html to check which tokenization scheme is used, I can't find it, so I am not sure what atbtok means here.

owo commented 4 years ago

Hi @abdulrahimq,

The morphological tokenizer takes a list of words as input as documented here.

So you should probably be doing morph.tokenize(['ذهب', 'الولد']).

If your text hasn't been white-space and punctuation separated yet, you can use the simple_word_tokenize function to do that.
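To illustrate why the original call returned one item per character, here is a minimal standard-library sketch. It does not use camel_tools: `toy_tokenize` and `split_words` are hypothetical stand-ins for a tokenizer that expects a list of words and for a whitespace/punctuation splitter, and they are not the library's implementation.

```python
import re


def toy_tokenize(words):
    """Hypothetical stand-in for a tokenizer that expects a list of words."""
    # Iterating over the input yields one item per element. If the input is
    # a plain string, each element is a single character.
    return [w for w in words]


def split_words(text):
    """Rough stand-in for whitespace/punctuation splitting (an assumption,
    not how camel_tools does it; diacritic marks are not handled)."""
    # \w+ matches runs of word characters (Unicode-aware for str patterns);
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)


sentence = "ذهب الولد"

as_string = toy_tokenize(sentence)             # one item per character
as_list = toy_tokenize(split_words(sentence))  # one item per word

print(len(as_string), len(as_list))
```

Passing the string directly makes the loop walk over characters, which matches the shape of the output reported above. Splitting into words first gives the tokenizer one element per word.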

The tokenization scheme determines how clitics are separated from the original word. In this case, 'atbtok' (or ATB tokenization) means that we will produce tokenization based on the Penn Arabic Treebank's tokenization scheme. We have other tokenization schemes that can be used and we will be documenting those soon, but for now 'd3tok' (D3 tokenization) will give you a tangible result for the example you provided.

Also, please make sure you're using the current release documentation, the link you provided is an older version.

I hope that helps.

abdulrahimq commented 4 years ago

Yeah, that was really helpful. Sorry for missing the last part, but this library is not straightforward to use.

I have another question. I used the following sentence, which appears in Nizar's MADA+TOKAN paper, but the result for the word جولته is not the same as in the paper.

    morph = MorphologicalTokenizer(disambiguator, 'd3tok')
    morph.tokenize(["تركيا" , "الى " ,"بزيارة" , "جولته" ,"الرئيس" ,"وسينهى"])

    ['تُرْكِيّاً' , 'إِلَى' , 'بِ+_زِيارَة' , 'جَوْل' , 'ال+_رَئِيس' , 'وَ+_سَ+_يُنْهِي']

Thank you for your help! Also, if there is something I can help with here, such as writing a small tutorial on how to use the library, please let me know.

owo commented 4 years ago

> I have another question. Here I used this sentence which is used in Nizar's paper MADA+TOKAN but the result for the word جولته are not the same as the paper.
>
>     morhph = MorphologicalTokenizer(disambiguator, 'd3tok')
>     morhph.tokenize(["تركيا" , "الى " ,"بزيارة" , "جولته" ,"الرئيس" ,"وسينهى"])
>
>     ['تُرْكِيّاً' , 'إِلَى' , 'بِ+_زِيارَة' , 'جَوْل' , 'ال+_رَئِيس' , 'وَ+_سَ+_يُنْهِي']

Yes, that's a bug in the analyzer database. We will fix it soon and let you know. Thanks for letting us know.

owo commented 4 years ago

Hi @abdulrahimq,

I just uploaded a fix to the analyzer that should resolve the tokenization issues you were having. To update to the latest version, just do pip install --upgrade camel-tools.

Also note that MorphologicalTokenizer takes an additional argument called diac that determines whether the output tokens are diacritized or not. By default, this is set to False, which returns undiacritized tokens (which, as far as I know, is the general case for tokenization).

To get diacritized output as before, just write morph = MorphologicalTokenizer(disambiguator, 'd3tok', diac=True).
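To show what "undiacritized" means for a token, here is a minimal standard-library sketch. It does not call camel_tools: `strip_diacritics` is a hypothetical helper, and treating the range U+064B to U+0652 as the set of diacritics is an assumption that may be narrower than what the library removes.

```python
import re

# Assumption: the common Arabic short-vowel and related marks
# (fathatan through sukun, U+064B..U+0652) are what count as diacritics.
ARABIC_DIACRITICS = re.compile("[\u064b-\u0652]")


def strip_diacritics(token):
    """Hypothetical helper: remove diacritic marks from a single token."""
    return ARABIC_DIACRITICS.sub("", token)


# Diacritized tokens taken from the output earlier in this thread.
diacritized = ['بِ+_زِيارَة', 'ال+_رَئِيس']
undiacritized = [strip_diacritics(t) for t in diacritized]

print(undiacritized)
```

The clitic markers (`+_`) are left untouched, and only the vowel marks are dropped. With diac=False the tokenizer returns tokens of this undiacritized kind directly, so no post-processing like this should be needed.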

Also, keep in mind that, at the moment, the underlying model for choosing the right tokenization is very basic and not that accurate. We will be adding better models in the near future so please keep an eye out for any updates.

I hope that helps.

abdulrahimq commented 4 years ago

Hi @owo

Thank you for taking a look at this. It's good to know about the diac option. This is helpful, and I'll make sure to keep checking for updates to this library.

Best, Abed