explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Custom Multilingual Tokenizer #2321

Closed tzano closed 5 years ago

tzano commented 6 years ago

I started working on adding support for the Arabic language in #2314.

A large number of NLP tasks require normalizing the text; for Arabic content this involves several orthographic normalization steps.

To reduce noise and data sparsity when training models, I was thinking of writing some normalization functions that provide different levels of orthographic normalization. I'd like to know how spaCy handles text normalization for different languages, and what the ideal way to include these functions would be. Some of these functions could be used for other languages as well (e.g. Persian).

I was thinking about two choices: (1) code these functions and add them as exceptions under lang/ar/, or (2) leverage the new custom pipeline components. To better illustrate some of the use cases, I have implemented one of them, remove_diacritics, as an extension (a sketch of the pipeline-component alternative follows the snippet below).

import re

from spacy.lang.ar import Arabic
from spacy.tokens import Doc, Token

nlp = Arabic()

tokens = nlp(u'رَمَضَانُ كْرِيمٌ')

# Arabic diacritics (harakat), the tatweel character and the superscript alef.
all_diacritics = u"[\u0640\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652\u0670]"
remove_diacritics = lambda token: re.sub(all_diacritics, '', token.text)

# Register the getter on both Token and Doc (Doc also exposes a .text attribute).
Token.set_extension('without_diacritics', getter=remove_diacritics)
Doc.set_extension('without_diacritics', getter=remove_diacritics)

print([(token.text, token._.without_diacritics) for token in tokens])

assert tokens[0]._.without_diacritics == u"رمضان"
assert tokens[1]._.without_diacritics == u"كريم"
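
For choice (2), here is a minimal sketch of how the same normalization could be packaged as a custom pipeline component rather than a plain getter. This assumes the spaCy 2.x nlp.add_pipe API, and the component and extension names (remove_diacritics_component, norm_no_diacritics) are only illustrative:

import re

from spacy.lang.ar import Arabic
from spacy.tokens import Token

all_diacritics = re.compile(u"[\u0640\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652\u0670]")

# Register the extension with a default value so a pipeline component can fill it in.
Token.set_extension('norm_no_diacritics', default=None)

def remove_diacritics_component(doc):
    # Custom pipeline component: store a diacritics-free form on every token.
    for token in doc:
        token._.norm_no_diacritics = all_diacritics.sub('', token.text)
    return doc

nlp = Arabic()
nlp.add_pipe(remove_diacritics_component, name='remove_diacritics', first=True)

doc = nlp(u'رَمَضَانُ كْرِيمٌ')
print([(token.text, token._.norm_no_diacritics) for token in doc])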
mohamed-okasha commented 6 years ago

Hello Tahar, I have seen the great contribution you made to support the Arabic language in spaCy. I am interested in this, and I would like to know what you are working on now and how I can help.

tzano commented 6 years ago

Thanks @mohamed-okasha. I'd like to add other features for the Arabic language such as text normalization, lemmatization, etc. However, I am hoping that @honnibal can chime in and provide some more information on how spaCy should handle multilingual content, and whether this should be part of spaCy itself or developed as separate extensions.

khaledJabr commented 6 years ago

Hello Tahar,

I have been testing out the tokenizer, and it does not seem to handle punctuation well. Here is an example (from the example sentences) and its corresponding tokenization output:

sentence : 
 "ماهي أبرز التطورات السياسية، الأمنية والاجتماعية في العالم ؟"
output tokens : 
ماهي
أبرز
التطورات
السياسية،
الأمنية
والاجتماعية
في
العالم
؟ 

Here is another, more comprehensive example where it does not deal well with commas, quotation marks, and full stops:

sentence :

'ومثلما يبدو الاحتفال بعيدين في ذات اليوم أمراً فريدا، فإن السجالات التي تدور حول "ثورة 25 يناير" وما تبقى منها، وما تحقق من أهدافها، لا تخلو من الخصوصية والاستثناء. فالبعض يراها تتقدم، وتحقق أهدافها، والبعض الآخر يراها مختطفة، ومجهضة، وتأتي الآراء غالبا، حسب موقف صاحبا من السلطة الحاكمة حاليا. وما بين هذا وذاك، تتجلى مشاهد عالقة في الأذهان، وموثقة بالصورة ومقاطع الفيديو، لا يمكن إنكارها، أو نفي أنها أصدق ما تبقى من هذه الحدث السياسي والاجتماعي الفريد من نوعه. '

tokens output : 
ومثلما
يبدو
الاحتفال
بعيدين
في
ذات
اليوم
أمراً
فريدا،
فإن
السجالات
التي
تدور
حول
"ثورة
25
يناير"
وما
تبقى
منها،
وما
تحقق
من
أهدافها،
لا
تخلو
من
الخصوصية
والاستثناء.
فالبعض
يراها
تتقدم،
وتحقق
أهدافها،
والبعض
الآخر
يراها
مختطفة،
ومجهضة،
وتأتي
الآراء
غالبا،
حسب
موقف
صاحبا
من
السلطة
الحاكمة
حاليا.
وما
بين
هذا
وذاك،
تتجلى
مشاهد
عالقة
في
الأذهان،
وموثقة
بالصورة
ومقاطع
الفيديو،
لا
يمكن
إنكارها،
أو
نفي
أنها
أصدق
ما
تبقى
من
هذه
الحدث
السياسي
والاجتماعي
الفريد
من
نوعه.

I tried looking into it, and it seems that commas, quotation marks, and full stops are already addressed in char_classes.py; however, they are not being handled accordingly. Any ideas what could be going wrong?
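
For reference, one quick way to sanity-check this (assuming spaCy's spacy.util.compile_suffix_regex helper and the Defaults.suffixes attribute apply here) is to compile the default suffix patterns directly and see whether a trailing Arabic comma is matched:

import spacy
from spacy.lang.ar import Arabic

# Compile the default suffix patterns (built from char_classes.py) and check
# whether "،" at the end of a word would be split off as a suffix.
suffix_regex = spacy.util.compile_suffix_regex(Arabic.Defaults.suffixes)
print(suffix_regex.search(u"السياسية،"))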

tzano commented 6 years ago

Hi @khaledJabr

I think you were initialising a tokenizer using only the nlp object's vocab (nlp = Tokenizer(nlp.vocab)), so the tokenization rules were not being used. To apply the Arabic tokenization rules, you can load the Arabic language class with nlp = Arabic() and then simply process the text by calling nlp on it.
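
To make the difference concrete, here is a small sketch contrasting the two (written against the spaCy 2.x API; the variable names are not from this thread):

from spacy.lang.ar import Arabic
from spacy.tokenizer import Tokenizer

text = u"ماهي أبرز التطورات السياسية، الأمنية والاجتماعية في العالم ؟"

nlp = Arabic()

# A bare Tokenizer built from the vocab alone has no prefix/suffix/infix rules,
# so punctuation such as "،" stays attached to the neighbouring word.
bare_tokenizer = Tokenizer(nlp.vocab)
print([token.text for token in bare_tokenizer(text)])

# Calling the Arabic language class uses its own tokenization rules,
# which should split "،" and "؟" off as separate tokens.
print([token.text for token in nlp(text)])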

Moreover, you can filter out stop words and punctuation by checking the token.is_punct and token.is_stop attributes:

from spacy.lang.ar import Arabic

nlp = Arabic()
text = u"ماهي أبرز التطورات السياسية، الأمنية والاجتماعية في العالم ؟"
tokens = nlp(text)

# Drop stop words and punctuation tokens.
results = [token for token in tokens if not token.is_stop and not token.is_punct]

assert [token.text for token in results] == [
    u"ماهي", u"أبرز", u"التطورات", u"السياسية",
    u"الأمنية", u"والاجتماعية", u"العالم",
]

In case you want to normalize Arabic content to handle special cases (removing diacritics, dealing with some inconsistent spelling variations, etc.), you can check out an Arabic custom tokenizer (a spaCy component) that is available as part of Daysam for processing and parsing Arabic text.
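
For illustration only, here is a rough sketch of the kind of orthographic normalization meant here. It is not taken from Daysam; the ar_norm extension name and the exact character mappings are assumptions:

import re

from spacy.lang.ar import Arabic
from spacy.tokens import Token

diacritics = re.compile(u"[\u0640\u064b-\u0652\u0670]")

def normalize_arabic(text):
    # Illustrative normalization: strip diacritics/tatweel and unify common
    # letter variants (alef forms, alef maqsura, teh marbuta).
    text = diacritics.sub('', text)
    text = re.sub(u"[\u0622\u0623\u0625]", u"\u0627", text)  # آ / أ / إ -> ا
    text = text.replace(u"\u0649", u"\u064a")                 # ى -> ي
    text = text.replace(u"\u0629", u"\u0647")                 # ة -> ه
    return text

Token.set_extension('ar_norm', getter=lambda token: normalize_arabic(token.text))

nlp = Arabic()
doc = nlp(u'ذَهَبَتْ إِلَى المَدْرَسَةِ')
print([(token.text, token._.ar_norm) for token in doc])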

khaledJabr commented 6 years ago

That solved it. Thanks!

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.