Some operations/functions usage need more clarification please

Sue-Fwl commented 4 years ago

Greetings,

I appreciate the effort that went into such a grand project and into making it accessible to other researchers.

I'm currently trying out the tools to clean my data and extract features from it, and I'm having trouble with some of the functionality because its documentation isn't published yet or isn't clear enough.

"camel_arclean" I couldn't find its class or the way to invoke it.

Utility arclean: cleans Arabic text by

- Deleting characters that are not in Arabic, ASCII, or Latin-1.
- Converting all spacing characters to an ASCII space character.
- Converting Indic digits into Arabic digits.
- Converting extended Arabic letters into basic Arabic letters.
- Converting 1-char presentation forms into simple basic forms.

"dialectid " I've tried to run the example provided but I'm getting this error

from camel_tools.dialectid import DialectIdentifier

did = DialectIdentifier.pretrained()

sentences = [
    'مال الهوى و مالي شكون اللي جابني ليك  ما كنت انايا ف حالي بلاو قلبي يانا بيك',
    'بدي دوب قلي قلي بجنون بحبك انا مجنون ما بنسى حبك يوم'
]

predictions = did.predict(sentences)
top_dialects = [p.top for p in predictions]

File "Anaconda3\lib\site-packages\camel_tools\dialectid\__init__.py", line 34, in <module>
    import kenlm

ModuleNotFoundError: No module named 'kenlm'

"CalimaStarAnalyzer" I'm getting POS=noun_prop for all words and never getting a stem. I'm relying on the first list in the list of lists that the functions return, although I checked the rest and didn't find a correct analysis there either. I tried it on my own data and on the example provided but couldn't figure out what's wrong. For example, analyzing the verb 'مشيت' gives a number of possible tags, but none of them is 'verb'.

text = 'مشيت في الشارع' #example provided in doc
text2 = 'مقتل ضابط وجندي إسرائيليين في عملية دهس بالضفة الغربية'
from camel_tools.calima_star.database import CalimaStarDB
from camel_tools.calima_star.analyzer import CalimaStarAnalyzer

db = CalimaStarDB('E:\\Anaconda3\\Lib\\site-packages\\camel_tools\\calima_star\\databases\\calima-msa-1.0.db', 'a')
# Create analyzer with no backoff
analyzer = CalimaStarAnalyzer(db)
# Create analyzer with NOAN_ALL backoff
#analyzer = CalimaStarAnalyzer(db, 'NOAN_ALL')
# or
analyzer = CalimaStarAnalyzer(db, backoff='NOAN_ALL')

# To analyze a word, we can use the analyze() method
analyses1 = analyzer.analyze_words(text.split())
analyses = analyzer.analyze('مقتل') # All results=مقتل/NOUN_PROP

A snippet of returned analysis

{'diac': 'مقتل',
 'lex': 'مقتل_0',
 'bw': 'مقتل/NOUN_PROP',
 'gloss': 'NO_ANALYSIS',
 'pos': 'noun_prop',
 'prc3': '0',
 'prc2': '0',
 'prc1': '0',
 'prc0': '0',
 'per': 'na',
 'asp': 'na',
 'vox': 'na',
 'mod': 'na',
 'gen': 'm',
 'num': 's',
 'stt': 'd',
 'cas': 'u',
 'enc0': '0',
 'rat': 'i',
 'source': 'backoff',
 'form_gen': 'm',
 'form_num': 's',
 'catib6': '+NOM+',
 'ud': '+PROPN+',
 'pos_freq': -1.047404,
 'pos_lex_freq': -99.0,
 'lex_freq': -99.0,
 'root': '',
 'pattern': '',
 'caphi': 'm_q_t_l',
 'atbtok': 'مقتل',
 'd2tok': 'مقتل',
 'd1tok': 'مقتل',
 'atbseg': 'مقتل',
 'd3tok': 'مقتل',
 'd3seg': 'مقتل',
 'd2seg': 'مقتل',
 'd1seg': 'مقتل',
 'stem': 'مقتل',
 'stemgloss': 'NO_ANALYSIS',
 'stemcat': 'N0'}

"Generate lemma and features (CalimaStarReinflector)" I couldn't find the lemma db file, and it wasn't clear how to construct the features dictionary.

"CalimaStarGenerator" Same issue as above.

" Morphological Analyzer " I'm not getting any analysis results, and the morphological tokenizer's 'tokenize' gives the same results as 'simple_word_tokenize' in tokenizers.

from camel_tools.tokenizers import morphological
from camel_tools.disambig.mle import MLEDisambiguator
from camel_tools.calima_star.analyzer import CalimaStarAnalyzer
from camel_tools.calima_star.database import CalimaStarDB

# Initialize database in reinflection mode
db_disa = CalimaStarDB('E:\\Anaconda3\\Lib\\site-packages\\camel_tools\\calima_star\\databases\\morphology_db\\almor-msa-ext\\morphology.db','r')
disa = MLEDisambiguator(CalimaStarAnalyzer(db_disa, backoff='NONE', norm_map='<camel_tools.utils.charmap.CharMapper object>', strict_digit=False, cache_size=0), mle_path=None)

disa_sentence = disa.disambiguate(text_token)#,top=1)

disa_word = disa.disambiguate_word(text_token, word_ndx =0) #,top=1)

res_morph = morphological.MorphologicalTokenizer(disa, scheme='atbtok', split=True, diac=False) #res_morph.scheme_set() #{'atbtok', 'd3tok'}

tokenized_morph = res_morph.tokenize(text_token)  #

text_token = ['مقتل',
 'ضابط',
 'وجندي',
 'إسرائيليين',
 'في',
 'عملية',
 'دهس',
 'بالضفة',
 'الغربية']

DisambiguatedWord(word='مقتل', analyses=[]),
 DisambiguatedWord(word='ضابط', analyses=[]),
 DisambiguatedWord(word='و', analyses=[]),
 DisambiguatedWord(word='جندي', analyses=[]),
 DisambiguatedWord(word='إسرائيليين', analyses=[]),
 DisambiguatedWord(word='في', analyses=[]),
 DisambiguatedWord(word='عملية', analyses=[]),
 DisambiguatedWord(word='دهس', analyses=[]),
 DisambiguatedWord(word='بالضفة', analyses=[]),
 DisambiguatedWord(word='الغربية', analyses=[])]

" MLEDisambiguator "

from camel_tools.disambig.mle import MLEDisambiguator

mle = MLEDisambiguator.pretrained()

sentence = 'الطفلان أكلا الطعام معاً وأخذا 5 تفاحات'.split()
disambig = mle.disambiguate(sentence)

# Let's, for example, use the top disambiguations to generate a diacritized
# version of the above sentence.
# Note that, in practice, you'll need to make sure that each word has a
# non-zero list of analyses.
diacritized = [d.analyses[0].analysis['diac'] for d in disambig]
print(' '.join(diacritized))

I'm getting results for some nouns, but so far I've had no luck with POS or other features such as form_num, gen, and mod when it comes to plurals, words with an attached pronoun, verbs, etc.

#print
الطفلان اكلا الطَعامِ مَعاً واخذا 5 تفاحات

#Analysis
[DisambiguatedWord(word='الطفلان', analyses=[ScoredAnalysis(score=1.0, analysis={'diac': 'الطفلان', 'lex': 'الطفلان_0', 'bw': 'الطفلان/NOUN_PROP', 'gloss': 'NO_ANALYSIS', 'pos': 'noun_prop', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'stt': 'i', 'cas': 'u', 'enc0': '0', 'rat': 'i', 'source': 'backoff', 'form_gen': '-', 'form_num': '-', 'gen': '-', 'ud': '+PROPN+', 'catib6': '+NOM+', 'pos_lex_freq': -99.0, 'num': '-', 'pos_freq': -99.0, 'lex_freq': -99.0, 'caphi': '2_l_t._f_l_aa_n', 'atbseg': 'NOAN', 'd3seg': 'NOAN', 'd2tok': 'NOAN', 'root': 'O', 'pattern': 'N1AN', 'd2seg': 'NOAN', 'atbtok': 'NOAN', 'd1tok': 'NOAN', 'd3tok': 'NOAN', 'd1seg': 'NOAN', 'stem': 'الطفلان', 'stemgloss': 'NO_ANALYSIS', 'stemcat': 'N0'})]),
 DisambiguatedWord(word='أكلا', analyses=[ScoredAnalysis(score=1.0, analysis={'diac': 'اكلا', 'lex': 'اكلا_0', 'bw': 'اكلا/NOUN_PROP', 'gloss': 'NO_ANALYSIS', 'pos': 'noun_prop', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'stt': 'i', 'cas': 'u', 'enc0': '0', 'rat': 'i', 'source': 'backoff', 'form_gen': '-', 'form_num': '-', 'gen': '-', 'ud': '+PROPN+', 'catib6': '+NOM+', 'pos_lex_freq': -99.0, 'num': '-', 'pos_freq': -99.0, 'lex_freq': -99.0, 'caphi': '2_k_l_aa', 'atbseg': 'NOAN', 'd3seg': 'NOAN', 'd2tok': 'NOAN', 'root': 'O', 'pattern': 'N1AN', 'd2seg': 'NOAN', 'atbtok': 'NOAN', 'd1tok': 'NOAN', 'd3tok': 'NOAN', 'd1seg': 'NOAN', 'stem': 'اكلا', 'stemgloss': 'NO_ANALYSIS', 'stemcat': 'N0'})]),
 DisambiguatedWord(word='الطعام', analyses=[ScoredAnalysis(score=1.0, analysis={'diac': 'الطَعامِ', 'lex': 'طَعام_1', 'bw': 'ال/DET+طَعام/NOUN+ِ/CASE_DEF_GEN', 'gloss': 'the+food+[def.gen.]', 'pos': 'noun', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': 'Al_det', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'form_gen': 'm', 'gen': 'm', 'form_num': 's', 'num': 's', 'stt': 'd', 'cas': 'g', 'enc0': '0', 'rat': 'i', 'source': 'lex', 'stem': 'طَعام', 'stemcat': 'N', 'stemgloss': 'food', 'caphi': '2_a_t._t._a_3_aa_m_i', 'catib6': 'PRT+NOM+', 'ud': 'DET+NOUN+', 'root': 'ط.ع.م', 'pattern': 'ال1َ2ا3ِ', 'd3seg': 'ال+_طَعامِ', 'atbseg': 'الطَعامِ', 'd2seg': 'الطَعامِ', 'd1seg': 'الطَعامِ', 'd1tok': 'الطَّعامِ', 'd2tok': 'الطَّعامِ', 'atbtok': 'الطَّعامِ', 'd3tok': 'ال+_طَعامِ', 'pos_freq': '-0.4344233', 'lex_freq': '-4.660188', 'pos_lex_freq': '-4.660188'})]),
 DisambiguatedWord(word='معاً', analyses=[ScoredAnalysis(score=1.0, analysis={'diac': 'مَعاً', 'lex': 'مَعاً_1', 'bw': 'مَعاً/ADV', 'gloss': 'together', 'pos': 'adv', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'form_gen': '-', 'gen': '-', 'form_num': '-', 'num': '-', 'stt': 'i', 'cas': 'u', 'enc0': '0', 'rat': 'y', 'source': 'lex', 'stem': 'مَعاً', 'stemcat': 'FW-Wa', 'stemgloss': 'together', 'caphi': 'm_a_3_a_n', 'catib6': '++', 'ud': '++', 'root': 'مع', 'pattern': '1َ2اً', 'd3seg': 'مَعاً', 'atbseg': 'مَعاً', 'd2seg': 'مَعاً', 'd1seg': 'مَعاً', 'd1tok': 'مَعاً', 'd2tok': 'مَعاً', 'atbtok': 'مَعاً', 'd3tok': 'مَعاً', 'pos_freq': '-99.0', 'lex_freq': '-99.0', 'pos_lex_freq': '-99.0'})]),
 DisambiguatedWord(word='وأخذا', analyses=[ScoredAnalysis(score=1.0, analysis={'diac': 'واخذا', 'lex': 'واخذا_0', 'bw': 'واخذا/NOUN_PROP', 'gloss': 'NO_ANALYSIS', 'pos': 'noun_prop', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'stt': 'i', 'cas': 'u', 'enc0': '0', 'rat': 'i', 'source': 'backoff', 'form_gen': '-', 'form_num': '-', 'gen': '-', 'ud': '+PROPN+', 'catib6': '+NOM+', 'pos_lex_freq': -99.0, 'num': '-', 'pos_freq': -99.0, 'lex_freq': -99.0, 'caphi': 'w_aa_kh_dh_aa', 'atbseg': 'NOAN', 'd3seg': 'NOAN', 'd2tok': 'NOAN', 'root': 'O', 'pattern': 'N1AN', 'd2seg': 'NOAN', 'atbtok': 'NOAN', 'd1tok': 'NOAN', 'd3tok': 'NOAN', 'd1seg': 'NOAN', 'stem': 'واخذا', 'stemgloss': 'NO_ANALYSIS', 'stemcat': 'N0'})]),
 DisambiguatedWord(word='5', analyses=[ScoredAnalysis(score=1.0, analysis={'pos': 'digit', 'diac': '5', 'lex': '5_0', 'bw': '5/NOUN_NUM', 'gloss': '5', 'prc3': 'na', 'prc2': 'na', 'prc1': 'na', 'prc0': 'na', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'gen': 'na', 'num': 'na', 'stt': 'na', 'cas': 'na', 'enc0': 'na', 'rat': 'na', 'source': 'digit', 'form_gen': 'na', 'form_num': 'na', 'catib6': 'NOM', 'ud': 'NUM', 'd3seg': '5', 'atbseg': '5', 'd2seg': '5', 'd1seg': '5', 'd1tok': '5', 'd2tok': '5', 'atbtok': '5', 'd3tok': '5', 'pos_freq': -99.0, 'pos_lex_freq': -99.0, 'lex_freq': -99.0, 'root': 'DIGIT', 'pattern': 'DIGIT', 'caphi': 'DIGIT', 'stem': '5', 'stemgloss': '5', 'stemcat': None})]),
 DisambiguatedWord(word='تفاحات', analyses=[ScoredAnalysis(score=1.0, analysis={'diac': 'تفاحات', 'lex': 'تفاحات_0', 'bw': 'تفاحات/NOUN_PROP', 'gloss': 'NO_ANALYSIS', 'pos': 'noun_prop', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'stt': 'i', 'cas': 'u', 'enc0': '0', 'rat': 'i', 'source': 'backoff', 'form_gen': '-', 'form_num': '-', 'gen': '-', 'ud': '+PROPN+', 'catib6': '+NOM+', 'pos_lex_freq': -99.0, 'num': '-', 'pos_freq': -99.0, 'lex_freq': -99.0, 'caphi': 't_f_aa_7_aa_t', 'atbseg': 'NOAN', 'd3seg': 'NOAN', 'd2tok': 'NOAN', 'root': 'O', 'pattern': 'N1AN', 'd2seg': 'NOAN', 'atbtok': 'NOAN', 'd1tok': 'NOAN', 'd3tok': 'NOAN', 'd1seg': 'NOAN', 'stem': 'تفاحات', 'stemgloss': 'NO_ANALYSIS', 'stemcat': 'N0'})])]

owo commented 4 years ago

Hi @Sue-Fwl,

Here are the answers to your queries:

"camel_arclean" I couldn't find its class or the way to invoke it

This just uses a CharMapper instance with the 'arclean' scheme. You can instantiate one as follows:

from camel_tools.utils.charmap import CharMapper

arclean = CharMapper.builtin_mapper('arclean')
noisy_text = '...'  # This is the noisy text we wish to clean
cleaned_text = arclean(noisy_text)
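
To give a feel for the kind of character mapping described for 'arclean', here is a small pure-Python sketch of two of the listed steps: converting spacing characters to an ASCII space and converting Indic digits to Arabic digits. This is only an illustration and not how the library implements it. The function name normalize_spaces_and_digits is made up, and the sketch assumes that "spacing characters" means Unicode space separators (category Zs).

```python
import unicodedata

# Arabic-Indic digits (U+0660..U+0669) mapped to the ASCII digits 0-9.
_INDIC_TO_ASCII = {0x0660 + i: ord('0') + i for i in range(10)}


def normalize_spaces_and_digits(text):
    """Illustrative only: mimics two of the steps listed for 'arclean'."""
    # Step 1: map Arabic-Indic digits to ASCII digits.
    text = text.translate(_INDIC_TO_ASCII)
    # Step 2: replace every Unicode space separator (category Zs) with ' '.
    return ''.join(
        ' ' if unicodedata.category(ch) == 'Zs' else ch
        for ch in text
    )


# A no-break space (U+00A0) and an Arabic-Indic five (U+0665) are normalized.
print(normalize_spaces_and_digits('\u0665\u00a0تفاحات'))
```

For real cleaning, the CharMapper instance above is the thing to use, since it also covers the other steps (dropping unsupported characters, mapping extended letters and presentation forms).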

"dialectid " I've tried to run the example provided but I'm getting this error

So there are a couple of issues here:

- From the paths in the error message, it seems you are using camel-tools on Windows. Sadly, one of the dependencies used by the Dialect ID component (kenlm) can't be installed on a "plain" Windows setup. See my response here for more info.
- If you are using a setup where kenlm can be installed, there is an existing issue where the pre-trained models can't be loaded. I'm working on solving that at the moment.
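
If you want your script to keep working on machines where kenlm is missing, one option is to check for the module before importing the Dialect ID component. The sketch below uses only the standard library for the check; module_available and load_dialect_identifier are hypothetical helper names, and DialectIdentifier.pretrained() is the call from the example in the original post.

```python
import importlib.util


def module_available(name):
    """Return True if the top-level module `name` can be imported."""
    return importlib.util.find_spec(name) is not None


def load_dialect_identifier():
    """Return a pretrained DialectIdentifier, or None if kenlm is missing."""
    if not module_available('kenlm'):
        return None
    # Imported lazily so the rest of the script works without kenlm.
    from camel_tools.dialectid import DialectIdentifier
    return DialectIdentifier.pretrained()


if not module_available('kenlm'):
    print('kenlm is not installed; dialect identification is unavailable.')
```

Calling load_dialect_identifier() and checking for None lets the other components (NER, transliteration, and so on) run even when Dialect ID cannot.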

Regarding issues with CalimaStarAnalyzer, CalimaStarReinflector and CalimaStarGenerator, I'm a bit confused about your setup.

Can you tell me how you installed camel-tools? Did you pip/conda install it or did you install it from source from the master branch?

When you mention the following line:

db = CalimaStarDB('E:\\Anaconda3\\Lib\\site-packages\\camel_tools\\calima_star\\databases\\calima-msa-1.0.db', 'a')

Where did you get this database from?

If you installed via pip/conda, only one database is provided. It is part of the package itself and would be located at E:\Anaconda3\Lib\site-packages\camel_tools\calima_star\databases\almor-msa\almor-msa-r13.db.

On the other hand, if you installed from source from the master branch, as per the README, the data should be in a separate directory from the python package.

Assuming you installed the data at ~\AppData\Roaming\camel_tools\, then the default database should be located at ~\AppData\Roaming\camel_tools\data\morphology_db\almor_msa_ext\morphology.db.
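
A quick way to see which of these two locations actually holds a database on your machine is to test the paths directly. This is a minimal sketch using only pathlib; first_existing is a hypothetical helper, and the two candidate paths are simply the ones mentioned above, so adjust them to your own install.

```python
from pathlib import Path


def first_existing(candidates):
    """Return the first candidate that is an existing file, or None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


# Candidate locations for the two install scenarios described above.
# They are examples only and are not guaranteed to exist on your machine.
CANDIDATES = [
    Path(r'E:\Anaconda3\Lib\site-packages\camel_tools\calima_star'
         r'\databases\almor-msa\almor-msa-r13.db'),
    Path.home() / 'AppData' / 'Roaming' / 'camel_tools' / 'data'
    / 'morphology_db' / 'almor_msa_ext' / 'morphology.db',
]

db_path = first_existing(CANDIDATES)
if db_path is None:
    print('No database file found at the expected locations.')
else:
    print('Found database at:', db_path)
```

If neither path exists, that would suggest the data was not installed where the package expects it.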

As for issues with MLEDisambiguator there are a couple of things:

First, I just found an issue in the MLEDisambiguator code where the database file wasn't being loaded correctly. I'll fix this and update the repo ASAP. This accounts for one of the disambiguation issues you have, where everything is being analyzed as NOUN_PROP.

Second, the following line is incorrect:

disa = MLEDisambiguator(CalimaStarAnalyzer(db_disa, backoff='NONE', norm_map='<camel_tools.utils.charmap.CharMapper object>', strict_digit=False, cache_size=0), mle_path=None)

Specifically, norm_map = '<camel_tools.utils.charmap.CharMapper object>' is incorrect here. I'm guessing you copied this directly from the documentation, which I agree is a bit confusing. All this is saying (in the documentation) is that the default value for norm_map is a pre-defined CharMapper object. Really you should just use mle = MLEDisambiguator.pretrained() if you do not want to change the default values.

Having said all that, please note that the code on the repo is still under development. Breaking changes are coming soon to camel_tools.calima_star, after which all the related code you have above will no longer work. I will also be adding more examples to clarify things.

Sorry for any frustration this has caused.

Sue-Fwl commented 4 years ago

I really appreciate your help. I did know the project is still ongoing, which is why I'm checking which parts I'm misusing and which aren't ready yet. I'm currently using the NER, Transliteration, and SA functionalities on my data and I'm pleased with the results I'm getting.

Regarding issues with CalimaStarAnalyzer, CalimaStarReinflector and CalimaStarGenerator

I installed it via pip in Anaconda. Then, when I found a newer update that wasn't available through pip (around the 26th of August), I reinstalled it from the master branch.

Where did you get this database from? 'databases\calima-msa-1.0.db'

I tried the db mentioned, 'almor-msa-r13.db', and the functions didn't work properly, so I searched for another one in the databases folder. I retried calling CalimaStarAnalyzer with the db you mentioned but I'm still not getting proper results.

db2 = CalimaStarDB('E:\\Anaconda3\\Lib\\site-packages\\camel_tools\\calima_star\\databases\\almor-msa\\almor-msa-r13.db', 'a')
db = CalimaStarDB('C:\\Users\\Sana trafalgar\\AppData\\Roaming\\camel_tools\\data\\morphology_db\\almor-msa-ext\\morphology.db', 'a')
analyzer = CalimaStarAnalyzer(db, 'NOAN_ALL')
analyzer2 = CalimaStarAnalyzer(db2, 'NOAN_ALL')

analyses = analyzer.analyze('مشيت')
analyses2 = analyzer2.analyze('مشيت')

Out[7]:  #both returned the exact same results
{'diac': 'مشيت', 'lex': 'مشيت_0', 'bw': 'مشيت/NOUN_PROP',  'gloss': 'NO_ANALYSIS',  'pos': 'noun_prop',  'prc3': '0',  'prc2': '0', 
 'prc1': '0', 'prc0': '0',  'per': 'na',  'asp': 'na', 'vox': 'na',  'mod': 'na',  'stt': 'i',  'cas': 'u',  'enc0': '0',  'rat': 'i',  'source': 'backoff', 
 'form_gen': '-',  'form_num': '-',   'catib6': '+NOM+',  'pos_freq': -99.0,   'gen': '-',  'lex_freq': -99.0,  'ud': '+PROPN+',  'pos_lex_freq': -99.0,  'num': '-',  'd3tok': 'NOAN',   'd2seg': 'NOAN',  'root': 'O',  'pattern': 'N1AN',  'd1seg': 'NOAN',  'd1tok': 'NOAN',  'caphi': 'm_sh_y_t',  'd3seg': 'NOAN',  'atbtok': 'NOAN',  'd2tok': 'NOAN',  'atbseg': 'NOAN',  'stem': 'مشيت',  'stemgloss': 'NO_ANALYSIS',  'stemcat': 'N0'}

For CalimaStarGenerator, I got an error message:

# tried both (morphology.db, almor-msa-r13.db) and both returned the same error
db2 = CalimaStarDB('E:\\Anaconda3\\Lib\\site-packages\\camel_tools\\calima_star\\databases\\almor-msa\\almor-msa-r13.db', 'a')
generator2 = CalimaStarGenerator(db2)
lemma = 'مشى'
features = {"form_num":"s","gen":"m","mod":"i"}
analysesLF2 = generator2.generate(lemma, features)

Traceback (most recent call last):
  File "", line 48, in <module>
    generator2 = CalimaStarGenerator(db2)
  File "\camel_tools\calima_star\generator.py", line 59, in __init__
    raise GeneratorError('DB does not support generation')

GeneratorError: DB does not support generation

For CalimaStarReinflector, I tried both dbs and got the same error too: 'ReinflectorError: DB does not support reinflection'

For MorphologicalTokenizer: after using pretrained(), it kinda worked, but it didn't detect all words:

from camel_tools.tokenizers import morphological
from camel_tools.disambig.mle import MLEDisambiguator

disa = MLEDisambiguator.pretrained()

res_morph = morphological.MorphologicalTokenizer(disa, scheme='d3tok', split=True, diac=True) #gives an error when diac is False
tokenized_morph = res_morph.tokenize(text_token)

 text_token
Out[29]: ['الطفلان', 'أكلا', 'الطعام', 'معاً', 'وأخذا', '5', 'تفاحات']
tokenized_morph
Out[30]: [['NOAN'], ['NOAN'], ['ال+', 'طَعامِ'], ['مَعاً'], ['NOAN'], ['5'], ['NOAN']]

owo commented 4 years ago

Hi @Sue-Fwl ,

Since a few things have changed since your original issue post, could you please download the latest copy of camel-tools from the master branch and re-install the camel-tools data as per the data installation instructions in the README? When re-installing from the master branch, you may need to do the following:

pip install --upgrade --force-reinstall .

First, the entire camel_tools.calima_star API has now changed to camel_tools.morphology (these are the breaking changes I mentioned at the end of my previous response). After performing the re-installation mentioned above, can you please try the updated examples in the documentation and let me know if everything works? Here are the doc links:

As for the MorphologicalTokenizer, when everything is installed correctly, I get the following output for your example:

>>> tokenized_morph
[['ال+', 'طِفْلانِ'], ['أَكَلا'], ['ال+', 'طَعامِ'], ['مَعاً'], ['وَ+', 'أَخَذا'], ['5'], ['تُفّاحات']]

The output you got is probably due to not having the data installed correctly. If you've installed everything correctly, then C:\Users\your_username\AppData\Roaming\camel_tools\data (I'm assuming you are running things on Windows) should be a valid directory.

However, this has also identified a bug where words that cannot be analyzed are output as 'NOAN' instead of as they were written. I will fix that and update you.
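
Until that fix lands, one possible workaround is to post-process the tokenizer output and put the original word back wherever the result is just 'NOAN'. The sketch below is plain Python and assumes the output shape shown earlier in this thread (one list of tokens per input word, as produced with split=True). restore_unanalyzed is a hypothetical helper, not part of camel_tools.

```python
def restore_unanalyzed(words, tokenized, marker='NOAN'):
    """Replace [marker] entries with the original word at the same index."""
    return [
        [word] if tokens == [marker] else tokens
        for word, tokens in zip(words, tokenized)
    ]


# Input and output copied from the earlier comment in this thread.
text_token = ['الطفلان', 'أكلا', 'الطعام', 'معاً', 'وأخذا', '5', 'تفاحات']
tokenized_morph = [['NOAN'], ['NOAN'], ['ال+', 'طَعامِ'], ['مَعاً'],
                   ['NOAN'], ['5'], ['NOAN']]

restored = restore_unanalyzed(text_token, tokenized_morph)
print(restored)
```

Note that this only hides the symptom. If most words come back as 'NOAN', the more likely cause is the data installation problem described above.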

Sue-Fwl commented 4 years ago

Greetings, Many thanks for the prompt replies.

Morphology: I'm still not familiar with all the features, but for the ones I know, the results are accurate. Many thanks. Examples:

'لقي قاتل الجندي مصرعه'
'قاتل الجندي في المعركة'
 'صمتٌ قاتل'
analyzer.analyze('قاتل')
analyses[0]:
{'diac': 'قاتَلَ', 'lex': 'قاتَل_1', 'bw': 'قاتَل/PV+َ/PVSUFF_SUBJ:3MS', 'gloss': 'fight+he;it_<verb>', 'pos': 'verb', 'prc3': '0', 'prc2': '0',
 'prc1': '0', 'prc0': '0', 'per': '3', 'asp': 'p', 'vox': 'a', 'mod': 'i', 'stt': 'na', 'cas': 'na', 'enc0': '0', 'rat': 'n', 'source': 'lex', 'form_gen': 'm',
 'form_num': 's', 'catib6': '+VRB+', 'pos_freq': -1.023208, 'd3tok': 'قاتَلَ', 'd2seg': 'قاتَلَ', 'root': 'ق.ت.ل', 'd1seg': 'قاتَلَ', 'gen': 'm',
 'd1tok': 'قاتَلَ', 'caphi': 'q_aa_t_a_l_a', 'd3seg': 'قاتَلَ', 'lex_freq': -4.497461, 'ud': '+VERB+', 'pattern': '1ا2َ3َ', 'atbtok': 'قاتَلَ',
 'pos_lex_freq': -4.497461, 'd2tok': 'قاتَلَ', 'atbseg': 'قاتَلَ', 'num': 's', 'stem': 'قاتَل', 'stemgloss': 'fight', 'stemcat': 'PV'}
analyses[1]:
{'diac': 'قاتِل', 'lex': 'قاتِل_1', 'bw': 'قاتِل/ADJ', 'gloss': 'deadly;fatal', 'pos': 'adj', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'stt': 'i', 'cas': 'u', 'enc0': '0', 'rat': 'n', 'source': 'lex', 'form_gen': 'm', 'form_num': 's', 'catib6': '+NOM+', 'pos_freq': -0.9868824, 'd3tok': 'قاتِل', 'd2seg': 'قاتِل', 'root': 'ق.ت.ل', 'd1seg': 'قاتِل', 'gen': 'm', 'd1tok': 'قاتِل', 'caphi': 'q_aa_t_i_l', 'd3seg': 'قاتِل', 'lex_freq': -4.660188, 'ud': '+ADJ+', 'pattern': '1ا2ِ3', 'atbtok': 'قاتِل', 'pos_lex_freq': -4.660188,
 'd2tok': 'قاتِل', 'atbseg': 'قاتِل', 'num': 's', 'stem': 'قاتِل', 'stemgloss': 'deadly;fatal', 'stemcat': 'N-ap'}
analyses[12]:
{'diac': 'قاتِل', 'lex': 'قاتِل_2', 'bw': 'قاتِل/NOUN', 'gloss': 'murderer;assassin', 'pos': 'noun', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'stt': 'i', 'cas': 'u', 'enc0': '0', 'rat': 'r', 'source': 'lex', 'form_gen': 'm', 'form_num': 's', 'catib6': '+NOM+', 'pos_freq': -0.4344233, 'd3tok': 'قاتِل', 'd2seg': 'قاتِل', 'root': 'ق.ت.ل', 'd1seg': 'قاتِل', 'gen': 'm', 'd1tok': 'قاتِل',
 'caphi': 'q_aa_t_i_l', 'd3seg': 'قاتِل', 'lex_freq': -4.497461, 'ud': '+NOUN+', 'pattern': '1ا2ِ3', 'atbtok': 'قاتِل', 'pos_lex_freq': -4.497461,
 'd2tok': 'قاتِل', 'atbseg': 'قاتِل', 'num': 's', 'stem': 'قاتِل', 'stemgloss': 'murderer;assassin', 'stemcat': 'Nall'}

On a side note, I would like to suggest creating a notice of some sort when modifying the data. While testing the new updates I got an error concerning the database (I can't remember it exactly since I forgot to copy the message), and I figured that, since the project went through major changes, it's normal for the database to change too. So I reinstalled the data files, replacing the old ones (downloaded on the 28th of August), and the modules worked fine, as did the MorphologicalTokenizer.

owo commented 4 years ago

Hi @Sue-Fwl ,

I'm glad everything worked out.

On a side note, I would like to suggest creating a notice of some sort when modifying the data.

Yes, we will definitely do so on official releases (those installed from pip), and we will indicate the minimum camel_tools version the current data files are compatible with.

However, we are moving very quickly with development at the moment for the next official release, so you should expect that both code and data may change at any moment. The master branch is not an official release but represents the current state of development.

As a rule of thumb, please reinstall the data files whenever you reinstall camel_tools from master, or at the very least whenever you run into issues.

CAMeL-Lab / camel_tools

Some operations/functions usage need more clarification please #18