Closed Sue-Fwl closed 4 years ago
Hi @Sue-Fwl,
Here are the answers to your queries:
"camel_arclean" I couldn't find its class or the way to invoke it
This just uses a CharMapper instance with the 'arclean' scheme.
You can instantiate one as follows:
from camel_tools.utils.charmap import CharMapper
arclean = CharMapper.builtin_mapper('arclean')
noisy_text = '...' # This is the noisy text we wish to clean
cleaned_text = arclean(noisy_text)
"dialectid " I've tried to run the example provided but I'm getting this error
So there are a couple of issues here:
Regarding issues with CalimaStarAnalyzer, CalimaStarReinflector, and CalimaStarGenerator, I'm a bit confused about your setup.
Can you tell me how you installed camel-tools? Did you pip/conda install it, or did you install it from source from the master branch?
When you mention the following line:
db = CalimaStarDB('E:\\Anaconda3\\Lib\\site-packages\\camel_tools\\calima_star\\databases\\calima-msa-1.0.db', 'a')
Where did you get this database from?
If you installed via pip/conda, only one database is provided. It is part of the package itself and would be located at E:\Anaconda3\Lib\site-packages\camel_tools\calima_star\databases\almor-msa\almor-msa-r13.db.
On the other hand, if you installed from source from the master branch, as per the README, the data should be in a separate directory from the Python package.
Assuming you installed the data at ~\AppData\Roaming\camel_tools\, then the default database should be located at ~\AppData\Roaming\camel_tools\data\morphology_db\almor_msa_ext\morphology.db.
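For anyone unsure which of the two installs they ended up with, a minimal sketch along these lines can check which database file is actually on disk. It uses only the standard library; the two paths are the ones quoted above and will differ on other machines, and the helper name `first_existing` is made up for illustration.

```python
from pathlib import Path


def first_existing(candidates):
    """Return the first candidate that is an existing file, or None."""
    for candidate in candidates:
        path = Path(candidate).expanduser()  # resolves a leading "~"
        if path.is_file():
            return path
    return None


# Locations quoted in this thread; adjust them to your own machine.
CANDIDATE_DBS = [
    r"E:\Anaconda3\Lib\site-packages\camel_tools\calima_star\databases\almor-msa\almor-msa-r13.db",
    "~/AppData/Roaming/camel_tools/data/morphology_db/almor_msa_ext/morphology.db",
]

db_path = first_existing(CANDIDATE_DBS)
print(db_path if db_path else "No database file found at the listed locations")
```

If nothing is found, the data was probably not installed where the package expects it.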
As for issues with MLEDisambiguator, there are a couple of things:
First, I just found an issue in the MLEDisambiguator code: it wasn't correctly loading the database file. I'll fix this and update the repo ASAP. This is the cause of one of the disambiguation issues you have, where everything is being analyzed as NOUN_PROP.
Second, the following line is incorrect:
disa = MLEDisambiguator(CalimaStarAnalyzer(db_disa, backoff='NONE', norm_map='<camel_tools.utils.charmap.CharMapper object>', strict_digit=False, cache_size=0), mle_path=None)
Specifically, norm_map='<camel_tools.utils.charmap.CharMapper object>' is incorrect here, since it passes a string rather than a CharMapper object. I'm guessing you copied this directly from the documentation, which I agree is a bit confusing. All the documentation is saying is that the default value for norm_map is a pre-defined CharMapper object. Really, you should just use mle = MLEDisambiguator.pretrained() if you do not want to change the default values.
Having said all that, please note that the code on the repo is still under development and that there are breaking changes coming soon to camel_tools.calima_star, after which all the related code you have above will no longer work. I will also be adding more examples to clarify things.
Sorry for any frustration this has caused.
Really appreciate your help. I did know the project is still ongoing, which is why I'm checking which parts I'm misusing and which aren't ready yet. I'm currently using the NER, Transliteration, and SA functionalities on my data and I'm pleased with the results I'm getting.
Regarding issues with CalimaStarAnalyzer, CalimaStarReinflector and CalimaStarGenerator
I installed it via pip in Anaconda. Then, when I found a new update that wasn't available through pip (around the 26th of August), I reinstalled it from the master branch.
Where did you get this database from? 'databases\calima-msa-1.0.db'
I tried the db mentioned, 'almor-msa-r13.db', and the functions didn't work properly, so I searched for another one in the databases folder. I retried calling CalimaStarAnalyzer with the db you mentioned, but I'm still not getting proper results.
db2 = CalimaStarDB('E:\\Anaconda3\\Lib\\site-packages\\camel_tools\\calima_star\\databases\\almor-msa\\almor-msa-r13.db', 'a')
db = CalimaStarDB('C:\\Users\\Sana trafalgar\\AppData\\Roaming\\camel_tools\\data\\morphology_db\\almor-msa-ext\\morphology.db', 'a')
analyzer = CalimaStarAnalyzer(db, 'NOAN_ALL')
analyzer2 = CalimaStarAnalyzer(db2, 'NOAN_ALL')
analyses = analyzer.analyze('مشيت')
analyses2 = analyzer2.analyze('مشيت')
Out[7]: #both returned the exact same results
{'diac': 'مشيت', 'lex': 'مشيت_0', 'bw': 'مشيت/NOUN_PROP', 'gloss': 'NO_ANALYSIS', 'pos': 'noun_prop', 'prc3': '0', 'prc2': '0',
'prc1': '0', 'prc0': '0', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'stt': 'i', 'cas': 'u', 'enc0': '0', 'rat': 'i', 'source': 'backoff',
'form_gen': '-', 'form_num': '-', 'catib6': '+NOM+', 'pos_freq': -99.0, 'gen': '-', 'lex_freq': -99.0, 'ud': '+PROPN+', 'pos_lex_freq': -99.0, 'num': '-', 'd3tok': 'NOAN', 'd2seg': 'NOAN', 'root': 'O', 'pattern': 'N1AN', 'd1seg': 'NOAN', 'd1tok': 'NOAN', 'caphi': 'm_sh_y_t', 'd3seg': 'NOAN', 'atbtok': 'NOAN', 'd2tok': 'NOAN', 'atbseg': 'NOAN', 'stem': 'مشيت', 'stemgloss': 'NO_ANALYSIS', 'stemcat': 'N0'}
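The output above carries 'source': 'backoff' and 'gloss': 'NO_ANALYSIS', which suggests the analyzer fell back to a guess instead of finding the word in the database. A minimal sketch for separating such guesses from real lexicon hits follows. It works on plain dicts shaped like the ones in this thread; treating those two field values as the marker of a backoff analysis is an assumption drawn from the outputs here, and the helper names are made up for illustration.

```python
def is_backoff(analysis):
    """True if an analysis dict looks like a backoff guess, not a lexicon hit."""
    return (analysis.get("source") == "backoff"
            or analysis.get("gloss") == "NO_ANALYSIS")


def lexicon_analyses(analyses):
    """Keep only the analyses that do not look like backoff guesses."""
    return [a for a in analyses if not is_backoff(a)]


# Trimmed copies of analyses quoted in this thread.
backoff_guess = {"diac": "مشيت", "pos": "noun_prop",
                 "gloss": "NO_ANALYSIS", "source": "backoff"}
lexicon_hit = {"diac": "قاتَلَ", "pos": "verb",
               "gloss": "fight+he;it_<verb>", "source": "lex"}

print(len(lexicon_analyses([backoff_guess])))
print(len(lexicon_analyses([backoff_guess, lexicon_hit])))
```

If every word comes back with only backoff analyses, the database itself is the likely culprit rather than the individual word.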
For CalimaStarGenerator, I got an error message:
# tried both (morphology.db, almor-msa-r13.db) and both returned the same error
db2 = CalimaStarDB('E:\\Anaconda3\\Lib\\site-packages\\camel_tools\\calima_star\\databases\\almor-msa\\almor-msa-r13.db', 'a')
generator2 = CalimaStarGenerator(db2)
lemma = 'مشى'
features = {"form_num":"s","gen":"m","mod":"i"}
analysesLF2 = generator2.generate(lemma, features)
Traceback (most recent call last):
File "", line 48, in <module>
generator2 = CalimaStarGenerator(db2)
File "\camel_tools\calima_star\generator.py", line 59, in __init__
raise GeneratorError('DB does not support generation')
GeneratorError: DB does not support generation
For CalimaStarReinflector, I tried both dbs and got the same error too: 'ReinflectorError: DB does not support reinflection'
For MorphologicalTokenizer, after using pretrained(), it kind of worked, but it didn't detect all words:
from camel_tools.tokenizers import morphological
from camel_tools.disambig.mle import MLEDisambiguator
disa = MLEDisambiguator.pretrained()
res_morph = morphological.MorphologicalTokenizer(disa, scheme='d3tok', split=True, diac=True) #gives an error when diac is False
tokenized_morph = res_morph.tokenize(text_token)
text_token
Out[29]: ['الطفلان', 'أكلا', 'الطعام', 'معاً', 'وأخذا', '5', 'تفاحات']
tokenized_morph
Out[30]: [['NOAN'], ['NOAN'], ['ال+', 'طَعامِ'], ['مَعاً'], ['NOAN'], ['5'], ['NOAN']]
Hi @Sue-Fwl ,
Since a few things have changed since your original issue post, could you please download the latest copy of camel-tools from the master branch and re-install the camel-tools data as per the data installation instructions in the README? When re-installing from the master branch, you may need to do the following:
pip install --upgrade --force-reinstall .
First, the entire camel_tools.calima_star API has now changed to camel_tools.morphology (these are the breaking changes I mentioned at the end of my previous response). After performing the re-installation mentioned above, can you please try the updated examples in the documentation and let me know if everything works? Here are the doc links:
As for the MorphologicalTokenizer, when everything is installed correctly, I get the following output for your example:
>>> tokenized_morph
[['ال+', 'طِفْلانِ'], ['أَكَلا'], ['ال+', 'طَعامِ'], ['مَعاً'], ['وَ+', 'أَخَذا'], ['5'], ['تُفّاحات']]
The output you got is probably due to not having the data installed correctly. If you've installed everything correctly, then C:\Users\your_username\AppData\Roaming\camel_tools\data (I'm assuming you are running things on Windows) should be a valid directory.
However, this has also identified a bug where words that cannot be analyzed are output as 'NOAN' instead of being output as they were written. I will fix that and update you.
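Until that fix lands, one possible stopgap is to post-process the tokenizer output and put the original word back wherever a lone 'NOAN' placeholder appears. The sketch below is not part of camel_tools: `restore_unanalyzed` is a hypothetical helper, and it assumes the tokenizer returns one list of tokens per input word, as in the split=True output shown earlier in this thread.

```python
def restore_unanalyzed(words, tokenized, placeholder="NOAN"):
    """Replace lone placeholder tokens with the original input word."""
    if len(words) != len(tokenized):
        raise ValueError("words and tokenized must have the same length")
    return [[word] if tokens == [placeholder] else tokens
            for word, tokens in zip(words, tokenized)]


# Input and output copied from the earlier comment in this thread.
text_token = ['الطفلان', 'أكلا', 'الطعام', 'معاً', 'وأخذا', '5', 'تفاحات']
tokenized_morph = [['NOAN'], ['NOAN'], ['ال+', 'طَعامِ'], ['مَعاً'],
                   ['NOAN'], ['5'], ['NOAN']]

print(restore_unanalyzed(text_token, tokenized_morph))
```

This only hides the symptom. With the data installed correctly, most of these words should be analyzed in the first place.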
Greetings, Many thanks for the prompt replies.
Morphology: While I'm not yet familiar with all the features, the results are accurate for the ones I know. Many thanks. Examples:
'لقي قاتل الجندي مصرعه'
'قاتل الجندي في المعركة'
'صمتٌ قاتل'
analyzer.analyze('قاتل')
analyses[0]:
{'diac': 'قاتَلَ', 'lex': 'قاتَل_1', 'bw': 'قاتَل/PV+َ/PVSUFF_SUBJ:3MS', 'gloss': 'fight+he;it_<verb>', 'pos': 'verb', 'prc3': '0', 'prc2': '0',
'prc1': '0', 'prc0': '0', 'per': '3', 'asp': 'p', 'vox': 'a', 'mod': 'i', 'stt': 'na', 'cas': 'na', 'enc0': '0', 'rat': 'n', 'source': 'lex', 'form_gen': 'm',
'form_num': 's', 'catib6': '+VRB+', 'pos_freq': -1.023208, 'd3tok': 'قاتَلَ', 'd2seg': 'قاتَلَ', 'root': 'ق.ت.ل', 'd1seg': 'قاتَلَ', 'gen': 'm',
'd1tok': 'قاتَلَ', 'caphi': 'q_aa_t_a_l_a', 'd3seg': 'قاتَلَ', 'lex_freq': -4.497461, 'ud': '+VERB+', 'pattern': '1ا2َ3َ', 'atbtok': 'قاتَلَ',
'pos_lex_freq': -4.497461, 'd2tok': 'قاتَلَ', 'atbseg': 'قاتَلَ', 'num': 's', 'stem': 'قاتَل', 'stemgloss': 'fight', 'stemcat': 'PV'}
analyses[1]:
{'diac': 'قاتِل', 'lex': 'قاتِل_1', 'bw': 'قاتِل/ADJ', 'gloss': 'deadly;fatal', 'pos': 'adj', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'stt': 'i', 'cas': 'u', 'enc0': '0', 'rat': 'n', 'source': 'lex', 'form_gen': 'm', 'form_num': 's', 'catib6': '+NOM+', 'pos_freq': -0.9868824, 'd3tok': 'قاتِل', 'd2seg': 'قاتِل', 'root': 'ق.ت.ل', 'd1seg': 'قاتِل', 'gen': 'm', 'd1tok': 'قاتِل', 'caphi': 'q_aa_t_i_l', 'd3seg': 'قاتِل', 'lex_freq': -4.660188, 'ud': '+ADJ+', 'pattern': '1ا2ِ3', 'atbtok': 'قاتِل', 'pos_lex_freq': -4.660188,
'd2tok': 'قاتِل', 'atbseg': 'قاتِل', 'num': 's', 'stem': 'قاتِل', 'stemgloss': 'deadly;fatal', 'stemcat': 'N-ap'}
analyses[12]:
{'diac': 'قاتِل', 'lex': 'قاتِل_2', 'bw': 'قاتِل/NOUN', 'gloss': 'murderer;assassin', 'pos': 'noun', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'stt': 'i', 'cas': 'u', 'enc0': '0', 'rat': 'r', 'source': 'lex', 'form_gen': 'm', 'form_num': 's', 'catib6': '+NOM+', 'pos_freq': -0.4344233, 'd3tok': 'قاتِل', 'd2seg': 'قاتِل', 'root': 'ق.ت.ل', 'd1seg': 'قاتِل', 'gen': 'm', 'd1tok': 'قاتِل',
'caphi': 'q_aa_t_i_l', 'd3seg': 'قاتِل', 'lex_freq': -4.497461, 'ud': '+NOUN+', 'pattern': '1ا2ِ3', 'atbtok': 'قاتِل', 'pos_lex_freq': -4.497461,
'd2tok': 'قاتِل', 'atbseg': 'قاتِل', 'num': 's', 'stem': 'قاتِل', 'stemgloss': 'murderer;assassin', 'stemcat': 'Nall'}
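Since the analyzer returns many analyses per word, a short sketch like the following can help skim the distinct readings without scrolling through every feature. It relies only on the 'pos', 'lex', and 'gloss' keys visible in the output above, and `distinct_readings` is a made-up helper name rather than a camel_tools function.

```python
def distinct_readings(analyses):
    """Return unique (pos, lex, gloss) triples in first-seen order."""
    seen = []
    for analysis in analyses:
        key = (analysis.get("pos"), analysis.get("lex"), analysis.get("gloss"))
        if key not in seen:
            seen.append(key)
    return seen


# Trimmed copies of the three analyses of 'قاتل' shown above.
sample = [
    {"pos": "verb", "lex": "قاتَل_1", "gloss": "fight+he;it_<verb>"},
    {"pos": "adj", "lex": "قاتِل_1", "gloss": "deadly;fatal"},
    {"pos": "noun", "lex": "قاتِل_2", "gloss": "murderer;assassin"},
    {"pos": "adj", "lex": "قاتِل_1", "gloss": "deadly;fatal"},  # repeat, dropped
]

for pos, lex, gloss in distinct_readings(sample):
    print(pos, lex, gloss)
```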
On a side note, I would like to suggest creating a notice of some sort when modifying the data. While testing the new updates I got an error concerning the database (I can't remember exactly, since I forgot to copy the message), and I figured that, as the project went through major changes, it's normal for the database to change too. So I reinstalled the data files, replacing the old ones (downloaded on the 28th of August), and the modules worked fine, and so did the MorphologicalTokenizer.
Hi @Sue-Fwl ,
I'm glad everything worked out.
On a side note, I would like to suggest creating a notice of some sort when modifying the data.
Yes, we will definitely do so on official releases (those installed from pip), and we will indicate the minimum camel_tools version the current data files are compatible with.
However, we are moving very quickly with development at the moment for the next official release, so you should expect that both code and data may change at any moment. The master branch is not an official release but represents the current state of development.
As a rule of thumb, please reinstall the data files whenever you reinstall camel_tools from master, or at the very least whenever you experience any issues.
Greetings,
Appreciate the efforts of doing such a grand project and making it accessible to other researchers,
I'm currently trying out the tools to clean my data and extract features from it, and I'm having trouble using some of the functionalities because their documentation isn't published yet or isn't clear enough.
"camel_arclean" I couldn't find its class or the way to invoke it
"dialectid " I've tried to run the example provided but I'm getting this error
"CalimaStarAnalyzer" I'm getting POS=noun_prop for all words, and never getting a stem. I'm relying on the first list in the list of lists returned by the functions, although I checked the rest and didn't find any correct analysis. I tried it on my data and on the example provided but couldn't figure out what's wrong. For example, the verb 'مشيت', when analyzed, gives a number of possible tags, but none of them is 'verb'.
A snippet of returned analysis
"Generate lemma and features (CalimaStarReinflector)" I couldn't find the file of the lemma db, and it wasn't clear how to construct the features dictionary.
"CalimaStarGenerator" Same issue as above.
" Morphological Analyzer " I'm not getting any analysis results, and the morphological tokenizer 'tokenize' is giving the same results as the 'simple_word_tokenize' in tokenizers
" MLEDisambiguator "
I'm getting results on some nouns, but so far I've had no luck with POS or other features such as form_num, gen, and mod when it comes to plurals, words with an attached pronoun, verbs, etc.