Closed matyaskopp closed 1 year ago
Can we get instructions on where to correct the description of the NLP pipelines used? This will quite likely need to happen for BA, HR, RS, SL, BG.
Just checked BG, it also has a faulty description of a similar kind.
@TomazErjavec can you help here? Do we want to fix it in the ParlaMint 4.0 release?
I've already processed BA and BG, so I would then need to do it again and time is short. What about somebody (@nljubesi?) posting here which corpora should have their appInfo element changed and with what, and hopefully I can fit it in? (the current state is in Samples)
For Bosnian (discussed here), and to keep things easily applicable to the remaining South Slavic languages, this is a reasonable description:
<appInfo>
<application version="2.0" ident="classla-stanza">
<label>CLASSLA-Stanza</label>
<desc xml:lang="en">Segmentation, tokenization, MSD tagging, lemmatisation and named entity recognition with CLASSLA-Stanza, available from <ref target="https://github.com/clarinsi/classla">https://github.com/clarinsi/classla</ref>.</desc>
</application>
</appInfo>
I also checked the other languages that were annotated with CLASSLA-Stanza, and all of them could profit from this description. Slovenian is the most up to date but is still outdated; the others are severely outdated.
Please make sure that all NLP applications are exchanged for this single entry, including the tagger and the NER system.
> this is a reasonable description:
Is it? What about UD tagging and esp. syntax?
> I also checked other language
It would help me a lot if you listed (as I asked) the two-letter country codes of the affected corpora. My head is already exploding from the multiple corpora I need to fix.
Description additionally simplified:
<appInfo>
<application version="2.0" ident="classla-stanza">
<label>CLASSLA-Stanza</label>
<desc xml:lang="en">Linguistic processing with CLASSLA-Stanza, available from <ref target="https://github.com/clarinsi/classla">https://github.com/clarinsi/classla</ref>.</desc>
</application>
</appInfo>
Country codes are: BA, BG, HR, RS, SI
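For anyone who needs to apply the same substitution to several corpus root files, here is a minimal Python sketch of the idea. It is not the script used for the release. It assumes the files are in the TEI namespace and that every `application` child of `appInfo` is an NLP tool to be swapped for the single simplified entry above; the function names `build_classla_application` and `replace_app_info` are made up for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
CLASSLA_URL = "https://github.com/clarinsi/classla"


def build_classla_application():
    """Build the single simplified <application> entry from this thread."""
    app = ET.Element(
        f"{{{TEI_NS}}}application",
        {"version": "2.0", "ident": "classla-stanza"},
    )
    label = ET.SubElement(app, f"{{{TEI_NS}}}label")
    label.text = "CLASSLA-Stanza"
    desc = ET.SubElement(app, f"{{{TEI_NS}}}desc", {f"{{{XML_NS}}}lang": "en"})
    desc.text = "Linguistic processing with CLASSLA-Stanza, available from "
    ref = ET.SubElement(desc, f"{{{TEI_NS}}}ref", {"target": CLASSLA_URL})
    ref.text = CLASSLA_URL
    ref.tail = "."
    return app


def replace_app_info(root):
    """Swap all <application> children of each <appInfo> for the single entry.

    Assumption: every <application> listed is an NLP tool that should go.
    Returns the number of <appInfo> elements that were rewritten.
    """
    changed = 0
    # Materialise the iterator first so the tree can be edited safely.
    for app_info in list(root.iter(f"{{{TEI_NS}}}appInfo")):
        for old in app_info.findall(f"{{{TEI_NS}}}application"):
            app_info.remove(old)
        app_info.append(build_classla_application())
        changed += 1
    return changed
```

To use it on a file, parse it with `ET.parse(path)`, call `replace_app_info(tree.getroot())`, and write the tree back. Calling `ET.register_namespace("", TEI_NS)` beforehand keeps TEI as the default namespace in the output. Note that ElementTree drops comments by default and may reformat whitespace, so the result should be diffed against the original before committing.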
Thanks @nljubesi, I have substituted it in the sources and will process BA and BG again. So, closing.
Croatian model
https://github.com/clarin-eric/ParlaMint/blob/19b751a624ac93f92274adb5920b7d38e0d70e45/Samples/ParlaMint-BA/README.md?plain=1#L39-L41
Slovene model:
https://github.com/clarin-eric/ParlaMint/blob/19b751a624ac93f92274adb5920b7d38e0d70e45/Samples/ParlaMint-BA/ParlaMint-BA.ana.xml#L147-L150
also NER model for Slovene
https://github.com/clarin-eric/ParlaMint/blob/19b751a624ac93f92274adb5920b7d38e0d70e45/Samples/ParlaMint-BA/ParlaMint-BA.ana.xml#L147-L150