clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

BA: inconsistency between model described in README and application info #799

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

The README describes a Croatian model:

https://github.com/clarin-eric/ParlaMint/blob/19b751a624ac93f92274adb5920b7d38e0d70e45/Samples/ParlaMint-BA/README.md?plain=1#L39-L41

The application info in ParlaMint-BA.ana.xml describes a Slovene model:

https://github.com/clarin-eric/ParlaMint/blob/19b751a624ac93f92274adb5920b7d38e0d70e45/Samples/ParlaMint-BA/ParlaMint-BA.ana.xml#L147-L150

The NER model is also described as a Slovene one:

https://github.com/clarin-eric/ParlaMint/blob/19b751a624ac93f92274adb5920b7d38e0d70e45/Samples/ParlaMint-BA/ParlaMint-BA.ana.xml#L147-L150

nljubesi commented 1 year ago

Can we get instructions on where to correct the description of the NLP pipelines used? This will quite likely need to happen for BA, HR, RS, SI, BG.

I just checked BG; it has a similarly faulty description.

https://github.com/clarin-eric/ParlaMint/blob/19b751a624ac93f92274adb5920b7d38e0d70e45/Samples/ParlaMint-BG/ParlaMint-BG.ana.xml#L146-L151
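
To compare what a corpus header claims against the README, it can help to list the `application` entries under `appInfo`. The following is only a minimal sketch: the `SAMPLE` header is a hypothetical, heavily trimmed stand-in (its idents, labels and descriptions are invented for illustration), and `list_applications` is a helper name introduced here, not part of the ParlaMint tooling. It assumes the header uses the TEI namespace.

```python
import xml.etree.ElementTree as ET

# TEI P5 namespace, assumed to be the namespace of the corpus header.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily trimmed header; idents and descriptions are invented.
SAMPLE = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <encodingDesc>
      <appInfo>
        <application version="1.0" ident="tagger-x">
          <label>Tagger X</label>
          <desc xml:lang="en">Tagging with a
            Slovene model.</desc>
        </application>
        <application version="1.0" ident="ner-y">
          <label>NER Y</label>
          <desc xml:lang="en">NER with a Slovene model.</desc>
        </application>
      </appInfo>
    </encodingDesc>
  </teiHeader>
</teiCorpus>"""


def list_applications(xml_text):
    """Return (ident, version, label, description) for each appInfo/application."""
    root = ET.fromstring(xml_text)
    apps = []
    for app in root.iterfind(".//tei:appInfo/tei:application", TEI_NS):
        label = app.findtext("tei:label", default="", namespaces=TEI_NS)
        desc_el = app.find("tei:desc", TEI_NS)
        # Join nested text (e.g. inside <ref>) and collapse whitespace.
        desc = " ".join("".join(desc_el.itertext()).split()) if desc_el is not None else ""
        apps.append((app.get("ident"), app.get("version"), label, desc))
    return apps


if __name__ == "__main__":
    for ident, version, label, desc in list_applications(SAMPLE):
        print(f"{ident} {version} | {label} | {desc}")
```

Running the same function over each `.ana.xml` root file would give a quick overview of which corpora still carry descriptions of the wrong model.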

matyaskopp commented 1 year ago

@TomazErjavec can you help here? Do we want to fix it in the ParlaMint 4.0 release?

TomazErjavec commented 1 year ago

> @TomazErjavec can you help here? Do we want to fix it in the ParlaMint 4.0 release?

I've already processed BA and BG, so I would need to do it again, and time is short. Could somebody (@nljubesi?) post here which corpora should have their appInfo element changed, and to what, so that I can hopefully fit it in? (The current state is in Samples.)

nljubesi commented 1 year ago

For Bosnian (discussed here), and to keep things easily applicable to the remaining South Slavic languages, this is a reasonable description:

         <appInfo>
            <application version="2.0" ident="classla-stanza">
               <label>CLASSLA-Stanza</label>
               <desc xml:lang="en">Segmentation, tokenization, MSD tagging, lemmatisation and named entity recognition with CLASSLA-Stanza, available from <ref target="https://github.com/clarinsi/classla">https://github.com/clarinsi/classla</ref>.</desc>
            </application>
         </appInfo>

nljubesi commented 1 year ago

I also checked the other languages that were annotated with CLASSLA-Stanza, and all of them would benefit from this description. The Slovenian description is the most up to date but is still outdated; the others are severely outdated.

Please make sure that all NLP applications are replaced by this single entry, including the tagger and the NER system.

TomazErjavec commented 1 year ago

> this is a reasonable description:

Is it? What about UD tagging and esp. syntax?

> I also checked other language

It would help me a lot if you listed (as I asked) the two-letter country codes of the affected corpora. My head is already exploding from the multiple corpora I need to fix.

nljubesi commented 1 year ago

Description further simplified:

         <appInfo>
            <application version="2.0" ident="classla-stanza">
               <label>CLASSLA-Stanza</label>
               <desc xml:lang="en">Linguistic processing with CLASSLA-Stanza, available from <ref target="https://github.com/clarinsi/classla">https://github.com/clarinsi/classla</ref>.</desc>
            </application>
         </appInfo>

Country codes are: BA, BG, HR, RS, SI
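
For anyone who wants to apply the substitution mechanically, here is a rough Python sketch of replacing every `application` under `appInfo` with the single CLASSLA-Stanza entry above. It is not how the sources were actually updated; `replace_app_info`, `classla_application` and the `SAMPLE` header are hypothetical names and data introduced for illustration. It assumes the header is in the TEI namespace and that every existing `application` entry describes an NLP tool that should go. Note that ElementTree drops comments and does not preserve the original formatting.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
CLASSLA_URL = "https://github.com/clarinsi/classla"

# Serialize TEI elements without an ns0: prefix.
ET.register_namespace("", TEI)

# Hypothetical, heavily trimmed header with two outdated entries.
SAMPLE = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><encodingDesc><appInfo>
    <application version="1.0" ident="tagger-x"><label>Tagger X</label></application>
    <application version="1.0" ident="ner-y"><label>NER Y</label></application>
  </appInfo></encodingDesc></teiHeader>
</teiCorpus>"""


def classla_application():
    """Build the simplified CLASSLA-Stanza <application> entry from this thread."""
    app = ET.Element(f"{{{TEI}}}application",
                     {"version": "2.0", "ident": "classla-stanza"})
    label = ET.SubElement(app, f"{{{TEI}}}label")
    label.text = "CLASSLA-Stanza"
    desc = ET.SubElement(app, f"{{{TEI}}}desc", {XML_LANG: "en"})
    desc.text = "Linguistic processing with CLASSLA-Stanza, available from "
    ref = ET.SubElement(desc, f"{{{TEI}}}ref", {"target": CLASSLA_URL})
    ref.text = CLASSLA_URL
    ref.tail = "."
    return app


def replace_app_info(xml_text):
    """Replace all children of every <appInfo> with the single CLASSLA entry."""
    root = ET.fromstring(xml_text)
    for app_info in root.iter(f"{{{TEI}}}appInfo"):
        for old in list(app_info):  # copy the list: we mutate while iterating
            app_info.remove(old)
        app_info.append(classla_application())
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(replace_app_info(SAMPLE))
```

The same function could be run over the root files of BA, BG, HR, RS and SI, but a manual check of each result would still be advisable, since a header may contain entries that should be kept.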

TomazErjavec commented 1 year ago

Thanks @nljubesi, I have substituted it in the sources and will process BA and BG again. So, closing.