clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

ES-CT: paragraphs wrongly marked as English #697

Closed TomazErjavec closed 1 year ago

TomazErjavec commented 1 year ago

In the ES-CT corpus there are 18 paragraphs that have @xml:lang="en", even though at least the ones I've checked are not in fact in English. This is currently causing problems for making the MTed corpus, as the source for MT was the CoNLL-U files split into ca and es, which now need to be joined to make the complete MTed corpus. However, the so-called "en" paragraphs are in neither the ca nor the es CoNLL-U files, so they were not translated and are missing from the MT output. I will find a workaround for this somehow, but the errors should be fixed for 3.1. Below is the list of "en" paragraphs:

ParlaMint-ES-CT_2016-07-13-2001.124.0.5
ParlaMint-ES-CT_2016-07-27-2103.108.0.2
ParlaMint-ES-CT_2016-11-09-2501.219.0.0
ParlaMint-ES-CT_2016-11-30-2601.136.0.3
ParlaMint-ES-CT_2016-12-01-2602.16.0.5
ParlaMint-ES-CT_2016-12-21-2702.180.0.8
ParlaMint-ES-CT_2016-12-22-2703.52.0.28
ParlaMint-ES-CT_2017-04-05-3301.161.0.29
ParlaMint-ES-CT_2018-05-12-0901.3.0.25
ParlaMint-ES-CT_2018-05-12-0901.3.0.26
ParlaMint-ES-CT_2018-05-12-0901.3.0.27
ParlaMint-ES-CT_2019-06-14-3302.57.0.4
ParlaMint-ES-CT_2019-06-14-3302.65.0.3
ParlaMint-ES-CT_2019-06-14-3302.65.0.4
ParlaMint-ES-CT_2019-10-17-4001.19.0.7
ParlaMint-ES-CT_2021-07-07-0901.150.0.38
ParlaMint-ES-CT_2021-07-07-0901.152.0.11
ParlaMint-ES-CT_2021-11-30-1901.30.0.2
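
A list like the one above could be produced with a short script. The following is only a minimal sketch, not the tooling actually used in the project: it assumes that paragraphs are encoded as TEI `seg` elements carrying their own `xml:lang` and `xml:id` attributes, and the function name `find_segs_by_lang` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def find_segs_by_lang(xml_text, lang="en"):
    """Return the xml:id of every <seg> whose own xml:lang equals `lang`.

    Assumption: paragraphs are TEI <seg> elements. Only the attribute set
    directly on the element is checked; inherited xml:lang is ignored.
    """
    root = ET.fromstring(xml_text)
    ids = []
    for seg in root.iter(f"{{{TEI_NS}}}seg"):
        if seg.get(f"{{{XML_NS}}}lang") == lang:
            ids.append(seg.get(f"{{{XML_NS}}}id"))
    return ids


# Hypothetical sample, not taken from the corpus: the second segment is
# Catalan text wrongly tagged as English.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="ca">
  <text><body><div>
    <u xml:id="demo.u1" who="#speaker">
      <seg xml:id="demo.u1.0" xml:lang="ca">Bon dia.</seg>
      <seg xml:id="demo.u1.1" xml:lang="en">Gràcies, president.</seg>
      <seg xml:id="demo.u1.2" xml:lang="es">Buenos días.</seg>
    </u>
  </div></body></text>
</TEI>"""

if __name__ == "__main__":
    print(find_segs_by_lang(SAMPLE))
```

Running the same function over every component file of the corpus and concatenating the results would give the candidate paragraphs, which then still need a manual check to confirm they are not actually English.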

matyaskopp commented 1 year ago

@nuriabel, and also make sure that you are using the correct model for linguistic annotation

nuriabel commented 1 year ago

Hi Tomaz and Matyas, Yes, I see that we need to redo a number of things for version 3.1. Thanks for all your analysis!! N.

On Thu, 22 Jun 2023 at 11:44, Matyáš Kopp wrote:

@nuriabel, and also make sure that you are using the correct model for linguistic annotation


TomazErjavec commented 1 year ago

@nuriabel, pls. heads up on this issue, we need to finalize the corpora.

nuriabel commented 1 year ago

Dear Tomaz, Rodolfo has solved most of the problems you found. He is now working with the wrong joint right. I'll make sure that the wrong lang=en values are also corrected for the next delivery. Tomorrow I'm officially back from holidays, and this is on my priority list.

TomazErjavec commented 1 year ago

Great, thank you! Looking forward to your polished corpus! :)

TomazErjavec commented 1 year ago

This, as far as I can see, has been fixed, thanks. Closing the issue.