cessda / cessda.cdc.versions

Issue track and wiki for the CESSDA Data Catalogue
https://datacatalogue.cessda.eu/
Apache License 2.0

Language detection tools #374

Open cessda-bitbucket-importer opened 2 years ago

cessda-bitbucket-importer commented 2 years ago

Original report on BitBucket by Taina Jääskeläinen.


CDC has records that do not have language tags. Validation will reject these, but we do not know how many records that affects. There are also records that have the title and abstract in the local language but the vocabulary elements in English. It would be best to place them in the local language catalogue.

Therefore I asked CLARIN whether they could advise on a good language detection tool to be implemented under the hood in CDC. It is definitely their area of expertise. Twan Goosen answered:


You could search the Virtual Language Observatory for 'language detection' and some potentially useful services do come up [1], but for scenarios where we have to integrate language detection into some service our best experience is with Apache Tika [2].

You can see it in action by uploading a file or pasting some text into the Language Resource Switchboard [3] yourself. After submitting something, you will see that the Switchboard pre-selected the detected language. If you would like to have more details about how we use/integrated Tika in that context, let me know and I'll put the developer of the Switchboard in the loop.

Hope this helps!

Best,
Twan

[1] https://vlo.clarin.eu/?q=language+detection
[2] https://tika.apache.org/2.1.0/detection.html#Language_Detection
[3] https://switchboard.clarin.eu


I wonder if Matthew could take a look and see if the tool would work for CDC. I tested it in the Switchboard by copying three sentences from the abstract in all languages, and detection was correct in each case. I think the detection could be done using the title and abstract elements. This would solve a lot of problems even now and would remove the need to reject data based on language issues.
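A minimal sketch of how detection on the title and abstract could decide the catalogue language. Everything here is an assumption for illustration: the record layout (`title`, `abstract`, `lang` keys), the `catalogue_language` function, and the `detect` callable, which stands in for whatever tool is chosen (for example a wrapper around Apache Tika) and is assumed to return a language code plus a flag saying whether it is confident.

```python
from typing import Callable, Optional, Tuple

# Hypothetical detector signature: text in, (language code, confident?) out.
Detector = Callable[[str], Tuple[str, bool]]


def catalogue_language(record: dict, detect: Detector) -> Optional[str]:
    """Pick the catalogue language for a record.

    Assumed policy: trust a confident detection on title + abstract,
    otherwise fall back to the declared language tag (which may be missing).
    """
    declared = record.get("lang")
    # Join whichever of title/abstract are present and non-empty.
    parts = [record.get("title"), record.get("abstract")]
    text = " ".join(p for p in parts if p).strip()
    if not text:
        return declared
    language, confident = detect(text)
    return language if confident else declared
```

Under this policy a record with a missing or English tag but a Finnish title and abstract would land in the Finnish catalogue, and a record would only be left without a language when it has neither a tag nor a confident detection.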

I put Twan’s email here in case Matthew or John would like to contact him for details on how to integrate Tika.

[email address removed]

cessda-bitbucket-importer commented 2 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Pushed back to next release

cessda-bitbucket-importer commented 2 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Taina wrote:

Retried the Apache Tika language detection tool (https://tika.apache.org/2.1.0/detection.html#Language_Detection) recommended by CLARIN and used in their Switchboard (https://switchboard.clarin.eu/).
Detected one failure: it gave the language as Croatian when I copied a Serbian code definition from CVS. But the two languages are basically the same, only Croatian uses Latin script and Serbian Cyrillic. The copied Serbian text was in Latin script, hence the error.
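The script point can be checked independently of any detector. The sketch below is only an illustration using Python's standard `unicodedata` module; the function name `dominant_script` is hypothetical and is not part of Tika or CDC. It reports which script most letters in a text belong to, taken from the first word of each character's Unicode name.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most common script among the letters of `text`.

    The script is read from the first word of the Unicode character name,
    e.g. "CYRILLIC" or "LATIN". Returns "UNKNOWN" if there are no letters.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split(" ")[0]] += 1
    if not counts:
        return "UNKNOWN"
    return counts.most_common(1)[0][0]
```

A Serbian text written in Latin script comes back as `LATIN`, so script alone cannot separate it from Croatian; in that case the declared language tag would probably have to take precedence over detection.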

cessda-bitbucket-importer commented 2 years ago

Original comment by Taina Jääskeläinen.


But all the other languages were detected correctly which was a good result, I think.

Putting this issue on hold. CDC Roadmap needs to be clearer before it can be discussed whether this functionality would be useful.

cessda-bitbucket-importer commented 2 years ago

Original comment by Taina Jääskeläinen.


If this comes up later, we need to test whether the tool can reliably distinguish Finnish from Estonian, at least if any Estonian data is likely.

cessda-bitbucket-importer commented 2 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Updating the version is preferred to setting the status to "on hold".

cessda-bitbucket-importer commented 2 years ago

Original comment by Taina Jääskeläinen.


To review in 2023 whether there are still records in the UI in the wrong language section. If so, reconsider this issue.