cdli-gh / data

This is a copy of the daily dump of catalogue and ATF data from the Cuneiform Digital Library Initiative (http://cdli.ucla.edu)
http://cdli.ucla.edu/bulk_data

ATF files parsing issue #66

Open hohilwik opened 1 year ago

hohilwik commented 1 year ago

I am exploring machine translation for Sumerian and trying to parse the ATF files with existing parsers instead of writing my own. I have tried pyoracc, cdli/atf2tei, and even the parser.py that was added to this repo in a previous pull request, but none of them works correctly and all of them throw errors. Is something wrong with the corpus? If not, how can I fix it without having to manually dig out all the problems?

After fixing a lot of "?" marks at the end of @ broken or other signifiers, most of the remaining problems are empty entries. Is there any way to fix this?
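A pre-cleaning pass along these lines could be sketched as follows. This is only a minimal sketch under stated assumptions, not the behaviour of pyoracc, atf2tei, or any CDLI tool: it assumes each text in the dump starts with a line such as `&P000001 = ...`, that the "?" problems are trailing "?" flags on lines starting with `@`, and that an "empty entry" is a text with no numbered transliteration lines. The names `split_entries`, `has_text_lines`, `clean_entry`, and `clean_dump` are hypothetical helpers made up for this illustration.

```python
import re

# Assumption: every text in the dump begins with a line like "&P000001 = Designation".
ENTRY_START = re.compile(r"^&P\d+")
# Assumption: the problematic "?" is a trailing flag on a structure line starting with "@".
TRAILING_FLAG = re.compile(r"^(@\S.*?)\s*\?+\s*$")
# Assumption: transliteration lines look like "1. ..." or "1'. ..." and never
# start with one of the ATF control characters & @ # $ >.
TEXT_LINE = re.compile(r"^[^&@#$>\s]\S*\.\s+\S")


def split_entries(dump_text):
    """Split a concatenated ATF dump into one string per text."""
    entries, current = [], []
    for line in dump_text.splitlines():
        if ENTRY_START.match(line) and current:
            entries.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        entries.append("\n".join(current))
    return entries


def has_text_lines(entry):
    """Return True if the entry contains at least one transliteration line."""
    return any(TEXT_LINE.match(line) for line in entry.splitlines())


def clean_entry(entry):
    """Strip trailing '?' flags from '@' lines; leave all other lines untouched."""
    return "\n".join(TRAILING_FLAG.sub(r"\1", line) for line in entry.splitlines())


def clean_dump(dump_text):
    """Drop entries without transliteration lines and clean the remaining ones."""
    return [clean_entry(e) for e in split_entries(dump_text) if has_text_lines(e)]
```

Note that stripping the "?" discards the editor's uncertainty marker, so a pass like this is better suited to preparing training data than to correcting the corpus itself. The cleaned entries can then be handed one at a time to whichever parser is in use, which also makes it easier to log the ID of any text that still fails.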

epageperron commented 1 year ago

Hi Shirley, the corpus isn't perfectly clean. If you want a cleaner version than this one, you can use the API client to fetch the most recent texts: https://github.com/cdli-gh/framework-api-client (the server URL is https://cdli.mpiwg-berlin.mpg.de/). If you want to fix parts of the corpus, you are welcome to open an account and submit change suggestions, which we will review and integrate: https://cdli.mpiwg-berlin.mpg.de/register

epageperron commented 1 year ago

Also, please feel free to try out our translation pipeline: https://github.com/cdli-gh/Sumerian-Translation-Pipeline

hohilwik commented 1 year ago

I am actually working on improving that pipeline. Thanks for the suggestion, though! I'll look into the API, try to get the cleaner version, and submit changes whenever I find an error.

Thanks a lot for the info