Open hohilwik opened 1 year ago
Hi Shirley, the corpus isn't perfectly clean. If you want a cleaner version than this one, you can use the API client (https://github.com/cdli-gh/framework-api-client) to fetch the most recent texts; the server URL is https://cdli.mpiwg-berlin.mpg.de/. If you want to fix parts of the corpus, you are welcome to open an account (https://cdli.mpiwg-berlin.mpg.de/register) and submit change suggestions, which we will review and integrate.
Also, please feel free to try out our translation pipeline: https://github.com/cdli-gh/Sumerian-Translation-Pipeline
I am actually working on improving that pipeline. Thanks for the suggestion though! I'll look into the API, try to get the cleaner version, and submit changes whenever I find an error.
Thanks a lot for the info
I am exploring machine translation for Sumerian and trying to parse the ATF files with existing parsers (pyoracc and cdli/atf2tei) instead of writing my own. I also tried the parser.py that was added to this repo in a previous pull request. None of them work correctly; all of them throw errors. Is something wrong with the corpus? If not, how can I fix it without having to manually dig out all the problems?
After fixing a lot of "?" marks at the end of @ broken or other signifiers, most of the remaining problems are empty entries. Is there any way to fix this?
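
One possible way to deal with the empty entries without editing the file by hand is to filter them out before parsing. The following is only a rough sketch, not anything provided by this repo or by the parsers mentioned above. It assumes that each entry starts with a line beginning with `&P` followed by digits, and that transliteration lines start with a line number such as `1.` or `1'.`; the function names (`split_entries`, `has_text`, `drop_empty_entries`) are hypothetical.

```python
import re

# Assumption: every entry begins with a header line such as "&P000001 = ...".
ENTRY_START = re.compile(r"^&P\d+")
# Assumption: a transliteration line starts with a line number ("1." or "1'.")
# followed by at least one non-space character.
TEXT_LINE = re.compile(r"^\d+'?\.\s+\S")


def split_entries(lines):
    """Group raw lines into entries, one list of lines per &P header."""
    entries = []
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if ENTRY_START.match(line):
            current = [line]
            entries.append(current)
        elif current is not None:
            current.append(line)
    return entries


def has_text(entry):
    """Return True if the entry has at least one numbered text line."""
    return any(TEXT_LINE.match(line) for line in entry[1:])


def drop_empty_entries(lines):
    """Return (kept, dropped) entries; dropped ones have no text lines."""
    kept, dropped = [], []
    for entry in split_entries(lines):
        (kept if has_text(entry) else dropped).append(entry)
    return kept, dropped
```

The `dropped` list can be written to a separate file so the skipped headers can be reviewed, or reported as change suggestions, rather than silently discarded.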