cdli-gh / data

This is a copy of the daily dump of catalogue and ATF data from the Cuneiform Digital Library Initiative (http://cdli.ucla.edu)
http://cdli.ucla.edu/bulk_data

Inconsistent tokenization #60

Open MrLogarithm opened 4 years ago

MrLogarithm commented 4 years ago

In the ED IIIb data from Girsu, the tokenization is not consistent. Examples include:

A shell script could probably enumerate more examples.
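
As a rough sketch of what such a script might look like (in Python rather than shell), the snippet below flags sign sequences that occur both as one hyphenated token and as a run of separate tokens. It rests on assumptions that are not confirmed in this thread: that transliteration lines start with a line number such as `1.` or `2'.`, that words are separated by spaces, and that signs within a word are joined by `-`. The function names and the sample text are invented for illustration and are not taken from the dump.

```python
import re
from collections import Counter

# Assumed layout of a transliteration line: "1. tokens ..." or "2'. tokens ...".
# Lines starting with &, @, #, $ and so on do not match and are skipped.
TEXT_LINE = re.compile(r"^\d+'?\.\s+(.*)$")

def text_lines(atf):
    """Yield the list of space-separated tokens for each transliteration line."""
    for line in atf.splitlines():
        match = TEXT_LINE.match(line.strip())
        if match:
            yield match.group(1).split()

def find_inconsistent(atf, max_words=3):
    """Return {spelling: (count as one token, count as separate tokens)}."""
    joined = Counter()  # hyphenated words seen as a single token
    split = Counter()   # runs of adjacent tokens, keyed by their hyphen-joined form
    for tokens in text_lines(atf):
        for token in tokens:
            if "-" in token:
                joined[token] += 1
        for n in range(2, max_words + 1):
            for i in range(len(tokens) - n + 1):
                split["-".join(tokens[i:i + n])] += 1
    return {key: (joined[key], split[key]) for key in joined if key in split}

# Invented sample, not real catalogue data.
SAMPLE = """&P000000 = invented example
@obverse
1. e2-gal lugal
2. e2 gal lugal
"""

print(find_inconsistent(SAMPLE))  # {'e2-gal': (1, 1)}
```

This only produces candidates. Two adjacent words can legitimately spell the same signs as a compound, so each hit would still need to be reviewed by hand.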

Is there a principled way to decide which tokenizations are correct and harmonize all of the spellings?

epageperron commented 4 years ago

Yes, an Assyriologist must look at both, make a decision, and update all the ATF. The bulk upload on our site is broken right now for some obscure reason, so I'll try to fix it soon, and then we can proceed with harmonizing those. Thanks!