cltk / lat_text_perseus

Collected Latin files from the Perseus Digital Library

Revamp parse JSON #3

Closed kigawas closed 2 months ago

kigawas commented 2 months ago


kylepjohnson commented 2 months ago

remove redundant files

So you removed the files ending with .xml.json, but kept the rest? TBH I do not remember why those were in there. Is the formatting/content of those any better than those with just .json?

kigawas commented 2 months ago

The .xml.json file was probably just converted from its corresponding XML file. You may have thought it was easier to process JSON or something.

Since it contains neither more nor less information than the XML, it's better to just remove it. We can handle the XML directly with the lxml library.
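
To illustrate what handling the XML directly could look like, here is a minimal sketch, not code from this PR. It uses the standard-library xml.etree.ElementTree so that it runs without extra dependencies; lxml.etree offers a largely compatible API. The sample document, its tag names, and the helper extract_paragraphs are assumptions made for illustration, and the real Perseus files may be structured differently.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

# Hypothetical TEI-like sample; the real Perseus files are larger and may differ.
SAMPLE_XML = """<TEI>
  <teiHeader><title>Sample Work</title></teiHeader>
  <text><body>
    <div1 type="book" n="1">
      <p>Gallia est omnis divisa in partes tres.</p>
      <p>Hi omnes lingua inter se differunt.</p>
    </div1>
  </body></text>
</TEI>"""


def extract_paragraphs(xml_string):
    """Return the whitespace-normalized text of every <p> element, in document order."""
    root = ET.fromstring(xml_string)
    paragraphs = []
    for p in root.iter("p"):
        # itertext() also collects text nested inside child elements of <p>.
        text = " ".join("".join(p.itertext()).split())
        if text:  # skip paragraphs that contain only whitespace
            paragraphs.append(text)
    return paragraphs


paragraphs = extract_paragraphs(SAMPLE_XML)
```

Reading the text straight from the XML this way would avoid needing an intermediate .xml.json copy for processing.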

kylepjohnson commented 2 months ago

The .xml.json file was probably just converted from its corresponding XML file. You may have thought it was easier to process JSON or something.

Yes, it was, but we do not want to remove any of these. There was a concerted effort to transform all of that XML into JSON so that it would be more readable for developers of web applications. So whatever we do, we cannot delete the JSON files.

I would like to close this PR and for you to start fresh. We do need better parsing of the XML (and you showed that our previous work was not good enough); however, we must parse it for some purpose and into some format (we need to keep the current JSON schema). Thanks in advance for your understanding. We can keep talking here.