cltk / lat_text_perseus

Collected Latin files from the Perseus Digital Library

Fix and revamp JSON parsing #2

Open kigawas opened 2 months ago

kigawas commented 2 months ago

Currently, the XML files are not parsed into JSON correctly.

For example, the text values in minfel.octav_lat.json are null. In other files, corrected words end up mixed with their original (sic) readings in the output JSON, with nothing to tell them apart:

... antehac <corr sic="Constantirus">Constantinus</corr> ...
"...antehac \n Constantinus \n Constantirus ..."
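
One way to keep the two readings apart is sketched below. This is only an illustration, not the current xml_to_json.py: the function name `flatten`, the wrapping `<p>` element, and the `text`/`corrections` output keys are all assumptions. It relies on ElementTree's `itertext()`, which yields element text but never attribute values, so the `sic` reading stays out of the running text and is recorded separately.

```python
import xml.etree.ElementTree as ET


def flatten(elem):
    """Return the running text of elem plus a separate list of corrections.

    Hypothetical helper: the output keys are illustrative, not the repo's schema.
    """
    # itertext() walks text and tails only, so sic="..." never leaks in.
    raw = "".join(elem.itertext())
    # Collapse the newlines and indentation that come from the XML layout.
    text = " ".join(raw.split())
    corrections = [
        {"sic": c.get("sic"), "corr": "".join(c.itertext())}
        for c in elem.iter("corr")
    ]
    return {"text": text, "corrections": corrections}


# The <p> wrapper is assumed; the inner markup is the snippet quoted above.
sample = '<p>antehac <corr sic="Constantirus">Constantinus</corr></p>'
result = flatten(ET.fromstring(sample))
print(result)
```

With this shape the main text carries only the corrected reading, and a downstream user who wants the original spelling can still find it under `corrections`.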

I'd like to help rewrite the parsing script (xml_to_json.py) so that it works properly. Can you add me as a collaborator?

kylepjohnson commented 1 month ago

I appreciate the offer. Sure, go for it!

Those XML files are infamously inconsistent. I'll make you a member of the cltk org, but you can also make pull requests as usual on this repo.

kigawas commented 1 month ago

@kylepjohnson Thanks! I'll propose a PR soon to regenerate JSON for Ammianus/opensource/amm_lat.xml. If it looks good to you, I'll expand it to other files as well.

kigawas commented 1 month ago

@kylepjohnson

You can compare the newly generated file with the old one.

https://github.com/kigawas/lat_text_perseus/blob/revamp-parse/cltk_json/ammianus-marcellinus__rerum-gestarum__latin.json

In the new file:

The only difference is this line: "Hoc Marte Cyzico reserata, Procopius ad eam propere festinavit, veniaque universis qui repugnavere donatis, Serenianum solum iniectis vinculis, iussit duci Nicaeam servandum artissime. 12. Statimque Ormizdae mature iuveni ..." because the original XML file is missing <milestone unit="section" n="12"/> before Statimque.
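
To show why a missing milestone changes the output, here is a minimal sketch of splitting a paragraph into sections on `<milestone unit="section" n="..."/>`. It is not the code in the PR: the function name `split_sections`, the `<p>` wrapper, the shortened sample text, and the section number 11 for the preceding section are assumptions, and it only handles milestones that are direct children of the element passed in.

```python
import xml.etree.ElementTree as ET


def split_sections(parent):
    """Map section number -> text, using direct-child <milestone> markers.

    Hypothetical helper; text before the first milestone is dropped here.
    """
    sections = {}
    current = None
    for child in parent:
        if child.tag == "milestone" and child.get("unit") == "section":
            # A milestone is empty; the section text lives in its tail.
            current = child.get("n")
            sections[current] = child.tail or ""
        elif current is not None:
            # Any other inline element: keep its text and what follows it.
            sections[current] += "".join(child.itertext()) + (child.tail or "")
    # Normalise whitespace left over from the XML layout.
    return {n: " ".join(t.split()) for n, t in sections.items()}


# Shortened sample text; n="11" is assumed for the section before 12.
with_milestone = split_sections(ET.fromstring(
    '<p><milestone unit="section" n="11"/>Hoc Marte Cyzico reserata '
    '<milestone unit="section" n="12"/>Statimque Ormizdae</p>'
))
without_milestone = split_sections(ET.fromstring(
    '<p><milestone unit="section" n="11"/>Hoc Marte Cyzico reserata '
    '12. Statimque Ormizdae</p>'
))
print(with_milestone)
print(without_milestone)
```

When the milestone is absent, the literal "12." stays inside the preceding section's text, which matches the merged line quoted above.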

kylepjohnson commented 1 month ago

@kigawas I closed the last PR (#3). Let's talk about it for a bit before the next one. I think your goal ought to be to parse these better, but keep the output files otherwise the same.

kigawas commented 1 month ago

Are the .xml.json files input or output? Since they contain exactly the same information as the .xml files, it's not necessary to maintain two duplicate copies.

kylepjohnson commented 1 month ago

They are outputs, and they are necessary: the XML is very inconsistent, and it is inconvenient for downstream users to load XML into their databases and applications.
