Continue parsing latin_text_perseus XML files to CLTK JSON data format

lukehollis commented 8 years ago

We have begun converting the XML files of the latin_text_perseus corpus to the JSON data format that will be offered to the reading environment by the API. The files already converted are in the repo's /json directory (https://github.com/cltk/latin_text_perseus/tree/master/json). As we batch-process the remaining XML files, the resulting JSON files should be added to that directory.

lukehollis commented 8 years ago

Cross reference this comment by @kylepjohnson about the status of development for document conversion and how the texts are served by the API. https://github.com/cltk/cltk/issues/134#issuecomment-192423187

modassir commented 8 years ago

@lukehollis can you elaborate on the format of the different types of JSON to be served by the API? As far as I understand, the JSON should have author, meta, title and text keys. So the task is to convert all the XML files to this simple JSON format?

lukehollis commented 8 years ago

Okay, I've got more info on the JSON data format up in the cltk_api wiki: https://github.com/cltk/cltk_api/wiki/JSON-data-format-specifications Could move it to the cltk wiki if things get too confusing.

lukehollis commented 8 years ago

@kylepjohnson @modassir reflecting on this, I think the only parameter that we'd be justified in adding to the converted JSON at this stage is a path and filename of the converted file. For the example of Vergil's Aeneid, something like "Vergil/opensource/verg.a_lat.xml". I don't want to over-optimize, but I think including this would help us with cache invalidation issues whenever a document is updated.

kylepjohnson commented 8 years ago

@lukehollis Your format on the wiki looks good. Anything you or contribs want/need added to it, I'm open to it. My preference is to have all/most metadata contained in an author's json, instead of an index. When we need to generate indices, we can pull just what we need from author files – thus more DRY.

gree-gorey commented 8 years ago

Hi guys, my suggestion is to apply XSLT transformations to the raw XML files to make them easier to convert. My sample XSLT script converted one of the raw XML files into a pretty XML one.

But the question is can we generalize it and apply such a script to all of the raw files? I tried to investigate the structure of these xml files and ways they can differ. Here is a small summary.

There is one problem: the <milestone> tag does not contain any text itself; it only marks the boundary between elements. One way to handle this is to preprocess the files with a regex before the transformation (which is what I did).
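As an alternative to the regex step, the text after an empty <milestone> can be read from the element's tail. A minimal sketch, assuming the standard library's xml.etree.ElementTree rather than XSLT; split_on_milestones and the sample paragraph are made up for illustration and are not part of the corpus tooling:

```python
import xml.etree.ElementTree as ET


def split_on_milestones(parent, unit):
    """Group the text of `parent` by the <milestone unit=...> preceding it.

    Hypothetical helper: returns a dict mapping each milestone's `n`
    attribute to the text that follows it, up to the next milestone
    of the same unit.
    """
    chunks = {}
    current = None  # `n` of the milestone we are currently inside
    for child in parent:
        if child.tag == "milestone" and child.get("unit") == unit:
            current = child.get("n")
            chunks[current] = ""
        elif current is not None:
            # Text inside some other inline element (e.g. <quote>).
            chunks[current] += "".join(child.itertext())
        if current is not None and child.tail:
            # A milestone is empty, so the text after it is its `tail`.
            chunks[current] += child.tail
    # Collapse runs of whitespace left over from the XML layout.
    return {n: " ".join(text.split()) for n, text in chunks.items()}


# Made-up sample in the <p><milestone .../> style discussed in this thread.
sample = ET.fromstring(
    '<p><milestone n="1" unit="chapter"/>Gallia est omnis divisa '
    '<milestone n="2" unit="chapter"/>Hi omnes lingua inter se differunt.</p>'
)
print(split_on_milestones(sample, "chapter"))
```

Text that appears before the first matching milestone is ignored in this sketch; whether that is acceptable depends on the file.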

And I have several questions:

do we have to keep the original structure of the text (whether it consists of chapters or cards or sections), or do we just need to know that there are parent and nested items, and the exact labels are not relevant for us?
there are index xml files, and their structure is completely different from the ones containing texts. What are we to do with them?
there is some special markup, for example, tags for tables, or for cast lists (of a play). What do we do with them?

kylepjohnson commented 8 years ago

@gree-gorey This is very promising! I'll try to answer your questions:

do we have to keep the original structure of the text (whether it consists of chapters or cards or sections), or do we just need to know that there are parent and nested items, and the exact labels are not relevant for us?

Yes, we must keep the original structure of the text, which could be chapters + sections, or books + line numbers, etc. If you don't know the structure of a particular ancient work, you can see what we have done or ask one of us.

there are index xml files, and their structure is completely different from the ones containing texts. What are we to do with them?

No, those aren't important right now. But do send me a link, I haven't looked at these yet.

there is some special markup, for example, tags for tables, or for cast lists (of a play). What do we do with them?

Ugh, that's a hard one! Luke, do you have an opinion? Grigory maybe you could add them as a new field to the json called {'dramatis personae': [list, of, character, names]}

gree-gorey commented 8 years ago

@kylepjohnson

Yes, we must keep the original structure of the text

I see.

No, those aren't important right now. But do send me a link, I haven't looked at these yet.

Here, for example.

Grigory maybe you could add them as a new field to the json called {'dramatis personae': [list, of, character, names]}

Yeah, maybe I could. I will look at them. And what about tables? Like these?

lukehollis commented 8 years ago

I think dramatis_personae sounds good! Re: those tables--those are really weird. Maybe we can just treat them like normal lines of text for the time being?
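A rough sketch of how the dramatis_personae field could be filled, assuming the cast lists are marked up with the TEI elements <castList>, <castItem> and <role> (worth checking against the actual Perseus files). The function name and the sample XML are made up for illustration, and the standard library's xml.etree.ElementTree is used here:

```python
import xml.etree.ElementTree as ET


def extract_dramatis_personae(root):
    """Return the character names found in any <castList> under `root`.

    Hypothetical helper; assumes names sit in <role> elements.
    """
    names = []
    for cast_list in root.iter("castList"):
        for role in cast_list.iter("role"):
            # Join nested text and collapse layout whitespace.
            name = " ".join("".join(role.itertext()).split())
            if name:
                names.append(name)
    return names


# Made-up sample document, not taken from the corpus.
sample = ET.fromstring(
    "<text><front><castList>"
    "<castItem><role>Amphitruo</role></castItem>"
    "<castItem><role>Alcumena</role></castItem>"
    "</castList></front><body><p>...</p></body></text>"
)

doc = {"author": "Plautus", "title": "Amphitruo"}
doc["dramatis_personae"] = extract_dramatis_personae(sample)
print(doc["dramatis_personae"])
```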

Having issues accessing content on cltk_frontend due to the range of different document formats in our JSON. Made a separate issue to address nailing down what those should be: https://github.com/cltk/cltk_api/issues/12

modassir commented 8 years ago

I am currently working on books-chapter, one of the methods I used is:

# `div1`, `dict_text` and `number` come from the surrounding code (not shown).
chapters = div1.findall('p')
for chapter in chapters:
    milestones = chapter.xpath('milestone[@unit="chapter"]')
    if not milestones:
        # Skip <p> elements that carry no chapter milestone.
        continue
    chapter_number = str(milestones[0].get('n'))
    dict_text[number]['chapters'][chapter_number] = {}
    # Each direct text node of the <p> is treated as one section.
    sections = chapter.xpath('text()')
    for count, section in enumerate(sections, 1):
        dict_text[number]['chapters'][chapter_number][count] = section

This method gives us the chapters for XML files that use <p><milestone n="{num}" unit="chapter"> tags, like caes.bg_lat.json. I am now working on the other type of XML file, which uses <div2 type="chapter" n="{num}">.

modassir commented 8 years ago

For <div2 type="chapter" n="{num}">, I am using this method:

import re
from lxml import etree

# `div1`, `dict_text` and `number` come from the surrounding code (not shown).
div2s = div1.findall('div2')
# Only handle books whose <div2> children are chapters.
if div2s and (div2s[0].get('type') or '').lower() == 'chapter':
    for div2 in div2s:
        chapter_number = str(div2.get('n'))
        dict_text[number]['chapters'][chapter_number] = {}
        # Serialize, strip the <p> wrappers and <gap/> markers, then re-parse
        # so that the sections become direct text nodes of the <div2>.
        xml = etree.tostring(div2, encoding='unicode', with_tail=False)
        xml = re.sub(r'</?p>|<gap/>', '', xml).replace('\n', '')
        div2 = etree.fromstring(xml)
        sections = div2.xpath('text()')
        for count, section in enumerate(sections, 1):
            dict_text[number]['chapters'][chapter_number][count] = section

This gives us JSON similar to the above: caes.bc_lat.json.

kylepjohnson commented 8 years ago

@modassir Looks like you're having success with this. Good work.

One small but important correction: We don't want to label "chapters" like you've done here:

"text": {"1": {"chapters": {"35": …

The nesting of text, as I have designed it, is chunking-neutral. By which I mean all texts are structured in a similar way; their structure is expressed by the "meta" tag (e.g., for Caesar's Bellum Civile, which you have linked to, the meta tag would be "book-chapter-section").
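To illustrate the difference, here is a small sketch that removes label keys such as "chapters" from an already-built dict and records the structure in "meta" instead. The helper name to_chunking_neutral, the label list and the placeholder section strings are assumptions for illustration only:

```python
def to_chunking_neutral(labeled, labels=("chapters", "sections")):
    """Recursively drop structural label keys such as "chapters".

    {"1": {"chapters": {"35": {...}}}} becomes {"1": {"35": {...}}}.
    """
    if not isinstance(labeled, dict):
        return labeled
    flat = {}
    for key, value in labeled.items():
        if key in labels and isinstance(value, dict):
            # Hoist the children of the label up one level.
            flat.update(to_chunking_neutral(value, labels))
        else:
            # JSON object keys are strings, so normalize integer counters.
            flat[str(key)] = to_chunking_neutral(value, labels)
    return flat


# Placeholder content; the real values would be the section texts.
labeled_text = {"1": {"chapters": {"35": {1: "first section", 2: "second section"}}}}

doc = {
    "meta": "book-chapter-section",  # the structure lives here, not in the keys
    "text": to_chunking_neutral(labeled_text),
}
print(doc["text"])
```

With this shape, a reader of the JSON walks the same nested levels for every work and consults "meta" to learn what each level is called.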

modassir commented 8 years ago

@kylepjohnson ok sure, i will correct that.

modassir commented 8 years ago

Do we need to keep the text from <note>Some Text</note> ?

kylepjohnson commented 8 years ago

No, we won't want that note text. Thanks
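One thing to watch when dropping notes: in ElementTree-style APIs the text that follows a <note> is stored on the note element itself (its tail), so removing the element naively also removes that text. A minimal sketch using the standard library's xml.etree.ElementTree (not the lxml code posted above); drop_elements is a hypothetical helper name and the sample line is made up:

```python
import xml.etree.ElementTree as ET


def drop_elements(root, tag):
    """Remove every <tag> element under `root`, keeping the text after it."""
    for parent in list(root.iter()):
        previous = None  # last sibling that was kept
        for child in list(parent):
            if child.tag == tag:
                # Reattach the text following the element to the previous
                # kept sibling, or to the parent if there is none.
                if child.tail:
                    if previous is not None:
                        previous.tail = (previous.tail or "") + child.tail
                    else:
                        parent.text = (parent.text or "") + child.tail
                parent.remove(child)
            else:
                previous = child
    return root


sample = ET.fromstring(
    "<p>Arma virumque cano<note>Some Text</note>, Troiae qui primus ab oris</p>"
)
cleaned = drop_elements(sample, "note")
print("".join(cleaned.itertext()))
```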

sameeriitkgp commented 8 years ago

I had a different approach to this, using bs4. I'll try that on the greek perseus.

kylepjohnson commented 8 years ago

@SameerIITKGP I made ticket #17 for you. Let's all stay in touch with how things go.

cltk / cltk_api

Continue parsing latin_text_perseus XML files to CLTK JSON data format #11