Living-with-machines / alto2txt

Convert ALTO XML to plain text + minimal metadata
https://living-with-machines.github.io/alto2txt/
MIT License

Remove dependency on input directory tree structure #30

Open LydiaFrance opened 2 years ago

LydiaFrance commented 2 years ago

Currently the code base has assumptions about the directory tree. Either:

    xml_in_dir
    |-- publication
    |   |-- year
    |   |   |-- issue
    |   |   |   |-- xml_content
    |   |-- year
    |-- publication

Or

    xml_in_dir
    |-- year
    |   |-- issue
    |   |   |-- xml_content
    |-- year

I propose adding the capability to create the output directory tree from the METS file (which contains the publication, year and issue), from the filename alone (which contains the year and issue), or to allow a flat output directory.

This will be useful for people testing the tool with just a single example. It also makes the directory tests more robust.

This will also help with the working example documentation: https://github.com/Living-with-machines/alto2txt/issues/28


Tests to add

xml_to_text_entry.py

xml_to_text.py

andrewphilipsmith commented 2 years ago

👍 I agree that this is a good idea and that it will help with both testing the tool and with "getting started" scenarios.

Ideally, it would be possible to abstract the definition of the directory tree entirely into a config file. I think this could be done with a couple of regexes: one for the METS files and another for the content files. The names of the regexes' groups would be hard coded, but the directory structure would not.

    # An example file path (`0002647` refers to "The Statesman")
    /0002647/1806/0724/0002647_18060724_0001.xml

    # Regex that extracts the metadata from the directory structure:
    (?P<pubid>\d{7})/(?P<year>\d{4})/(?P<month>\d{2})(?P<day>\d{2}).+(?P<pageno>\d{4})\.xml

    # Regex that extracts the metadata from the file name:
    (?P<pubid>\d{7})_(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})_(?P<pageno>\d{4})
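
As a rough illustration of the idea, the two regexes above can be run against the example path with Python's `re` module. This is only a sketch: the `CONFIG` dictionary and the `extract_metadata` helper are hypothetical names, not part of alto2txt, and in practice the patterns would be read from a config file rather than defined inline.

```python
import re

# Hypothetical config: the group names (pubid, year, month, day, pageno) are
# fixed, but the patterns themselves could be swapped for another layout.
CONFIG = {
    "path_regex": (
        r"(?P<pubid>\d{7})/(?P<year>\d{4})/"
        r"(?P<month>\d{2})(?P<day>\d{2}).+(?P<pageno>\d{4})\.xml"
    ),
    "filename_regex": (
        r"(?P<pubid>\d{7})_(?P<year>\d{4})"
        r"(?P<month>\d{2})(?P<day>\d{2})_(?P<pageno>\d{4})"
    ),
}


def extract_metadata(text, pattern):
    """Return the named groups of `pattern` found in `text`, or None."""
    match = re.search(pattern, text)
    return match.groupdict() if match else None


example = "/0002647/1806/0724/0002647_18060724_0001.xml"

# Metadata taken from the directory structure.
from_path = extract_metadata(example, CONFIG["path_regex"])

# Metadata taken from the file name only (last path component).
from_name = extract_metadata(example.rsplit("/", 1)[-1], CONFIG["filename_regex"])
```

For this example both calls yield the same dictionary, which suggests the rest of the code could depend only on the named groups and not on where in the path they were found.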

I think that the tests above cover most things. A couple more thoughts on how this might be implemented:

davanstrien commented 2 years ago

My own two cents: this would be better done as later work, after alto2txt is made public. Although this tool will hopefully be useful for others, it's ultimately not likely to have a huge user base, so it's probably better to define new features based on requests from users rather than trying to anticipate all possible workflows. If you have the capacity to work on this, though, feel free to implement it.

kallewesterling commented 1 year ago

I would like to flag this as important, as I have been planning a blog post with a case study of our alto2txt tool and the BL's publicly released newspapers. The problem, as I have also written on Slack, is that the XML files for the newspapers digitised by the project were uploaded to the BL repository in a different folder structure from the one alto2txt currently expects.

The alternatives for us would be (a) restructuring the files on the repository, but they have already been uploaded so I guess it's a bit late for that, or (b) writing a go-between script that unzips the files and then renames/restructures them to follow the alto2txt standard.

But - for reusability etc. I’d imagine that it’d be good for alto2txt to have a more agnostic approach to the file structure anyway. I.e.: Not all ALTO files will come in the shape of a folder with the collection name in the top level, then NLP, then year, finally month/day and then the XML files in there...
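
For reference, the go-between script mentioned above might look roughly like the following. This is a sketch under assumptions: it skips the unzip step, assumes content files are named like `0002647_18060724_0001.xml` (the example quoted earlier in this thread), and assumes the target layout is `publication/year/monthday`. The `restructure` function and `FILENAME_RE` are hypothetical names, not part of alto2txt.

```python
import re
import shutil
from pathlib import Path

# Assumed file name shape: <pubid>_<yyyymmdd>_<pageno>.xml
FILENAME_RE = re.compile(
    r"(?P<pubid>\d{7})_(?P<year>\d{4})(?P<monthday>\d{4})_\d{4}\.xml$"
)


def restructure(src_root, dst_root):
    """Copy XML files found anywhere under src_root into
    dst_root/<pubid>/<year>/<monthday>/ and return the new paths.

    dst_root should live outside src_root so copies are not re-scanned.
    """
    copied = []
    for src in sorted(Path(src_root).rglob("*.xml")):
        match = FILENAME_RE.search(src.name)
        if match is None:
            continue  # skip files that use a different naming scheme
        target_dir = Path(dst_root, match["pubid"], match["year"], match["monthday"])
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / src.name
        shutil.copy2(src, target)  # copy rather than move, to keep the source intact
        copied.append(target)
    return copied
```

Calling `restructure("unzipped_download", "xml_in_dir")` would then populate `xml_in_dir` with the tree, whatever the folder layout of the download was. Files whose names do not match the assumed pattern are left alone, so they would need separate handling.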