Remove dependency on input directory tree structure

LydiaFrance commented 2 years ago

Currently the code base makes assumptions about the input directory tree. It expects either:

    xml_in_dir
    |-- publication
    |   |-- year
    |   |   |-- issue
    |   |   |   |-- xml_content
    |   |-- year
    |-- publication

Or

    xml_in_dir
    |-- year
    |   |-- issue
    |   |   |-- xml_content
    |-- year

I propose adding the capability to create the output directory tree from the METS file (which contains the publication, year and issue) or simply from the filename (which contains the year and issue), or to allow a flat output directory.
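
A minimal sketch of how the output location could be decoupled from the input layout. Everything here is an assumption for illustration, not existing alto2txt code: the function name `build_output_path`, the metadata keys `publication`, `year` and `issue`, and the `.txt` output suffix are all hypothetical.

```python
from pathlib import Path


def build_output_path(out_dir, xml_name, metadata=None, flat=False):
    """Return where the text for `xml_name` would be written.

    `metadata` is a dict that may hold "publication", "year" and "issue"
    (from the METS file, the filename, or anywhere else). Levels that are
    missing are skipped. With `flat=True` no subdirectories are created.
    """
    target = Path(out_dir)
    if not flat and metadata:
        for key in ("publication", "year", "issue"):
            if metadata.get(key):
                target = target / metadata[key]
    # Hypothetical naming choice: same stem as the input, ".txt" suffix.
    return target / Path(xml_name).with_suffix(".txt").name


meta = {"publication": "0002647", "year": "1806", "issue": "0724"}
tree_path = build_output_path("out", "0002647_18060724_0001.xml", meta)
flat_path = build_output_path("out", "0002647_18060724_0001.xml", meta, flat=True)
print(tree_path)
print(flat_path)
```

Because the function only looks at the metadata it is given, a single test file in an otherwise empty directory would work the same way as a full collection.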

This will be useful for people testing the tool with just a single example. It also makes the directory tests more robust.

This will help the working example documentation. https://github.com/Living-with-machines/alto2txt/issues/28

Tests to add

xml_to_text_entry.py

[ ] Does the directory tree look as expected with root, publication, year, issue?
[ ] Continue, don't automatically fail if not
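
A rough sketch of what "continue, don't automatically fail" could look like: the check logs a warning and returns a boolean instead of raising. The function name `tree_looks_as_expected` and the exact checks (four path components, a four-digit year) are assumptions, not the current behaviour of xml_to_text_entry.py.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def tree_looks_as_expected(xml_file, root):
    """Return True if `xml_file` sits at root/publication/year/issue/file."""
    try:
        parts = Path(xml_file).relative_to(root).parts
    except ValueError:
        # The file is not under `root` at all.
        parts = ()
    ok = len(parts) == 4 and len(parts[1]) == 4 and parts[1].isdigit()
    if not ok:
        # Warn and carry on rather than aborting the whole run.
        logger.warning("Unexpected directory layout for %s; continuing", xml_file)
    return ok


good = tree_looks_as_expected("in/0002647/1806/0724/0002647_18060724_0001.xml", "in")
flat = tree_looks_as_expected("in/0002647_18060724_0001.xml", "in")
```

The caller can then decide whether to fall back to filename or METS metadata when the result is False.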

xml_to_text.py

[ ] Does the filename have metadata in it?
[ ] Do the filename and directory metadata match?
[ ] If no directory tree, does the filename have metadata in it?
[ ] If no directory tree, no filename metadata, can the info come from the mets.xml file?
[ ] If flat directory input, should output be flat or directory tree invented?

Small bugs or quirks to check:

xml_to_text.py

[ ] Paths to input and output files are created redundantly and the logic is unclear

andrewphilipsmith commented 2 years ago

👍 I agree that this is a good idea and that it will help with both testing the tool and with "getting started" scenarios.

Ideally, it would be possible to abstract the definition of the directory tree entirely into a config file. I think this could be done with a couple of regexes: one for the METS files and another for the content files. The regexes' named groups would be hard-coded, but the directory structure would not.

    # An example file path (`0002647` refers to "The Statesman")
    /0002647/1806/0724/0002647_18060724_0001.xml

    # Regex that extracts the metadata from the directory structure:
    (?P<pubid>\d{7})/(?P<year>\d{4})/(?P<month>\d{2})(?P<day>\d{2}).+(?P<pageno>\d{4})\.xml

    # Regex that extracts the metadata from the file name:
    (?P<pubid>\d{7})_(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})_(?P<pageno>\d{4})
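
To make that concrete, here is a small sketch applying the two regexes above to the example path with Python's `re` module. The helper `extract_metadata` and the constant names are hypothetical; in the config-file version the two pattern strings would be read from the config rather than written inline.

```python
import re

# Patterns copied from the comment above. The named groups are the fixed
# interface; the layout they describe is what a config file could override.
PATH_PATTERN = re.compile(
    r"(?P<pubid>\d{7})/(?P<year>\d{4})/(?P<month>\d{2})(?P<day>\d{2})"
    r".+(?P<pageno>\d{4})\.xml"
)
FILENAME_PATTERN = re.compile(
    r"(?P<pubid>\d{7})_(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
    r"_(?P<pageno>\d{4})"
)


def extract_metadata(pattern, text):
    """Return the named groups as a dict, or None if there is no match."""
    match = pattern.search(text)
    return match.groupdict() if match else None


example = "/0002647/1806/0724/0002647_18060724_0001.xml"
from_path = extract_metadata(PATH_PATTERN, example)
from_name = extract_metadata(FILENAME_PATTERN, example.rsplit("/", 1)[-1])
print(from_path)
print(from_name)
```

Comparing `from_path` with `from_name` would also give a cheap answer to the "do the filename and directory metadata match?" test.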

I think that the tests above cover most things. A couple more thoughts on how this might be implemented:

If we go down the config file route, how is the config file identified? A default location? A command-line switch? An environment variable?
What happens if certain metadata fields cannot be obtained from the path or filename? Are there command-line switches to provide the extra information?
Might we want multiple directory structures defined, with alto2txt searching them to find the one appropriate for the available data? E.g. try dir_struct_1, else try dir_struct_2, etc. Would this be user friendly, or could it introduce ambiguities?
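
On the last question, a quick sketch suggests the ambiguity risk is real. The two layout patterns and the names `LAYOUTS`, `matching_layouts` and `detect_layout` below are made up for illustration; they loosely mirror the two directory trees in the issue description.

```python
import re

# Hypothetical layout definitions, tried in the order they are listed.
LAYOUTS = {
    "publication/year/issue": re.compile(
        r"(?P<pubid>\d{7})/(?P<year>\d{4})/(?P<month>\d{2})(?P<day>\d{2})/[^/]+\.xml$"
    ),
    "year/issue": re.compile(
        r"(?P<year>\d{4})/(?P<month>\d{2})(?P<day>\d{2})/[^/]+\.xml$"
    ),
}


def matching_layouts(path):
    """Return (name, metadata) for every layout whose pattern matches."""
    results = []
    for name, pattern in LAYOUTS.items():
        match = pattern.search(path)
        if match:
            results.append((name, match.groupdict()))
    return results


def detect_layout(path):
    """Return the first matching layout, or None if nothing matches."""
    results = matching_layouts(path)
    return results[0] if results else None


full = "/0002647/1806/0724/0002647_18060724_0001.xml"
short = "1806/0724/0002647_18060724_0001.xml"
print([name for name, _ in matching_layouts(full)])
print(detect_layout(short))
```

With unanchored patterns the full path satisfies both layouts, so the result depends on the order in which they are tried. Anchoring the patterns, or warning when more than one layout matches, would be two ways to handle that.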

davanstrien commented 2 years ago

My own two cents are that this would be better done as later work, after alto2txt is made public. Although this tool will hopefully be useful for others, it's ultimately not likely to have a huge user base, so it's probably better to define new features based on requests from users rather than trying to anticipate all possible workflows. If you have the capacity to work on this, though, feel free to implement it.

kallewesterling commented 1 year ago

I would like to raise this as important, as I had been thinking of writing a great blog post with a case study of our alto2txt tool and the BL's publicly released newspapers. The problem, as I have also written on Slack, is that the XML files for the newspapers digitised by the project, as uploaded to the BL repository, have been packaged in a different folder format from the one alto2txt currently expects.

There are two alternatives for us: (a) restructuring the files on the repository, but they have already been uploaded so I guess it's a bit late for that; or (b) writing a go-between script that unzips the files and then renames/restructures them to follow the alto2txt standard.

But for reusability etc., I'd imagine that it'd be good for alto2txt to have a more agnostic approach to the file structure anyway. I.e., not all ALTO files will come in the shape of a folder with the collection name at the top level, then NLP, then year, finally month/day, and then the XML files in there...

Living-with-machines / alto2txt

Remove dependency on input directory tree structure #30
