Open LydiaFrance opened 2 years ago
👍 I agree that this is a good idea and that it will help with both testing the tool and with "getting started" scenarios.
Ideally, it would be possible to entirely abstract the definition of the directory tree into a config file. I think it would be possible to do this with a couple of regexes: one for the METS
files and another for the content files. The regexes' named groups would be hard-coded, but the directory structure would not.
# An example file path (`0002647` refers to "The Statesman")
/0002647/1806/0724/0002647_18060724_0001.xml
# Regex that extracts the metadata from the directory structure:
(?P<pubid>\d{7})/(?P<year>\d{4})/(?P<month>\d{2})(?P<day>\d{2}).+(?P<pageno>\d{4})\.xml
# Regex that extracts the metadata from the file name:
(?P<pubid>\d{7})_(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})_(?P<pageno>\d{4})
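As a quick check, the two regexes above can be run against the example path with Python's `re` module. This is only a sketch: the patterns are copied verbatim from this comment, while the variable names (`DIR_PATTERN`, `FILE_PATTERN`, `dir_meta`, `file_meta`) are made up for illustration and are not part of alto2txt.

```python
import re

# Patterns copied from the comment above; the variable names are hypothetical.
DIR_PATTERN = re.compile(
    r"(?P<pubid>\d{7})/(?P<year>\d{4})/(?P<month>\d{2})(?P<day>\d{2})"
    r".+(?P<pageno>\d{4})\.xml"
)
FILE_PATTERN = re.compile(
    r"(?P<pubid>\d{7})_(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})_(?P<pageno>\d{4})"
)

path = "/0002647/1806/0724/0002647_18060724_0001.xml"

# search() is used instead of match() because the path starts with "/".
dir_meta = DIR_PATTERN.search(path).groupdict()

# The file-name regex only needs the last path component.
file_name = path.rsplit("/", 1)[-1]
file_meta = FILE_PATTERN.search(file_name).groupdict()

print(dir_meta)
print(file_meta)
```

For this example path both regexes yield the same five named groups, so either the directory structure or the file name alone would be enough to recover the metadata.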
I think that the tests above cover most things. A couple more thoughts on how this might be implemented:
dir_struct_1
else try dir_struct_2
etc. Would this be user friendly, or could it introduce ambiguities?
My own two cents are that this would be better done as later work, after alto2txt is made public. Although this tool will hopefully be useful for others, it is ultimately not likely to have a huge user base, so it is probably better to define new features based on requests from users rather than trying to anticipate all possible workflows. If you have the capacity to work on this, though, feel free to implement it.
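For reference, the "try dir_struct_1, else try dir_struct_2" idea could look roughly like the sketch below. Everything here is an assumption made for illustration: `CANDIDATE_PATTERNS` and `extract_metadata` do not exist in alto2txt, and the two layouts are simply the nested and flat forms of the example path given earlier in this thread.

```python
import re
from typing import Optional

# Hypothetical candidate layouts, tried in order. In practice these
# would presumably be read from a config file rather than hard-coded.
CANDIDATE_PATTERNS = [
    # dir_struct_1: <pubid>/<year>/<month><day>/<file>.xml
    re.compile(
        r"(?P<pubid>\d{7})/(?P<year>\d{4})/(?P<month>\d{2})(?P<day>\d{2})/"
        r".+(?P<pageno>\d{4})\.xml$"
    ),
    # dir_struct_2: flat directory, metadata only in the file name
    re.compile(
        r"(?P<pubid>\d{7})_(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
        r"_(?P<pageno>\d{4})\.xml$"
    ),
]


def extract_metadata(path: str) -> Optional[dict]:
    """Return the named groups of the first pattern that matches, else None."""
    for pattern in CANDIDATE_PATTERNS:
        match = pattern.search(path)
        if match:
            return match.groupdict()
    return None
```

On the ambiguity question: in this sketch the first matching pattern wins, so the order of the list decides the outcome whenever a path fits more than one layout. That ordering would need to be documented for users.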
I would like to raise this as important, as I had been thinking of a great blog post with a case study of our alto2txt tool + BL's publicly released newspapers. The problem, as I have also written on Slack, is that the XML files uploaded to the BL repository for the newspapers digitised by the project have been packaged up in a different folder format from the one alto2txt currently expects.
One alternative for us would be (a) restructuring the files on the repository, but they have already been uploaded, so I guess it's a bit late for that. Another possibility would be (b) writing a go-between script that unzips the files and then renames/restructures them to follow the alto2txt standard.
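A minimal sketch of the renaming step in option (b) might look like this. It assumes, without having checked the BL repository, that the unzipped files are named like the `0002647_18060724_0001.xml` example earlier in this thread and that the target layout is `<pubid>/<year>/<month><day>/<file>`. The function name `restructured_path` is hypothetical, and the unzipping and the actual file moves are left out.

```python
import re
from pathlib import PurePosixPath

# Assumed file-name convention: <pubid>_<yyyymmdd>_<pageno>.xml
FILE_PATTERN = re.compile(
    r"(?P<pubid>\d{7})_(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
    r"_(?P<pageno>\d{4})\.xml$"
)


def restructured_path(filename: str) -> PurePosixPath:
    """Map a flat file name to its assumed nested location."""
    match = FILE_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unrecognised file name: {filename}")
    meta = match.groupdict()
    # Month and day share one directory level, as in the example path.
    return PurePosixPath(
        meta["pubid"], meta["year"], meta["month"] + meta["day"], filename
    )
```

A real script would then create the parent directories and move each file to the computed location, for example with `pathlib.Path.mkdir(parents=True, exist_ok=True)` and `shutil.move`.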
But - for reusability etc. I’d imagine that it’d be good for alto2txt to have a more agnostic approach to the file structure anyway. I.e.: Not all ALTO files will come in the shape of a folder with the collection name in the top level, then NLP, then year, finally month/day and then the XML files in there...
Currently the code base has assumptions about the directory tree. Either:
Or
Adding the capability to create the output directory tree using the METS file (which contains the publication, year, issue), or simply the filename (which contains the year and issue), or allowing a flat directory output.
This will be useful for people testing the tool with just a single example. It also makes the directory tests more robust.
This will help the working example documentation. https://github.com/Living-with-machines/alto2txt/issues/28
Tests to add
xml_to_text_entry.py
xml_to_text.py
Small bugs or quirks to check:
xml_to_text.py