Open karindalziel opened 2 years ago
Steps for turning metadata to csv:
It looks like finding deeply nested nodes isn't very easy in Python, so Ruby and Nokogiri will probably be the way to go.
Edit: ElementTree does have some XPath capabilities, so it is possible to find a given tag easily.
Namespaces are kind of tricky, though, because namespaced tags are not easy to find if the namespace changes dynamically.
Solutions are:
1). Explicitly declare the namespace at the beginning: namespace = {'space': 'http://www.tei-c.org/ns/1.0'}
The issue here is that if the namespace changes for other files, this isn't easily reusable. It would work for ardhi, but only ardhi.
2). Ignore the namespace and just find all matching tags: root.findall('.//{*}title')
This works (the {*} wildcard needs Python 3.8 or later), but it may grab unintended titles/tags.
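To compare the two solutions side by side, here is a minimal sketch using Python's standard xml.etree.ElementTree. The sample document is made up for illustration and is not taken from data_ardhi; the variable names are assumptions too. It also shows one possible way to narrow the wildcard search so it does not pick up titles nested elsewhere, and how the namespace could be read from the root tag instead of being hard-coded.

```python
import xml.etree.ElementTree as ET

# Tiny made-up TEI document, only for illustration (not from data_ardhi).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Main title</title>
      </titleStmt>
      <sourceDesc>
        <bibl><title>Title of a cited work</title></bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""

root = ET.fromstring(SAMPLE_TEI)

# Solution 1: declare the namespace up front and use the prefix in the path.
namespace = {"space": "http://www.tei-c.org/ns/1.0"}
explicit = [t.text for t in root.findall(".//space:title", namespace)]

# Solution 2: match any namespace with the {*} wildcard (Python 3.8+).
wildcard = [t.text for t in root.findall(".//{*}title")]

# Both searches return every title, including the one inside <bibl>.
# Anchoring the path to a parent element narrows the match.
main_only = [t.text for t in root.findall(".//{*}titleStmt/{*}title")]

# The namespace can also be read off the root tag, which looks like "{uri}TEI".
detected_uri = root.tag[1:].split("}")[0] if root.tag.startswith("{") else ""

print(explicit)      # ['Main title', 'Title of a cited work']
print(wildcard)      # ['Main title', 'Title of a cited work']
print(main_only)     # ['Main title']
print(detected_uri)  # http://www.tei-c.org/ns/1.0
```

Whether titleStmt is the right parent for the ardhi files is an assumption that would need checking against the actual TEI.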
In the data repository (https://github.com/CDRH/data_ardhi), use the datura scripts to create a CSV with the metadata from the TEI files. As a proof of concept we can just start with id and title, and then add data from there once we have this set up.
This will involve overriding several methods to generate the output file, probably in the file_csv.rb file.
I'm not sure if we have any examples of this already. The links below are to work done in file_csv.rb, but that may not be the only file that needs to be overridden.
Ardhi data repo: https://github.com/CDRH/data_ardhi
TEI: https://github.com/CDRH/data_ardhi/tree/main/source/tei
First steps will be looking at languages to pull the data from the TEI: most likely Ruby or Python.
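For evaluating the Python option, a rough proof-of-concept sketch of the TEI-to-CSV step might look like the following. It is not the datura file_csv.rb approach, only a standalone illustration. The function names (tei_row, write_csv, tei_dir_to_csv), the use of the file name stem as the identifier, and the choice of the first title inside titleStmt are all assumptions made for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET
from pathlib import Path


def tei_row(xml_source, identifier):
    """Return {'id', 'title'} for one TEI document given as str or bytes."""
    root = ET.fromstring(xml_source)
    # First title inside titleStmt, whatever the namespace (Python 3.8+).
    title = root.find(".//{*}titleStmt/{*}title")
    # itertext() also collects text from child elements such as <hi>.
    text = "".join(title.itertext()).strip() if title is not None else ""
    return {"id": identifier, "title": text}


def write_csv(rows, handle):
    """Write the rows to an open text handle as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=["id", "title"])
    writer.writeheader()
    writer.writerows(rows)


def tei_dir_to_csv(tei_dir, csv_path):
    """Hypothetical driver: one CSV row per *.xml file in tei_dir."""
    rows = [
        tei_row(path.read_bytes(), path.stem)  # assumption: file stem as id
        for path in sorted(Path(tei_dir).glob("*.xml"))
    ]
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        write_csv(rows, handle)


# Made-up one-document example, written to an in-memory buffer.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc>'
    "<titleStmt><title>Example <hi>title</hi></title></titleStmt>"
    "</fileDesc></teiHeader></TEI>"
)
buffer = io.StringIO()
write_csv([tei_row(sample, "example.001")], buffer)
print(buffer.getvalue())
```

Against a local checkout, the driver would be called as something like tei_dir_to_csv("source/tei", "metadata.csv"), where the output file name is a placeholder. More columns could be added by extending the dictionary in tei_row and the fieldnames list.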