Create spreadsheet from TEI metadata

karindalziel commented 2 years ago

In the data repository (https://github.com/CDRH/data_ardhi), use the datura scripts to create a CSV with the metadata from the TEI files. As a proof of concept, we can start with just the id and title, and then add more fields once this is set up.

This will involve overriding several methods to generate the output file, probably in the file_csv.rb file.

I'm not sure whether we have any examples of this already. The links below point to work done in file_csv.rb, but other files may need to be overridden as well.

Ardhi data repo: https://github.com/CDRH/data_ardhi
TEI: https://github.com/CDRH/data_ardhi/tree/main/source/tei

The first step will be choosing a language for pulling the data from the TEI: most likely Ruby or Python.

TrisCurd commented 2 years ago

Steps for turning metadata to csv:

Get the list of files we're pulling metadata from by going through the directory
For each file, get the correct metadata
Add that metadata to a list
Once all files are read, turn the list into a CSV file
Save the CSV file to the proper location
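
The steps above can be sketched in Python with only the standard library. This is a rough sketch, not the datura implementation: the function names (extract_metadata, tei_to_csv), the column names, the location of the title inside the teiHeader, and the use of the file name as the id are all assumptions made for illustration.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path

# TEI namespace, as quoted later in this thread for the Ardhi files.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_metadata(path):
    """Return the id and title for one TEI file (hypothetical helper)."""
    root = ET.parse(path).getroot()
    # Assumption: the main title lives in teiHeader/fileDesc/titleStmt/title.
    title = root.find(
        "./tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", TEI_NS
    )
    return {
        # Assumption: the file name without its extension serves as the id.
        "id": Path(path).stem,
        # itertext() also picks up text inside child elements of <title>.
        "title": "".join(title.itertext()).strip() if title is not None else "",
    }


def tei_to_csv(source_dir, output_path):
    """Collect metadata from every .xml file in source_dir and write one CSV."""
    rows = [extract_metadata(p) for p in sorted(Path(source_dir).glob("*.xml"))]
    with open(output_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["id", "title"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Calling tei_to_csv("source/tei", "output.csv") would then cover all five steps; both paths here are placeholders.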

Python tutorial using pandas and xml libraries

TrisCurd commented 2 years ago

It looks like finding deeply nested nodes isn't very easy in Python, so Ruby and Nokogiri will probably be the way to go.

Edit: ElementTree does have some XPath support, so it is possible to find a given tag easily.
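
As a small illustration of that XPath support, the sketch below finds a nested element without spelling out every level. The TEI fragment is made up for the example, and the "tei" prefix is just a local alias. ElementTree only supports a subset of XPath, so more complex expressions may not work.

```python
import xml.etree.ElementTree as ET

# Small made-up TEI fragment, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample title</title></titleStmt>
      <sourceDesc><bibl><date when="1905-03-01">1 March 1905</date></bibl></sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
root = ET.fromstring(SAMPLE)

# "//" searches all descendants, so the intermediate <bibl> can be skipped.
date = root.find(".//tei:sourceDesc//tei:date", NS)
when = date.get("when")

# Attribute predicates are part of the supported XPath subset.
dated = root.findall(".//tei:date[@when]", NS)
```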

TrisCurd commented 2 years ago

Namespaces are kind of tricky because it's not easy to find them if they change from file to file. Possible solutions:

1) Explicitly declare the namespace at the beginning:
namespace = {'space': 'http://www.tei-c.org/ns/1.0'}
The issue here is that if the namespace is different in other files, this isn't easily reusable for everything. It would work for Ardhi, but only Ardhi.

2) Ignore the namespace and just find all the matching tags:
root.findall('.//{*}title')
This works, but could it grab unintended titles/tags?
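
To make the trade-off between the two approaches concrete, here is a small sketch on a made-up TEI fragment. It shows the bare wildcard picking up a second title, one way to narrow it, and a possible middle ground that reads the namespace from the root tag rather than hard-coding it. The helper namespace_of is hypothetical, the sketch assumes the root element is namespaced, and the {*} wildcard needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Made-up TEI fragment with two different <title> elements.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Main title</title></titleStmt>
      <sourceDesc><bibl><title>Title of the source</title></bibl></sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""

root = ET.fromstring(SAMPLE)

# The bare wildcard matches every <title>, wherever it sits.
all_titles = [t.text for t in root.findall(".//{*}title")]

# Anchoring the wildcard search to a parent path narrows it to the main title.
main_title = root.find("./{*}teiHeader/{*}fileDesc/{*}titleStmt/{*}title").text


def namespace_of(element):
    """Return the namespace URI from a '{uri}local' tag, or '' if none."""
    return element.tag[1:].split("}", 1)[0] if element.tag.startswith("{") else ""


# Build the prefix map from the file itself instead of hard-coding the URI.
ns = {"tei": namespace_of(root)}
same_title = root.find(".//tei:titleStmt/tei:title", ns).text
```

Here all_titles contains both titles, while main_title and same_title each hold only the one from titleStmt.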

CDRH / ardhi

Create spreadsheet from TEI metadata #30