A script to convert EMu XML exports to JSON for cultural data. Note: for record sets larger than 1,000 records, this can be slow. It is based on a combination of standards: Darwin Core, Latimer Core, Audubon Core, and the early-draft H2I extension to ABCD. We are currently testing it as a sustainable way to share data with partners from Digital Benin and Mapping Philippine Material Culture.
Export a set of records as XML.
When possible, check that the XML output is well-formed.
xmllint --noout file.xml && echo $?
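Where xmllint is not available, a similar well-formedness check can be sketched with Python's standard library. This is only an illustration, not part of the script; the helper name is_well_formed and the sample XML strings are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Illustrative snippets, not real EMu output
print(is_well_formed("<table><tuple><atom name='irn'>1</atom></tuple></table>"))
print(is_well_formed("<table><tuple></table>"))  # mismatched tags
```

To check a file rather than a string, ET.parse(path) raises the same ET.ParseError on malformed input.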
Set up the two CSVs described in the 'Input' section, following the instructions and examples there.
Clone this repo
Set local environment variables by adding a text file named .env in the root directory of this repo. Open it and follow the .env.example file. More info here if needed.
IN_PATH = the path to your input XML file
MAP_PATH = the path to your emu_conditions.csv
OUT_PATH & LOG_PATH = the paths where you want the output JSON and log files to go
FROM_ADD & TO_ADD1 = the sender and recipient email addresses for notifications
Install Python 3.9 or later. To send email notifications from a server, also install mutt (e.g. see the Ubuntu wiki).
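As a rough illustration of the environment variables listed above, the sketch below shows one possible .env layout and checks that every variable is present. All values are placeholders, and parse_env is a hypothetical stdlib-only helper written for this example; the package list for this repo includes python-decouple, which is the more likely way the script itself reads these settings.

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


# Placeholder values only; replace with your own paths and addresses
EXAMPLE_ENV = """
IN_PATH=data_in/sample.xml
MAP_PATH=emu_conditions.csv
OUT_PATH=data_out/
LOG_PATH=data_out/
FROM_ADD=sender@example.org
TO_ADD1=recipient@example.org
"""

REQUIRED = ["IN_PATH", "MAP_PATH", "OUT_PATH", "LOG_PATH", "FROM_ADD", "TO_ADD1"]

settings = parse_env(EXAMPLE_ENV)
missing = [name for name in REQUIRED if name not in settings]
print("missing variables:", missing)
```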
Install the Python packages listed in required.txt with pip or pip3:
pip3 install charset-normalizer json pandas python-decouple xml xmltodict
Run the script: python3 emu_xml_to_json.py
Alternatively, you can manually specify a different XML-input like so:
python3 emu_xml_to_json.py data_in/2021-08-08/sample.xml
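The two ways of running the script suggest that a path given on the command line takes precedence over IN_PATH. A minimal sketch of that fallback is shown below; resolve_input_path is a hypothetical name, and the actual logic in emu_xml_to_json.py may differ.

```python
import sys


def resolve_input_path(argv, default_path):
    """Return the first command-line argument if given, else default_path."""
    return argv[1] if len(argv) > 1 else default_path


# Assumed default for illustration; in practice this would come from IN_PATH
default_in_path = "data_in/default.xml"
print(resolve_input_path(sys.argv, default_in_path))
```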
The output JSON, XML and log files are zipped and emailed. See the JSON output in emu_to_json.json, or check for errors in xml_log_YYYYMMDD.txt.
An XML file containing records exported from EMu as XML, with some or all EMu-fields listed in emu_fields.csv
A 5-column CSV that maps EMu column names to the corresponding standard-term names, using the following columns:
emu = EMu column names
json_field = the corresponding H2I standard term names
repeatable = blank or 'yes' to indicate whether multiple values can be assigned to json_field
emu_group = the EMu export's 'Group' name, or the table or Reference column name, for the emu field
json_container = the group name for a set of json_fields that should be nested together in the output JSON
A 7-column CSV that defines logic for conditionally redacting or mapping rows in multi-value tables to standard terms:
if_field1 = the input EMu field whose value defines a condition
if_logic1 = the logical comparison for the condition (e.g. whether the field "IS" or "IS NOT" equal to if_value1)
if_value1 = the input value that if_field1 is compared against
then_field = the input EMu field (if any) whose value should be transformed or redacted
json_field = the output json_field that should be set (conditionally) to the value in static_value
static_value = the output value used if an input field matches the conditions in if_field1 & if_value1
json_container = the output field's group, if any
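To make the conditional columns more concrete, here is a minimal sketch of how one row of a multi-value table might be checked against the conditions CSV. It is an illustration under stated assumptions, not the script's actual implementation: the EMu field names (NamType, NamName), the values, and the helper names are all hypothetical, and the sketch only writes static_value to json_field without transforming then_field.

```python
import csv
import io

# Hypothetical conditions, using the 7 columns described above
CONDITIONS_CSV = """if_field1,if_logic1,if_value1,then_field,json_field,static_value,json_container
NamType,IS,Restricted,NamName,name,[redacted],names
NamType,IS NOT,Restricted,NamName,name_status,public,names
"""


def condition_matches(row, cond):
    """Evaluate one IS / IS NOT comparison against an input row."""
    is_equal = row.get(cond["if_field1"], "") == cond["if_value1"]
    logic = cond["if_logic1"].strip().upper()
    if logic == "IS":
        return is_equal
    if logic == "IS NOT":
        return not is_equal
    raise ValueError(f"Unsupported if_logic1: {cond['if_logic1']!r}")


def apply_conditions(row, conditions):
    """Set json_field to static_value for every condition the row matches."""
    out = {}
    for cond in conditions:
        if not condition_matches(row, cond):
            continue
        target = out
        if cond["json_container"]:
            # Nest the field under its container when one is given
            target = out.setdefault(cond["json_container"], {})
        target[cond["json_field"]] = cond["static_value"]
    return out


conditions = list(csv.DictReader(io.StringIO(CONDITIONS_CSV)))
restricted = apply_conditions({"NamType": "Restricted", "NamName": "secret"}, conditions)
public = apply_conditions({"NamType": "Common", "NamName": "drum"}, conditions)
print(restricted)
print(public)
```

In this sketch a restricted row yields a redacted name nested under the names container, while any other row is labelled public; an unrecognised if_logic1 value raises an error rather than being silently ignored.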