Align DHd XML with dh-abstracts XML submission requirements

scottbot commented 3 years ago

@PatrickHelling / @nubuker have uploaded DHd abstracts to GitHub (https://github.com/DHd-Verband), using dhconvalidator, but they don't quite align with our XML schema for upload.

Since I moved jobs I'm no longer @mdlincoln's supervisor and can't put him to the task, and I certainly have no control over work Patrick or Harald does, but perhaps we can work together to align the formats? Feels like this could be an easy win, but I don't know enough XML/XSLT to get the job done. If nobody here has time, perhaps we can loop in another volunteer? Once the formats are aligned, I can do all the import, unique identity reconciliation for authors/institutions already existing in our database, etc.

cc @quinnanya, @nickoal, @ariddell, & @christofs just to keep everyone in the loop

Linking to other GH issue for additional context: #540 (XML format discussion) & #549 (XML samples documenting data submission requirements)

mdlincoln commented 3 years ago

I'd be eager to help get things aligned (albeit with the very limited time I have for non-DAMS migration work right now at CMU 😱)

@PatrickHelling @nubuker take a look at the draft XSD files in #549 - the most important addition compared to the dhconvalidator schema is the stricter structure required for expressing author institutions/affiliations.

PatrickHelling commented 3 years ago

@scottbot / @mdlincoln thank you very much for mentioning this. I wanted to get in touch with you myself after our workshop at the vDHd conference in March (https://vdhd2021.hypotheses.org/). Shortly afterwards we published 917 conference papers from all DHd conferences (2014-2020) on Zenodo as individual publications (https://zenodo.org/communities/dhd/?page=1&size=20). In addition, we have now published the XML files, PDF files, metadata files and publication lists, as you have seen on GH.

It would be really great if we could add the abstracts to your Index of Digital Humanities Conferences. I’ve had a look at #549 and on that basis we are now working on the harmonization of the XML files. However, for two conferences (2014 and 2015) there are no XML files. Maybe a revised version of the metadata files (as seen in the GH repositories) could help to index these abstracts as well?

PatrickHelling commented 3 years ago

@scottbot / @mdlincoln it took a little while because of other tasks and issues, but we worked on some scripts to transform the abstracts of the DHd conferences into the required XML structures. I added an example file (Beispiel_Steyer_AbstractEnhancement.xml) to the repository transform-DHd-IoDHC. Maybe you can have a look at it? I am no expert in XML/XSLT, but I guess it should meet your requirements for the Index of Digital Humanities Conferences. If so, please let me know so that we can transform all the data we have collected and make it available to you.

scottbot commented 3 years ago

Thanks @PatrickHelling! @mdlincoln at your leisure would you mind checking the structure against our requirements, and if it fits, give the green light? From there, I can do the mass import + cleaning.

mdlincoln commented 3 years ago

Thanks for the example files @PatrickHelling! I was on vacation last week, so once I catch up with the backlog at work I will give them a look (probably late this week or early next).

mdlincoln commented 3 years ago

The example file is very close to correct - and in fact, I had to adjust our validation schema to allow elements like xml:id, and fixed a broken link to our example TEI-XML file.

The pull request I made on your repository shows the 3 changes that are needed:

1. You must add xmlns="http://www.tei-c.org/ns/1.0" to the TEI element
2. xml:id cannot start with a number (I found this surprising, but it's in the XMLSchema definition apparently - so it was underlying XMLSchema rules, not our own custom DH Abstracts schema, that was throwing errors here)
3. The title and text elements need to have xml:lang attributes. In this case, xml:lang="de"
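
The three changes above can be checked before running full schema validation. The following is only a rough sketch using Python's standard library, not the project's validator: the function name `check_tei` and the wording of the messages are hypothetical, and it tests only the three points listed here, not the rest of the DH Abstracts schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"  # namespace bound to the xml: prefix


def check_tei(xml_text):
    """Return a list of problems for the three requirements (hypothetical helper)."""
    problems = []
    root = ET.fromstring(xml_text)

    # 1. The root TEI element must be in the TEI namespace.
    if root.tag != f"{{{TEI_NS}}}TEI":
        problems.append("TEI element is not in the TEI namespace")

    for el in root.iter():
        local_name = el.tag.rsplit("}", 1)[-1]

        # 2. xml:id values must not start with a digit.
        xml_id = el.get(f"{{{XML_NS}}}id")
        if xml_id is not None and xml_id[:1].isdigit():
            problems.append(f"xml:id '{xml_id}' starts with a digit")

        # 3. title and text elements need an xml:lang attribute.
        if local_name in ("title", "text") and el.get(f"{{{XML_NS}}}lang") is None:
            problems.append(f"<{local_name}> has no xml:lang attribute")

    return problems
```

An empty list means none of the three problems was found; the file may still fail validation for other reasons.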

Let me know how that sounds.

mdlincoln commented 3 years ago

example valid TEI file: https://dh-abstracts.library.cmu.edu/static/tei/valid_tei/abstract_tei.xml

christofs commented 3 years ago

Awesome! Patrick will know more but we discussed plans to fix your (2) already; the other points should not be a big deal either, except that we may have a few abstracts in English in between and it would make sense to identify them.

scottbot commented 3 years ago

Glad this is doable @christofs! Let me know when you're able to get to it, and I can start importing.

PatrickHelling commented 3 years ago

@mdlincoln Thank you for your feedback. I added a new example file (Beispiel_Steyer_AbstractEnhancement_092021.xml) to the repository transform-DHd-IoDHC. Maybe you can have a look at it again? Hopefully it meets your requirements now. If so, please let me know so that we can transform all the data we have collected and make it available to you.

mdlincoln commented 3 years ago

FYI @PatrickHelling your new xml seems to work just fine 👍

PatrickHelling commented 2 years ago

@mdlincoln we finally added all data from 2016-2020 to the repository transform-DHd-IoDHC. Hopefully the data is now ready for your indexing process. If there are any further problems, please let me know.

mdlincoln commented 2 years ago

@PatrickHelling Wonderful.

One key change: replace http://tei-c.org/ns/2.0/ with http://tei-c.org/ns/1.0/ in all files.

PatrickHelling commented 2 years ago

@mdlincoln we changed http://tei-c.org/ns/1.0/ into http://tei-c.org/ns/2.0/ because otherwise the files are not valid. If it is 1.0, Oxygen XML Editor says that "fileDesc" is incomplete, missing required element "publicationStmt". In addition, some ids are not allowed here. Should we still replace it?

mdlincoln commented 2 years ago

@PatrickHelling can you please confirm which XML schema(s) your Oxygen editor is using to validate? I want to check whether our schema fails to conform to TEI's schemas.

PatrickHelling commented 2 years ago

@mdlincoln I uploaded the files again and replaced http://tei-c.org/ns/2.0/ with http://tei-c.org/ns/1.0 (e.g. folder Data_IoDHC_DHd2020_1.0). I am not sure if I understand it correctly (I am not an expert in XML/XSLT), but when I use xml_IoDHC.xsd (you can find this file in the repo as well) for validation, I still receive some errors. Do the new files meet your requirements? If there are any further problems, please let me know.

mdlincoln commented 2 years ago

I see, you are using a different xsd file than we use for validation. The two files you need to validate against for ingest into our system right now are:

http://dh-abstracts.library.cmu.edu/static/tei/schema/dh_tei.xsd
http://dh-abstracts.library.cmu.edu/static/tei/schema/xml.xsd (this file is imported by dh_tei, so it should be sufficient for them to be in the same directory)

Your xml_IoDHC.xsd schema mandates additional XML tags, which is why you saw errors about publicationStmt. In the future it would probably be ideal to make sure that both our schemas are compatible with each other, though I don't have the time to dedicate to that right now unfortunately!

Please try using our schema. Most of the files are correct but several files in 2018 and 2019 have errors in the author tags.
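
A minimal sketch of how a folder could be validated against these two schema files. It assumes the `xmllint` command-line tool (part of libxml2) is installed and that `dh_tei.xsd` and `xml.xsd` have been downloaded into the same local directory. The helper names and the example paths are hypothetical.

```python
import subprocess
from pathlib import Path


def xmllint_command(schema_path, xml_paths):
    """Build the xmllint call: --noout suppresses the document dump,
    --schema selects the XSD file to validate against."""
    return ["xmllint", "--noout", "--schema", str(schema_path),
            *[str(p) for p in xml_paths]]


def validate_directory(schema_path, directory):
    """Validate every *.xml file in a directory; True if xmllint exits with 0."""
    xml_files = sorted(Path(directory).glob("*.xml"))
    if not xml_files:
        return True  # nothing to validate
    result = subprocess.run(
        xmllint_command(schema_path, xml_files),
        capture_output=True,
        text=True,
    )
    # xmllint writes its per-file validation messages to stderr.
    print(result.stderr)
    return result.returncode == 0
```

For example, `validate_directory("schema/dh_tei.xsd", "Data_IoDHC_DHd2020_1.0")` would check one folder, assuming the schema files were saved under a local `schema/` directory.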

scottbot commented 2 years ago

@PatrickHelling Do you think this is something that could be reconciled on your end? If so, I can start the import soon, as my life is starting to settle post-paternity leave.

PatrickHelling commented 2 years ago

@scottbot @mdlincoln Thank you very much for your support so far with adjusting the XML files of the past DHd conferences. We tried to validate our files against the schema posted by Matthew, but there still seem to be some difficulties and we are not able to identify the reasons why.

Errors that we receive when we try to validate the data against dh_tei.xsd:

Error for type '#AnonType_orgNameaffiliationauthortitleStmtfileDescteiHeaderTEI'. Multiple elements with name 'name', with different types, appear in the model group.
Error for type '#AnonType_textClassprofileDescteiHeaderTEI'. Multiple elements with name 'keywords', with different types, appear in the model group.
"http://www.tei-c.org/ns/1.0":profileDesc and WC[##any] (or elements from their substitution group) violate "Unique Particle Attribution". During validation against this schema, ambiguity would be created for those two particles.

Until now, it seems that we are also not able to reproduce the errors in the author tags in the data of 2018/2019. But at least the data seems to be well formed.

Honestly, I am not sure what we can do to solve the problems. Maybe it is possible to have a closer look at it together and discuss the issue via Zoom?

nubuker commented 2 years ago

@scottbot @mdlincoln @PatrickHelling

Added a minimal correction of one file.

When using xmllint (Ubuntu 21.10):

All files (..._1.0/*.xml) validate against http://dh-abstracts.library.cmu.edu/static/tei/schema/dh_tei.xsd

Exception: 12 files: Validation fails because the mandatory element "author" is not present.

These files describe so-called "panels" (2019/2020) and have no "authors" in the DHd model 'by design'. The panel contributors with their presentations are modeled inline.
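
To illustrate how such panel files could be separated from the rest before import, here is a rough sketch using Python's standard library. It assumes that `author` sits under `teiHeader/fileDesc/titleStmt`, which is inferred from the anonymous type name in the validation errors quoted earlier in this thread and not confirmed against the schema. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Assumed location of author elements, inferred from the error messages above.
AUTHOR_PATH = "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:author"


def has_author(xml_text):
    """True if the document contains at least one author in titleStmt."""
    root = ET.fromstring(xml_text)
    return root.find(AUTHOR_PATH, TEI) is not None


def split_panels(documents):
    """Split a {name: xml_text} mapping into (with_author, without_author) name lists."""
    with_author, without_author = [], []
    for name, xml_text in documents.items():
        (with_author if has_author(xml_text) else without_author).append(name)
    return with_author, without_author
```

Files in the second list would be the candidates for separate handling.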

scottbot commented 2 years ago

Apologies for not responding earlier @nubuker @PatrickHelling. Thank you both so much for working this hard to reconcile your model with ours, and apologies that we don't have the cycles to help. I think the right solution here is to just feed everything into the system except the panels, if they're the only ones breaking the automated import, and then for me to insert the 12 panels manually.

If you point to the latest files, perhaps @mdlincoln can test one or two out to ensure everything is alright on our end, and if all seems in order I can proceed from there?

reborg789 commented 2 years ago

Hey, thanks a lot, no problem. We moved the panel files to the following folders: Data_IoDHC_DHd-2020_1.0_Panel and Data_IoDHC_DHd-2019_1.0_Panel inside of the folders: Data_IoDHC_DHd-2020_1.0 and Data_IoDHC_DHd-2019_1.0. Now the remaining files from the folders Data_IoDHC_DHd-2020_1.0 and Data_IoDHC_DHd-2019_1.0 should be imported into the index without issues.

scottbot commented 2 years ago

Thanks @reborg789!

@mdlincoln Would you have some cycles to do an initial test, and if it passes I can do a bulk ingest?

mdlincoln commented 2 years ago

These files worked well @reborg789. I uploaded all files for 2016, 2017, and 2018, and all files except the panels for 2019 and 2020.

I submitted one correction for a typo in https://github.com/PatrickHelling/transform-DHd-IoDHC/pull/3. I uploaded the corrected version into https://dh-abstracts.library.cmu.edu/works?conference=204 already.

@scottbot I marked the fully-uploaded conferences as "Needs Review" and the ones needing manual panel entry as "Incomplete"

scottbot commented 2 years ago

Thanks @mdlincoln. Now for me to go through and do some data cleaning. I'm noticing weirdness around full text of works in each of the years uploaded (see, e.g., https://dh-abstracts.library.cmu.edu/works/10435) - any idea what's going on there?

mdlincoln commented 2 years ago

hm, looks like python is trying to display a bytes string rather than decoded UTF-8 - taking a look now

mdlincoln commented 2 years ago

I'm rolling back the import and will re-do it - hopefully correctly formatting the full text string this time!

scottbot commented 2 years ago

@mdlincoln Let me know when you're done; I'm already putting a candidate list together for author/affiliation merges.

@nubuker @PatrickHelling @reborg789 Are all the conference abstracts cc-by?

mdlincoln commented 2 years ago

@scottbot alright, decoding and unescaping seems to have run much better with these tweaks - take a look again
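
For readers unfamiliar with this symptom, the sketch below shows the general bytes-versus-text problem in Python. It is not the actual fix made in dhweb_app, which is not shown in this thread. The sample string and the function name are made up, and "unescaping" is assumed here to mean HTML/XML entity unescaping.

```python
import html

# Hypothetical raw full text as it might come out of a file or parser: UTF-8 bytes.
raw = "Über &amp; Daten".encode("utf-8")

# Converting bytes with str() keeps the b'...' wrapper and the escaped byte
# values, which is the kind of "weirdness" that then shows up on a page.
as_displayed = str(raw)


def clean_full_text(raw_bytes):
    """Decode UTF-8 bytes to text, then turn entities such as &amp; into characters."""
    return html.unescape(raw_bytes.decode("utf-8"))


cleaned = clean_full_text(raw)
```

Here `as_displayed` begins with `b'` and contains `\xc3\x9c` in place of the umlaut, while `cleaned` is ordinary readable text.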

scottbot commented 2 years ago

Looks great, thanks @mdlincoln!

PatrickHelling commented 2 years ago

@scottbot thanks a lot! Yes, all abstracts are under cc-by 4.0 International.

scottbot commented 2 years ago

@PatrickHelling perfect! As you already saw, I updated the licenses accordingly and released the abstracts publicly. Thanks to you and everyone else for working through this with us.

PatrickHelling commented 2 years ago

@scottbot and @mdlincoln thanks a lot for the integration of the data and all your support! We added the folder "Data_IoDHC_DHd2022_1.0" to the repository. It contains the data of the last DHd conference 2022. Hopefully the data is ready for your indexing process.

PatrickHelling commented 2 years ago

@scottbot and @mdlincoln we were informed that your index still uses some old URLs to refer to the websites of DHd conferences. To sustain the websites, we moved some of them to another server as static HTML versions. Could you please change the following URLs in your index:

DHd 2017: new https://dhd2017.dig-hum.de
DHd 2019: new https://dhd2019.dig-hum.de
DHd 2020: new https://dhd2020.dig-hum.de

I also would like to ask if you already have had a closer look at the data of the DHd 2022 conference?

cmu-lib / dhweb_app

Align DHd XML with dh-abstracts XML submission requirements #559