cmu-lib / dhweb_app

Application for editing, searching, and browsing historical DH conference abstracts
https://dh-abstracts.library.cmu.edu
MIT License

TEI-XML import format for conference abstracts #540

Open scottbot opened 4 years ago

scottbot commented 4 years ago

Per request (https://twitter.com/s_papastamkou/status/1310902488145506304 & https://twitter.com/marijnkoolen/status/1310590002817126401), we should probably figure out a format for data input. Perhaps it's standard DH conference XML, or perhaps it's long data form, such as:

WorkID | VariableType | Value1 | Value2 | Value3 | Value4
1 | Title | "This is a work"
1 | Author | 1 | "Weingart, Scott" | Libraries | Carnegie Mellon University
1 | Author | 2 | "Lincoln, Matt" | Libraries | Carnegie Mellon University
1 | Keyword | "Networks"
...

Then, theoretically, we could have an easy import, which would necessarily be followed by Nickoal or me cleaning up all the imported data.
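To make the long-format idea concrete, here is a minimal sketch of how such rows could be grouped into one record per work. It is an illustration only: `parse_long_format` is a hypothetical helper, not code from the app, and it assumes that for Author rows Value1 is the author order, Value2 the name, Value3 the department, and Value4 the institution.

```python
from collections import defaultdict

SAMPLE = '''\
WorkID | VariableType | Value1 | Value2 | Value3 | Value4
1 | Title | "This is a work"
1 | Author | 1 | "Weingart, Scott" | Libraries | Carnegie Mellon University
1 | Author | 2 | "Lincoln, Matt" | Libraries | Carnegie Mellon University
1 | Keyword | "Networks"
'''


def parse_long_format(text):
    """Group pipe-delimited long-format rows into one dict per WorkID."""
    works = defaultdict(lambda: {"title": None, "authors": [], "keywords": []})
    lines = [line for line in text.splitlines() if line.strip()]
    for line in lines[1:]:  # skip the header row
        # Naive split: a value containing "|" would break this.
        cells = [cell.strip().strip('"') for cell in line.split("|")]
        work_id, variable_type, values = cells[0], cells[1], cells[2:]
        work = works[work_id]
        if variable_type == "Title":
            work["title"] = values[0]
        elif variable_type == "Author":
            # Assumed column meaning: order, name, department, institution.
            order, name, department, institution = values
            work["authors"].append({
                "order": int(order),
                "name": name,
                "department": department,
                "institution": institution,
            })
        elif variable_type == "Keyword":
            work["keywords"].append(values[0])
    return dict(works)


works = parse_long_format(SAMPLE)
print(works["1"]["title"])
```

A real import would more likely read a CSV or spreadsheet with a proper parser; the point here is only that each row type maps onto one field of a work.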

For new conferences (not yet in the database), we'd also need to specify how to give us data on new conferences.

Perhaps we could have a little public upload spot on GitHub that will act as a queue of work for us, so that even if we don't get to a conference, or only get to it much later, the data will still be publicly available should people want it.

lb42 commented 4 years ago

Is there a standard DH conference XML format? Really? With a schema and everything?

scottbot commented 4 years ago

@lb42 Not formally standardized as far as I know, but ADHO's been relatively consistent in recent years (https://github.com/ADHO/) with xml outputs, so we might want to stick with what works and is becoming normal.

lb42 commented 4 years ago

Sure. I can send you entries for DRHA and for early TEI conferences if you need them.

mdlincoln commented 3 years ago

Picking this up re: dhd-boas-app convo with @scottbot today. The three issues are:

  1. Importing conference / conference series data
  2. Importing abstracts for a conference
  3. Make import code usable from front-end/admin interface

Importing conference / conference series data

I wrote the existing management command import_conferences.py to load the original spreadsheet @scottbot and @nickoal put together (re: #344). Using a spreadsheet with the format specified in that issue should work to bulk load new conference / series data.

Importing abstracts for a conference

I wrote import_dh_xml.py to import folders of TEI following the format in https://github.com/ADHO/. This imports one folder full of TEI for a single conference at a time, taking the database ID of an already-entered conference and then the path to the directory of TEI files.

The ADHO TEI format will correctly separate first and last names, capture work_type from the category element and keywords from the keywords element in that TEI. It gets sticky for matching up affiliations. The affiliation element in ADHO TEI is just a string. The parsing strategy I wrote was:

<affiliation>token1..., token1.3, token1.2, token1.1; token2..., token2.3, token2.2, token2.1</affiliation>

  1. Split affiliation string on ; to process multiple affiliations
  2. Tokenize affiliation string by ,. Starting from the right-most token, try to match
    1. Country
    2. City
    3. Institution
    4. Any remaining tokens are kept together and treated as the "Department" field of an Affiliation linked to any matched Institution

This strategy kept the parsing flexible enough that both "Carnegie Mellon University" as well as "Carnegie Mellon University, Pittsburgh, USA" would get successfully matched.
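The right-to-left matching described above can be sketched roughly as follows. This is not the code in import_dh_xml.py: the lookup sets and function names are hypothetical stand-ins for whatever the importer matches against in the database, and the matching here is exact string comparison only.

```python
# Hypothetical lookup tables; a real importer would query existing records.
KNOWN_COUNTRIES = {"USA", "Colombia"}
KNOWN_CITIES = {"Pittsburgh", "Bogotá"}
KNOWN_INSTITUTIONS = {"Carnegie Mellon University", "Universidad de los Andes"}


def parse_affiliation(segment):
    """Parse one comma-separated affiliation, matching from the right."""
    tokens = [t.strip() for t in segment.split(",") if t.strip()]
    result = {"department": None, "institution": None, "city": None, "country": None}
    # Try country, then city, then institution. The right-most token is
    # consumed only when it matches, so missing fields are simply skipped.
    for field, known in (
        ("country", KNOWN_COUNTRIES),
        ("city", KNOWN_CITIES),
        ("institution", KNOWN_INSTITUTIONS),
    ):
        if tokens and tokens[-1] in known:
            result[field] = tokens.pop()
    # Whatever is left over is kept together as the department.
    # (Simplification: this sketch keeps it even if no institution matched.)
    if tokens:
        result["department"] = ", ".join(tokens)
    return result


def parse_affiliations(raw):
    """Split on ';' to handle multiple affiliations in one string."""
    return [parse_affiliation(seg) for seg in raw.split(";") if seg.strip()]


print(parse_affiliations("Carnegie Mellon University"))
print(parse_affiliations("Carnegie Mellon University, Pittsburgh, USA"))
```

Because each field is optional and checked independently, both the short and the long form of the same institution resolve to the same institution value.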

Make import code usable from the front-end/admin interface

Assuming no changes to the underlying parsing code, this would take 2-4 days of work. The uncertainty depends on how troublesome file upload management proves to be.

scottbot commented 3 years ago

We should provide an exemplar TEI file that someone can match. Additionally, perhaps we can include what the file ought to include to deal with the affiliation element, to separate keywords from topics, etc.

mdlincoln commented 3 years ago

Makes sense. I can work off of one of the existing openly-licensed TEI docs and add in some comments to it.

mdlincoln commented 3 years ago

First draft of a sample TEI-XML file up here: https://github.com/cmu-lib/dhweb_app/blob/master/dh_abstracts/app/abstracts/static/files/abstract_tei.xml

@scottbot lmk if the comments make sense.

lb42 commented 3 years ago

If this is a sample of what you might hope people will produce for you, maybe some comments on its use of TEI would be helpful. So here are a few:

-- the sample has some text in Spanish and some in English. It would be nice to distinguish these (use the xml:lang attribute)
-- the text has lots of elements with @rend="normal", presumably artefacts of some conversion from Word vel sim. Are these much use for anything? Just use <hi> if you want to show italicized text
-- what is the sample <publicationStmt> for? It should contain publication details of the TEI file itself if any. A simple <p> with some content like 'Sample for Name of Project' would do
-- is this meant to conform to a specific TEI Schema or not? "based on the format accepted by ..." is a bit vague!
-- TEI says, under spec for <abstract> "Any abstract already present in the source document should be encoded as a div within the front, as it should for a born-digital document." Not in the <body>. Just sayin.

mdlincoln commented 3 years ago

@lb42 all great points for ADHO that produces this TEI. Scott this should definitely be a convo with... Christof Schöch?

scottbot commented 3 years ago

I'm not entirely certain. @christofs @djakacki we've got some dhconvalidator questions - to whom do we direct them?

mdlincoln commented 3 years ago

@christofs @djakacki a warning: I am not a TEI expert! However I've tried to mock up the kind of structure we would benefit from for parsing author affiliations:

        <author>
          <persName>
            <surname>Afanador-Llach</surname>
            <forename>Maria Jose</forename>
          </persName>
          <affiliation>
            <orgName type="department">Departamento de Historia</orgName>
            <orgName>Universidad de los Andes</orgName>
            <district>Bogotá</district>
            <country>Colombia</country>
          </affiliation>
          <affiliation>
            <orgName>Fundación Histórica Neogranadina</orgName>
            <country>Colombia</country>
          </affiliation>
          <email>mj.afanador28@uniandes.edu.co</email>
        </author>

I would love to know if there are preferred methods for denoting a sub-organization such as "Departamento de Historia" within a larger organization "Universidad de los Andes".
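For a sense of why this structure helps, here is a minimal sketch of reading it with Python's standard library. `read_author` is a hypothetical helper, not the app's importer. It assumes un-namespaced elements as in the mock-up, whereas real TEI files normally declare the TEI namespace (http://www.tei-c.org/ns/1.0), which the element lookups would then need to include.

```python
import xml.etree.ElementTree as ET

AUTHOR_XML = """
<author>
  <persName>
    <surname>Afanador-Llach</surname>
    <forename>Maria Jose</forename>
  </persName>
  <affiliation>
    <orgName type="department">Departamento de Historia</orgName>
    <orgName>Universidad de los Andes</orgName>
    <district>Bogotá</district>
    <country>Colombia</country>
  </affiliation>
  <affiliation>
    <orgName>Fundación Histórica Neogranadina</orgName>
    <country>Colombia</country>
  </affiliation>
</author>
"""


def child_text(parent, path):
    """Return the stripped text of the first match, or None if absent."""
    node = parent.find(path)
    return node.text.strip() if node is not None and node.text else None


def read_author(author):
    """Pull name and structured affiliations out of one <author> element."""
    affiliations = []
    for aff in author.findall("affiliation"):
        department = institution = None
        for org in aff.findall("orgName"):
            # Assumption: an untyped orgName is the parent institution.
            if org.get("type") == "department":
                department = org.text
            else:
                institution = org.text
        affiliations.append({
            "department": department,
            "institution": institution,
            "city": child_text(aff, "district"),
            "country": child_text(aff, "country"),
        })
    return {
        "surname": child_text(author, "persName/surname"),
        "forename": child_text(author, "persName/forename"),
        "affiliations": affiliations,
    }


author = read_author(ET.fromstring(AUTHOR_XML.strip()))
print(author["surname"], len(author["affiliations"]))
```

With each part in its own element, no string splitting or guessing is needed; absent parts (such as the city in the second affiliation) simply come back as None.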

@lb42 raises important issues about the use of the editionStmt and publicationStmt fields - while the DHConvalidator documentation does explain how to set these fields, hosts don't always seem to follow the instructions correctly.

There is also an issue in the way that topics are parsed from ConfTool - see how the term "resource creation and discovery" has been split across two elements:

        <keywords scheme="ConfTool" n="topics">
          <term>archives</term>
          <term>repositories</term>
          <term>sustainability and preservation</term>
          <term>film and media studies</term>
          <term>historical studies</term>
          <term>digitisation</term>
          <term>resource creation</term>
          <term>and discovery</term>
          <term>crowdsourcing</term>
          <term>Spanish</term>
          <term>library &amp; information science</term>
          <term>globalization &amp; digital divides</term>
        </keywords>
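Until the split is fixed at the source, an importer could patch over this particular case with a heuristic. The sketch below is only that: `rejoin_split_terms` is a hypothetical helper that glues a term beginning with "and " or "or " back onto the preceding term. It cannot recover every over-split topic, and matching against the controlled topic list would be more reliable.

```python
def rejoin_split_terms(terms):
    """Merge terms that look like the tail of the previous term."""
    merged = []
    for term in terms:
        # A term starting with a conjunction is treated as a continuation.
        if merged and term.lower().startswith(("and ", "or ")):
            merged[-1] = f"{merged[-1]} {term}"
        else:
            merged.append(term)
    return merged


topics = ["digitisation", "resource creation", "and discovery", "crowdsourcing"]
print(rejoin_split_terms(topics))
```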

I also agree with @lb42 that xml:lang attributes would be very useful for us as well as for others using these TEI-XML files.

scottbot commented 3 years ago

Looping in @jamescummings, since Matt and I are a bit out of our depth with respect to TEI. James, what do you think of our import sample (https://github.com/cmu-lib/dhweb_app/blob/master/dh_abstracts/app/abstracts/static/files/abstract_tei.xml)? And how would we go about offering something more generalized? Perhaps a validator / schema?

lb42 commented 3 years ago

As I noted above, your practice is mostly fine, but if you want to make the tagging more precise and useful, you should certainly consider making a customized schema which people could use to validate their entries. Doing that with TEI tools such as ODD would make it possible to generate generic documentation too.

scottbot commented 3 years ago

@lb42 Thanks! We've never worked in e.g., ODD, so hoping for a little guidance =]

lb42 commented 3 years ago

Any time! ODD is just a way of saying which bits of the TEI you want to use, and how you want to use them. You can be as vague or as fascistic as you like in deciding that, but ODD keeps you honest, and helps someone else build tools that won't be surprised by your data.

It would be useful to know who is steering the ship: is the idea to revise/improve the ADHO schema or to use it as a basis for a new one? What's the target user community for the schema?

scottbot commented 3 years ago

The target user community is people who want to send us abstracts to import into http://dh-abstracts.library.cmu.edu/. We'd like to keep it as close to ADHO's standards as possible, so we can just bulk ingest when a new conference occurs, but perhaps work on this can go towards improving the ADHO schema. But none of us are on any ADHO committees, so that would be up to them.

christofs commented 3 years ago

This is a very useful discussion, thanks for looping me in.

I'm also not on any ADHO committee anymore, but I am involved in some (ongoing) work for EADH for a follow-up technology to the DHConvalidator approach. In that context, work on a more stringent schema would also be useful in order to (a) have a more precise target for format conversion tools and (b) to potentially allow people to submit their own, hand-written TEI.

It would be important, however, not to constrain useful innovations too much, for example inclusion of authority data such as ORCIDs for contributors or other IDs for institutions, or fully-structured bibliographic entries (to name just two obvious candidates for improvements). Currently, afaik, but I haven't checked in a while, the schema is rather loose: https://github.com/ADHO/dhconvalidator/tree/master/src/main/resources/schema

scottbot commented 3 years ago

@christofs As far as the limited needs of the Index of DH Conferences are concerned, if there is extra TEI added later (for ORCIDs or what have you), that wouldn't matter greatly to us, since we can just pull out the information we recognize. So hopefully this wouldn't constrain innovation?

How far along is EADH in this work?

jamescummings commented 3 years ago

Hey @scottbot et al.

I had a quick look at the sample file provided. Most of my comments would echo @lb42's.

BTW, @christofs DHConvalidator seems to have a TEI ODD customisation from which it produced that schema at https://github.com/ADHO/dhconvalidator/blob/master/src/main/resources/schema/dhconvalidator.odd but it seems to be just an old version of the TEI MathML one... with the title changed... which doesn't really seem suitable to me. Given enough sample files we could come up with a TEI ODD that was much more specific, answered some of the requests above, and was tight enough to stop silly things but lax enough to account for variation in practice/hosts/conferences.

But on that sample file I'd note:

Those are the only two actual errors. Suggestions for improvement include:

I mean, clearly, it would be more useful if the component parts of the <bibl> were marked up... dates and titles and authors and such because, well, potential for linking through to DOIs and bibliometric analysis. But I recognise that is asking too much. ;-)

Sorry, this is just me wittering, is there something more specific you want me to comment on?

mdlincoln commented 3 years ago

Thanks so much for your input @jamescummings!

For our purposes, we don't care much about the content of /TEI/text (but @christofs / ADHO should!) My priority is getting structured affiliation data.

> although in real hard core TEI editing you'd just have an author name in with a ref attribute pointing to an element providing all the details about that person, affiliations, etc. I think in this kind of situation the way it is here is perfectly reasonable.

> there is a type attribute of 'department' on orgName but not one for 'institution' or similar?

So is it preferable to do

<author>
  <persName>
    <surname>Afanador-Llach</surname>
    <forename>Maria Jose</forename>
  </persName>
  <affiliation>
    <orgName type="department">Departamento de Historia</orgName>
    <orgName type="institution">Universidad de los Andes</orgName>
    <district>Bogotá</district>
    <country>Colombia</country>
  </affiliation>
</author>

?

lb42 commented 3 years ago

And I, of course, agree with James!

Matthew, can you specify a list of appropriate organisation types? "institution" seems a bit vague.

mdlincoln commented 3 years ago

> Matthew, can you specify a list of appropriate organisation types? "institution" seems a bit vague.

In our underlying model, a department is an optional child of a larger institution, so perhaps it is better to model this relationship structurally in the XML rather than relying on type attributes? I don't know how TEI-land usually handles this.

I can say that in our existing data, institutions cover everything from colleges to universities to national and local libraries and archives to museums, for-profit companies and non-profit orgs, foundations, etc. but we don't categorize them.

mdlincoln commented 3 years ago

For example...?

<org>
    <orgName>Universidad de los Andes</orgName>
    <org>
        <orgName>Departamento de Historia</orgName>
    </org>
</org>

lb42 commented 3 years ago

  1. As James already noted in passing, TEI is big on the difference between a named entity and references to a named entity. In your data you very frequently have (probably) the same entity referenced in different ways, so you (quite correctly) use xxxName to tag a reference to an xxx. You can't mix those references in with xxx elements: you're only allowed one xxx, and you don't want it cluttered up with the references to it anyway. But if you are saying that your "department" thingie is always a subpart of an "institution" thingie, yes, by all means model the subordination in XML, perhaps like this:

    <orgName>
    <name type="main">Universidad de los Andes</name>
    <name type="sub">Departamento de Historia</name>
    </orgName>

    (And I like this better because @type is here being used to categorise the name, not the thing named, as it should be)

  2. If you're using "institution" for absolutely any kind of organisation, that's fine: no more to say. If you do decide one day you'd like to make some distinctions, you could do it with a @type on <orgName>. But I think it would be much better to do that categorisation in a different document -- think of it as the index to the whole collection, in which the "University of the Andes" will be defined once for all.

These encodings are meant to represent an ancient source document faithfully, right? So we wouldn't normalize names or spellings etc. But if we did want to do that, it would be in that index document.

mdlincoln commented 3 years ago

I've made a very lax schema here that meets our minimum requirements: https://github.com/cmu-lib/dhweb_app/blob/master/dh_abstracts/app/abstracts/static/tei/schema/dh_tei.xsd

Because this project doesn't have opinions about what goes in the text element - some of our data contributors won't have fully-structured TEI, just plain text - nor metadata elements like editionStmt, publicationStmt, sourceDesc etc., I've left those branches of the tree totally open-ended for now.

@christofs the key departures from the current DHConvalidator spec are in the affiliation elements for authors, which add structured elements for institutions, sub-institutions/departments, city, and country. If we need to add more wildcard xs:any elements to allow for potential additional attributes for authors or their stated affiliations, let me know what you think!

mdlincoln commented 3 years ago

Also to be done is adding some xs:documentation elements to explain each field

lb42 commented 3 years ago

Hm, sorry I don't grok XSD

scottbot commented 3 years ago

@lb42 If we were creating a schema / structure that you could read (and contribute conference presentations in), what would its ideal format be?

lb42 commented 3 years ago

@scottbot TEI ODD, of course! But ignore my previous not very helpful comment. I'm sure I can work it out.

lb42 commented 3 years ago

Hmm, well the trouble with this xsd schema is that it's invalid on at least two counts: the fairly simple one that attribute xml:lang isn't declared, and the less simple one that it proposes an ambiguous content model, which is more of a show stopper. If you say that an <author> contains a <persName> followed optionally by anything at all, when a parser encounters an <author> containing just a <persName>, it can't tell whether this is meant to match the <persName> or the anything at all branch. And do you really want to encourage people to do things like

<author>
<persName><!-- nicely structured persName here --></persName>
<persName>arbitrary rubbish with<weirdoTags/> all over the place</persName>
</author>

I think not.

But I am an XSD ignoramus, so I may be misunderstanding your intent completely.
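Independently of how the XSD expresses it, the rule at stake here (an <author> holds exactly one <persName>, and it comes first) is easy to test directly. The following is a minimal sketch using only the standard library; `author_problems` is a hypothetical helper and not part of the posted schema or the importer, and it assumes un-namespaced elements as in the snippet above.

```python
import xml.etree.ElementTree as ET


def author_problems(author):
    """Return a list of human-readable problems with one <author> element."""
    problems = []
    children = list(author)
    pers_names = [child for child in children if child.tag == "persName"]
    if len(pers_names) != 1:
        problems.append(f"expected exactly one persName, found {len(pers_names)}")
    elif children[0].tag != "persName":
        problems.append("persName must be the first child of author")
    return problems


bad = ET.fromstring(
    "<author>"
    "<persName><surname>Doe</surname></persName>"
    "<persName>arbitrary rubbish with<weirdoTags/> all over the place</persName>"
    "</author>"
)
print(author_problems(bad))
```

A check like this does not replace schema validation, but it makes the intended constraint explicit and gives contributors a readable error message.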

mdlincoln commented 3 years ago

xml:lang is correctly imported with xs:import - you'll need to download that definition as well (hosted at the same path as our custom one: https://github.com/cmu-lib/dhweb_app/blob/master/dh_abstracts/app/abstracts/static/tei/schema/xml.xsd) to the same directory to validate.

Unless I'm mis-testing, I don't believe the schema is ambiguous for the section you state. Validation correctly throws errors if you try to submit the XML you paste there.

The current locations of the xs:any wildcards are to accommodate further elements as we continue discussions with @christofs / ADHO / DHConvalidator folks.

lb42 commented 3 years ago

Thanks for the clarification: I didn't see the xs:import; all I did was open the file you posted with oXygen and try to make sense of its complaints. Where is the discussion to which you refer happening? Anyway, being at a loose end this afternoon, I started hacking out an ODD which implements what I thought your schema was trying to do. Draft attached for your amusement: if you run this through the standard ODD conversions it generates vaguely plausible doc and schema files: dhCon.zip

mdlincoln commented 3 years ago

@lb42 I appreciate your work especially on the ODD, which is a level up from what I'm used to!

Please don't spend much more time on this until we hear back from the other folks in this ticket - little use trying to make a broader standard until we actually hear from ADHO. This sample XSD is so they have a starting point, and we have a bare minimum for what our application needs. ADHO will have more capacious needs most likely.