acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

Look at DOIs for frontmatter #2724

Open mjpost opened 1 year ago

mjpost commented 1 year ago

We didn't get DOIs in frontmatter for EMNLP 22 or ACL 23.

anthology-assist commented 1 year ago

I thought we do not ingest frontmatter anymore?

mjpost commented 1 year ago

We ingest it if it's there. Most of ACL had frontmatter. We need to check why we didn't assign DOIs. I think we did in the past.

akoehn commented 1 year ago

You changed the logic for them in the DOI generation script recently -- maybe that has something to do with it? Did we change how frontmatters are represented in the XML and then leave the DOI generation script in an outdated state until you changed it?

mjpost commented 1 year ago

DOI ingestion is two steps:

  1. bin/generate_crossref_doi_metadata.py produces a big nasty XML file that we upload and use to generate DOIs
  2. bin/add_dois.py goes through each paper in a volume, checks if its DOI works, and if so, adds it to our XML

What I changed is (2), which was broken because it assumed there was always a <frontmatter> block, which there wasn't for EMNLP 2022, because they never delivered it. I didn't change (1). Looking at past frontmatters, we don't in fact generate a DOI for the volume itself. We probably should.
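For readers following along, here is a minimal sketch of what step (2) amounts to once a missing <frontmatter> block is tolerated. It is not the code in bin/add_dois.py. The function names, the `doi_resolves` callback, the DOI prefix, and the `<volume-id>.<paper-id>` layout are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Tuple

# Assumption: prefix and ID layout are illustrative, not read from the real script.
DOI_PREFIX = "10.18653/v1/"


def iter_entries(volume: ET.Element) -> Iterator[Tuple[str, ET.Element]]:
    """Yield (paper_id, element) pairs; frontmatter counts as paper 0 if present."""
    frontmatter = volume.find("frontmatter")
    if frontmatter is not None:  # do not assume the block exists
        yield "0", frontmatter
    for paper in volume.findall("paper"):
        yield paper.get("id", ""), paper


def add_dois_sketch(
    volume: ET.Element,
    volume_id: str,
    doi_resolves: Callable[[str], bool],
) -> int:
    """Add a <doi> child to every entry whose DOI resolves; return how many were added."""
    added = 0
    for paper_id, element in iter_entries(volume):
        if element.find("doi") is not None:
            continue  # already has a DOI, leave it alone
        doi = f"{DOI_PREFIX}{volume_id}.{paper_id}"
        if doi_resolves(doi):  # hypothetical check, e.g. an HTTP lookup
            ET.SubElement(element, "doi").text = doi
            added += 1
    return added
```

Passing the resolver in as a callback keeps the sketch testable without network access. Because the frontmatter is treated as paper 0 only when it exists, a volume without the block is processed like any other.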

mjpost commented 1 year ago

Actually, though, this reminds me that I also changed the ingestion script (post-EMNLP) to always generate the <frontmatter>. If there's no frontmatter PDF, we still need the block, we just don't generate the <url> tag inside it. We need to add this to EMNLP.

mbollmann commented 1 year ago

Whatever the reasoning for always generating the <frontmatter> block was, I still suspect it's the wrong solution to a problem I don't yet understand.

mjpost commented 1 year ago

<frontmatter> is just the special stub for paper 0. If we don't generate it, then no bibtex is generated for the volume itself. We want to generate this volume bibtex even if there is no PDF. (If there is a PDF, we add a <url> tag within frontmatter, as we do for papers.) This is all a separate issue.
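To make the stub behaviour concrete, a rough sketch follows. It is not the ingestion script itself: `make_frontmatter` is a hypothetical helper, and the `<volume-id>.0` text inside <url> is an assumption about how paper 0 is identified.

```python
import xml.etree.ElementTree as ET


def make_frontmatter(volume_id: str, has_pdf: bool) -> ET.Element:
    """Always build the <frontmatter> stub; add <url> only when a PDF exists."""
    frontmatter = ET.Element("frontmatter")
    if has_pdf:
        # Assumption: the frontmatter is addressed as paper 0 of the volume.
        ET.SubElement(frontmatter, "url").text = f"{volume_id}.0"
    return frontmatter


# Hypothetical volume ID, used only to show the two cases.
with_pdf = make_frontmatter("2023.acl-long", has_pdf=True)
without_pdf = make_frontmatter("2022.emnlp-main", has_pdf=False)
print(ET.tostring(with_pdf, encoding="unicode"))
print(ET.tostring(without_pdf, encoding="unicode"))
```

Either way the block exists, so anything that keys off paper 0 (such as the volume bibtex) has something to read.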

We don't generate DOIs for the complete volume or frontmatter, and haven't for some time. If we want to, we just need to figure it out. I haven't had time to do this. See also #726.

mjpost commented 6 months ago

Re-upping this for this quarter—we should generate DOIs for front matter. This involves: