acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

L18 missing ELRA/LREC native IDs #823

Open leondz opened 4 years ago

leondz commented 4 years ago

The url and pdf fields in L18 both point to the Anthology PDF. For prior LRECs, these both pointed to the LREC-hosted PDF, which, while not without issues, did permit syncing other metadata across the sites. It might be easier, and avoid duplication, if both the ACL and LREC paper IDs were listed in the metadata. Oh, and isn't L20 just around the corner?

(also not a correction, sorry)

mjpost commented 4 years ago

It would be fine to link to the LREC data (feel free to submit a PR to expedite, it's the <url> field in the <meta> block in data/xml/L18.xml, which needs to be changed from the Anth ID to a fully-specified URL).

LREC 20 ingestion is imminent. There has been some confusion and difficulty given LREC's size (30 workshops) and the new ID format.

mbollmann commented 4 years ago

> It would be fine to link to the LREC data (feel free to submit a PR to expedite, it's the <url> field in the <meta> block in data/xml/L18.xml, which needs to be changed from the Anth ID to a fully-specified URL).

But isn't it still preferable for the Anthology to host files whenever possible? It would help with issues like #812.

If we want to add external links in addition to hosting the files ourselves, we'd need to add support for that in the XML.

mjpost commented 4 years ago

We always have them internally, but currently sometimes (I think maybe just for LREC) link to the PDFs externally, per request. In such cases I agree it'd be a good idea to provide both links.

leondz commented 4 years ago

Maybe I missed it in a doc, but what's the difference between url and pdf fields?

mbollmann commented 4 years ago

The XML only knows url, pointing to the PDF.

On the website, "URL" is intended to be the canonical link (the paper's landing page, usually) while "PDF" is the paper PDF itself. The semantics of these were changed following the discussion in #587. (I see they're identical for the externally-hosted papers, which I'm not sure is ideal...)
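
To make the current behaviour concrete, here is a minimal sketch of how the single XML url value could map onto the two links shown on the website. It is not the Anthology's actual build code. The base URL pattern, the function name `resolve_links`, and the example.org address are assumptions made for illustration.

```python
from typing import Dict

# Assumption: landing pages live at <base><ID>/ and PDFs at <base><ID>.pdf.
ANTHOLOGY_BASE = "https://aclanthology.org/"


def resolve_links(paper_id: str, url_field: str) -> Dict[str, str]:
    """Map the single XML url value to the website's "URL" and "PDF" links.

    Hypothetical helper: url_field is either an Anthology ID or a
    fully-specified external URL, as described in this thread.
    """
    if url_field.startswith(("http://", "https://")):
        # Externally hosted paper: both displayed links end up identical.
        return {"url": url_field, "pdf": url_field}
    # Anthology-hosted paper: canonical landing page plus the PDF itself.
    return {
        "url": f"{ANTHOLOGY_BASE}{paper_id}/",
        "pdf": f"{ANTHOLOGY_BASE}{url_field}.pdf",
    }


if __name__ == "__main__":
    print(resolve_links("L18-1001", "L18-1001"))
    # The external address below is a placeholder, not a real LREC URL.
    print(resolve_links("L18-1001", "https://example.org/lrec/1001.pdf"))
```

Under these assumptions, the external case loses the Anthology landing page entirely, which is the behaviour questioned above.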

leondz commented 4 years ago

Given that externally-hosted material can be unreliable (see #812) and that the Anthology stores PDFs locally; that external URLs can be useful, e.g. for disambiguation; and that current practice separates the Anthology landing page URL from the PDF: doesn't this mean there are up to three URLs that make sense to store (pdf, anthology url, external url), but only two fields? I can see that doi can take the role of the external URL for some content (e.g. CL papers, old things in the ACM DL), but what could be done for external content that has a different ID and source URL but no DOI (e.g. RANLP, LREC, NODALIDA, and surely others)? An external_uri field is one solution.

mjpost commented 4 years ago

The Anthology PDF is inferable from its URL, so we only store it once. I like the idea of adding an external or "original" URL, which would ideally point to a landing page, but could just point to a PDF, too, if that's all that's available.
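
A rough sketch of what this could look like: one stored Anthology ID, one optional external link, and the PDF and landing page derived rather than stored. The class name `PaperLinks`, the field name `external_url`, and the URL pattern are all hypothetical, not an existing schema or API.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Assumption: landing pages live at <base><ID>/ and PDFs at <base><ID>.pdf.
ANTHOLOGY_BASE = "https://aclanthology.org/"


@dataclass(frozen=True)
class PaperLinks:
    """Hypothetical record: two stored fields, up to three links."""

    anthology_id: str
    # Proposed "external" or "original" URL; ideally a landing page,
    # but it may point to a PDF if that is all that is available.
    external_url: Optional[str] = None

    @property
    def landing_page(self) -> str:
        return f"{ANTHOLOGY_BASE}{self.anthology_id}/"

    @property
    def pdf(self) -> str:
        # Inferred from the ID, so it is never stored separately.
        return f"{ANTHOLOGY_BASE}{self.anthology_id}.pdf"

    def all_links(self) -> Dict[str, str]:
        links = {"url": self.landing_page, "pdf": self.pdf}
        if self.external_url is not None:
            links["external"] = self.external_url
        return links


if __name__ == "__main__":
    # The external address is a placeholder, not a real LREC URL.
    paper = PaperLinks("L18-1001", "https://example.org/lrec/1001")
    print(paper.all_links())
```

Keeping the external link optional means papers without an original source need no extra metadata, and the Anthology-hosted PDF stays available even if the external site goes down.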