acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

Revision woes #904

Open mjpost opened 4 years ago

mjpost commented 4 years ago

There is a lot of complexity and consequent room for user error in the current revisioning process:

I wonder if we should get rid of <revision> and <erratum> in favor of generic <url> tags with a revision attribute. It would also be nice to avoid renaming at all. This would break the ".pdf" URL shortcut, but maybe that doesn't matter—users would just have to visit the paper page to get the latest revision.

Originally posted by @mjpost in https://github.com/acl-org/acl-anthology/pull/903#issuecomment-655524448

akoehn commented 4 years ago

I wonder if we should get rid of <revision> and <erratum> in favor of generic <url> tags with a revision attribute.

No: We then would need a <url type="revision"> and <url type="erratum"> because they behave differently. A revision is the new default for the PDF button, an erratum is not. IMO we would not gain anything by that change.
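To make the behavioral difference concrete, here is a minimal sketch of how the PDF button target might be chosen. The `Attachment` class, its fields, and `pdf_button_target` are hypothetical names for illustration only; this is not the Anthology's actual code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attachment:
    """Hypothetical record for a <revision> or <erratum> entry."""
    kind: str    # "revision" or "erratum"
    number: int  # the entry's id within its kind
    url: str


def pdf_button_target(base_url: str, attachments: List[Attachment]) -> str:
    """Pick the URL behind the PDF button.

    Only revisions can replace the default PDF; errata are listed
    separately and never become the default.
    """
    revisions = [a for a in attachments if a.kind == "revision"]
    if not revisions:
        return base_url
    return max(revisions, key=lambda a: a.number).url
```

Under these assumptions, an entry with only an erratum keeps its original PDF as the default, while an entry with revisions points to the newest one.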

mbollmann commented 4 years ago

What part of the process leaves room for user error that couldn't be addressed by having a single script run through all these steps?

(NB: I don't know exactly what the script you currently use is doing.)

mjpost commented 4 years ago

It's bin/add_revision.py, which

Today (#903) I made edits to make this work with new-style IDs, and ended up pulling the HTML page and saving it as "v1.pdf". This is fixed now (and I added some sanity checks), but in general it just seems like bad practice to be overwriting files at all. I would have lost them if I didn't have a backup.
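For illustration, a sanity check of this kind could look like the sketch below. It is a hypothetical example, not the check that was actually added to bin/add_revision.py; the function names are assumptions. It relies only on the fact that PDF files begin with the bytes `%PDF-`, which an HTML page does not.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF magic number."""
    return data.startswith(b"%PDF-")


def check_download(data: bytes, source: str) -> bytes:
    """Raise before anything is written to disk if the payload is not a PDF.

    `source` is only used in the error message.
    """
    if not looks_like_pdf(data):
        raise ValueError(f"{source} does not look like a PDF (HTML page?)")
    return data
```

Running the check before writing to disk means a mistakenly fetched HTML page is rejected before it can replace an existing file.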

Re: the <url> tag, it's a little unseemly to have both it and <revision> pointing to the PDF. Do you disagree? It'd be a sweeping change (a possible point against it), but it makes more sense to me to have a <version rev="1"> tag (or something of that sort), whose absence would mean no PDF is available. We would keep <erratum> as is.

mbollmann commented 4 years ago

I have a deja-vu here: https://github.com/acl-org/acl-anthology/issues/295#issuecomment-624684119

mjpost commented 4 years ago

Ah, I thought this felt familiar. Okay, we stick with overwriting the PDF, and I will be more careful.

That leaves the tag discussion. Do you disagree that having <url> and <revision> (once there is one) is redundant?

I wonder if changing <url> to an explicit <pdf revid="1"> link would be more clear. In this scheme, revisions would just add additional <pdf> tags. <erratum> would stay the same. This might also help clarify a minor point of confusion between the notion of URL and PDF.

mbollmann commented 4 years ago

I agree it's not super elegant at the moment, but not fully redundant either. Assume there's one revision, then right now we produce links to [id].pdf, [id]v1.pdf, and [id]v2.pdf. Now, [id].pdf and whatever the latest revision is will always be identical, but we do always want the [id].pdf link. So in that sense, if anything, it's the [id]v2.pdf that's redundant.
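The link pattern described above can be sketched as follows. The function name is hypothetical, and the behavior for zero revisions or for more than one revision is an assumption extrapolated from the single-revision case in the comment.

```python
from typing import List


def current_pdf_links(paper_id: str, num_revisions: int) -> List[str]:
    """List the PDF links produced under the current scheme.

    [id].pdf always exists and mirrors the latest version. Once there
    is at least one revision, versioned copies v1 (the original)
    through v(n+1) (the latest revision) are linked as well.
    """
    links = [f"{paper_id}.pdf"]
    if num_revisions > 0:
        links += [f"{paper_id}v{n}.pdf" for n in range(1, num_revisions + 2)]
    return links
```

With one revision this yields [id].pdf, [id]v1.pdf, and [id]v2.pdf, where the first and last are identical files.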

So what you could do is something like:

<pdf revid="1" file="2020.test-test.42v1" />
<pdf revid="2" file="2020.test-test.42">This is a revision because of xyz.</pdf>

and have the website always link the entry with the highest revid to the big PDF button. But note that there wouldn't be a v2 anymore in this version, and if we added another revision, we'd have to rename 2020.test-test.42 to 2020.test-test.42v2, so if that's really more intuitive ... I don't know.
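A rough sketch of this proposal, including the rename it forces when another revision is added, might look like the following. The `<paper>` wrapper element and all function names are assumptions made for the example; nothing here reflects the actual Anthology schema or scripts.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# The two <pdf> lines come from the proposal; <paper> is an assumed wrapper.
SNIPPET = """
<paper>
  <pdf revid="1" file="2020.test-test.42v1" />
  <pdf revid="2" file="2020.test-test.42">This is a revision because of xyz.</pdf>
</paper>
"""


def read_pdfs(xml_text: str) -> List[Tuple[int, str]]:
    """Return (revid, file) pairs sorted by revid."""
    root = ET.fromstring(xml_text.strip())
    return sorted((int(p.get("revid")), p.get("file")) for p in root.iter("pdf"))


def button_target(pdfs: List[Tuple[int, str]]) -> str:
    """The big PDF button links to the entry with the highest revid."""
    return max(pdfs)[1]


def add_revision(pdfs: List[Tuple[int, str]], paper_id: str) -> List[Tuple[int, str]]:
    """Add a revision: the current latest file takes a versioned name,
    and the new revision takes over the bare paper ID."""
    latest = max(pdfs)[0]
    renamed = [
        (revid, f"{paper_id}v{revid}" if revid == latest else name)
        for revid, name in pdfs
    ]
    return renamed + [(latest + 1, paper_id)]
```

Calling `add_revision(read_pdfs(SNIPPET), "2020.test-test.42")` turns the revid 2 entry into 2020.test-test.42v2 and gives the bare ID to revid 3, which is exactly the rename step that makes the scheme less intuitive.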