FWIW, the old Anthology included a seeding script that simply downloaded everything from the aclweb.org server via HTTP. That's bound to be slower and less efficient than syncing some other way, but it would be pretty simple to recreate this functionality.
Yes, but with that approach, one has to either re-download the whole anthology or implement some book-keeping as to what is new and what isn't (and a list of all downloadable files is needed). Updates to existing files are not propagated, deletions are not noted, and so on.
A read-only rsync server (probably restricted to certain public keys) would be less hacky and more efficient, provided a server on which an rsync daemon can run is available, of course.
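For illustration, a mirror pull could then be a single rsync invocation (a sketch only; the host name, module, and key path below are placeholders, not an existing service):

    # pull everything and delete local files that were removed upstream; host/module are hypothetical
    rsync -avz --delete rsync://anthology-mirror.example.org/anthology/ ./anthology-files/
    # or, restricted to a dedicated ssh key as suggested above
    rsync -avz --delete -e "ssh -i ~/.ssh/anthology_mirror" mirror@example.org:anthology/ ./anthology-files/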
I don't currently have plans for working on it; I just wanted to have an issue for it because the downtime hindered my research :-)
More later, but tagging #156, since mirrors won’t work until we get rid of absolute URLs.
Two random points:

1. The Anthology links to papers without the .pdf extension (see #179 for a reason). A mirror would therefore need to also perform the same redirecting (but that configuration is not publicly available afaik). Alternatively, the generator should get a switch to link to the files including .pdf, to make rewrite rules unnecessary for mirrors.
2. The server could provide a file with checksums of all hosted files (generated with e.g. find . | xargs sha512sum). A simple ~5-liner on the client side could then check the locally available files and download the missing ones (see the sketch below). No additional infrastructure on the server is needed, and an integrity check is included.

The .htaccess file with the rewrite rules is in the repo. It has www.aclweb.org hard-coded; I wonder if that could be generalized to work for mirrors. If not, a mirror setup script could just do a replacement.
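A minimal sketch of such a client-side check (this assumes a sha512sum-style manifest named checksums.sha512 whose paths match the paths served under the anthology URL; both the file name and that layout are assumptions here, not the actual setup):

    # re-download every file that is missing locally or fails its checksum
    # (manifest name and base URL are assumptions, see above)
    sha512sum -c checksums.sha512 2>/dev/null \
      | awk -F': ' '$2 != "OK" {print $1}' \
      | while read -r f; do
          curl --create-dirs -o "$f" "https://www.aclweb.org/anthology/$f"
        done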
I like the on-demand downloading. I can generate the SHA1 hashes later.
Ah, I overlooked it somehow.
Another question regarding mirrors: Are the additional resources (slides, data, code etc.) also mirrored and would that be a problem with licenses? The ACL papers and the COLING ones are licensed under a CC license, no problem there. But I am not sure about the rest.
Hi Arne, all:
We collected those resources in the past without asking for any specific license. I would think going forward we can make the licensing explicit to CC BY 4.0, but that would just be my opinion. - M
Discussion from #333:
@mjpost
I've posted a file with checksums here [14 MB].
@akoehn
I can write the script & create a pull request later today; I am currently on a train with limited bandwidth. Adding the checksums file to the repository seems like a good idea to me.
We could probably save space by encoding the checksums using binary; I don't know whether that is worth the (little) added complexity.
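For a rough sense of the possible saving: a hex-encoded sha512 digest is 128 characters, while the same 64 bytes come to 88 characters in, for example, base64. A quick check (the file name is a placeholder):

    sha512sum paper.pdf | cut -d' ' -f1                           # 128 hex characters
    sha512sum paper.pdf | cut -d' ' -f1 | xxd -r -p | base64 -w0  # 88 base64 characters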
This is a great idea to add these—perhaps as a "checksum" attr on <url>? Thanks for volunteering to do this.
One request, though—can you wait till tomorrow? I need to finish #317, which is very close. It will be a minor pain to have to rebase once more if you push this in first.
@mjpost the file contains .bib files. I assume they should not be mirrored?
No worries, I will not push anything. If we use checksum attrs, the code will be completely different from what I was originally going to implement. I like the approach, as it makes sure every element is mirrorable (and we don't accidentally expose files such as .bib).
As the sha512sum file you posted is quite big, maybe we should settle on a shorter checksum? It is only for the client to check integrity; we do not really need resilience against attacks. sha224 would save us ~3.5 MB in the XML files and is still much more secure than we need.
Ah, yes, the bibs are originals that aren't used. Please do ignore.
I guess I just mean, please be prepared to run this again once I have pushed in the nested format. It shouldn't be that difficult to update but may be easier if you have it in mind that you'll have to.
Yes, a shorter checksum would be great.
I wrote a short bash script that downloads the data. It does not reproduce the file structure on the ACL server, but recreates the structure as it is served (i.e. the /pdf/ directory is not used).
http://arne.chark.eu/tmp/mirror-papers.sh
I am not committing it anywhere, as a real solution based on the XML will probably not share any code with it.
@davidweichiang You can try this to mirror the anthology pdfs. It should be fairly self explanatory.
Thanks. It's better than nothing, right? So should we add it to the repo until someone writes another?
Updated issue to reflect current state. Closes #348.
I'm guessing that both the script above as well as the file of hashes are outdated. It would be great for the hashes to be autogenerated and for the script to become part of the repository.
I could do this fairly quickly. Are we agreed that an optional sha224 attribute on <url> is the best approach?
Yes, it seems to be the safe & future-proof thing to do. The only downside: sha224 is fairly big and would add ~3mb of data. I think that is okay but if we only want a single checksum, crc32 would only add about 400kb.
The schema should be changed so that the attribute is required on relative URIs:
element url {
  (xsd:anyURI { pattern = "/.*" },
   attribute hash { xsd:string { minLength = "56" maxLength = "56" pattern = "[0-9a-f]*" } })
  | xsd:anyURI { pattern = "https?://.*" }
}?  # same for revisions etc.
Completely untested, of course :-)
I thought you suggested sha224 because it's a shorter hash, but it seems longer to me. Using this site to hash "Arne":
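The same comparison with the standard command-line tools (a sketch; the awk just counts the hex characters of each digest):

    printf 'Arne' | md5sum    | awk '{print length($1)}'   # 32 hex characters
    printf 'Arne' | sha224sum | awk '{print length($1)}'   # 56 hex characters
    printf 'Arne' | sha512sum | awk '{print length($1)}'   # 128 hex characters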
So my new proposal is to add an md5 attribute on <url> tags. If I can get consensus I'll do the work.
I only wrote that sha224 is shorter than sha512 :-)
The question is whether we only need a checksum (e.g. to verify that the download was not accidentally corrupted) or whether it should be a cryptographic hash (e.g. to guard against a third party trying to pass a specially crafted PDF as one of the PDFs of the anthology).
If we only need a checksum, crc32 should be sufficient and is even shorter than MD5. If we want a cryptographic hash, we should use at least sha224, as both MD5 and SHA1 have known weaknesses. Maybe just use crc32; we can still distribute the hashes out of band later on for all the paranoid people out there ;-)
I can't imagine a threat case where we need to guard against PDF replacement, though it sounds like the basis of a great entry for Bruce Schneier's (now-defunct) movie plot contests. So in that case I suggest crc32.
Issues such as #730 might be another reason to have checksums.
If @mjpost can produce an up-to-date file with CRC32 checksums, I'm happy to add them to the XML sometime this week. Unless you had already begun making a script for that @akoehn ?
Everything I did is linked here. Would love to work on it, but I unfortunately have no time for the anthology with the current child care closures :-((
If you add it, please add the relaxNG thing I posted above, with length adjusted for crc32 of course.
I can reproduce this shortly. One issue here is that in our current model, revisions overwrite the default paper (e.g., revision two produces P19-1001v2.pdf, which also overwrites P19-1001.pdf, so that we can return the latest version by default). This will complicate checksumming, since the checksums will have to be updated every time we create a revision.
There are two ways we could deal with this:
I have always been dissatisfied with overwriting the PDF in this manner (the original is saved to "v1.pdf", of course) since it creates the potential for error (see #730) and overloads the meaning of the file name.
I would overwrite the main PDF, because I would expect an unversioned link to return the latest version. Another reason is that it is otherwise very hard to programmatically obtain the current version of a paper (Zotero, ebib, ...).
I agree that not overwriting the PDF would be the cleaner solution though :-/
The cleanest solution would probably be to save every version explicitly (v1, v2, ...) and have a symlink from the unversioned file name to the latest version. I don't know.
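A sketch of that layout, using the P19-1001 example from above (the commands are illustrative, not part of any existing ingestion script):

    # every revision is stored under an explicit version number;
    # the unversioned name is just a symlink that is re-pointed on each new revision
    ln -sfn P19-1001v1.pdf P19-1001.pdf   # initially
    ln -sfn P19-1001v2.pdf P19-1001.pdf   # after ingesting revision 2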
+1 to all of what @akoehn said. I think overwriting is the most practical solution.
We could make sure to also update all the scripts for ingestion, adding revisions, etc. so that they automatically compute and add/update the checksums.
But that said, the existence of the v1 version is never made explicit anywhere in the XML file. Maybe we should change that, also so there actually is something to attach the checksum for that file to.
The checksums are here, if someone wants to take a shot at this:
Thanks @mjpost! What do we do with files that are currently missing (as per #264)? If we require the checksum in the RelaxNG schema as @akoehn suggested (and I think we should do that), the validation would fail. Should I add a dummy value (e.g. 00000000) for now, or how do we want to handle this?
Also, if there are no objections, I would go ahead and add <revision id="1" ...> entries for all papers that currently have revisions, so we can actually record the checksums of the original versions somewhere.
Sounds good to me. Maybe we can use a regexp à la [0-9a-f]|x instead and mark missing ones as xxxxxxxxxx? Then there would be no collision possibility.
I like the idea of adding a <revision id="1" ...> tag.
Should we fix #264 first? It should be easy to remove <url> lines where there is no paper being linked to, since this should be generating an error anyway.
@mjpost Can you share checksums of the IWPT files (and possibly EAMT, if they're gonna be merged soon)? I've got everything ready to add checksums when I find a minute, but I need complete coverage of course.
Here are IWPT. I think your changes will be ready before EAMT, and I'm still debugging the process, so I'll add the checksums after merging in your changes, if that's okay.
@mbollmann I just updated that file with volume-level PDFs (#810), you may wish to add those, too.
Once we have mirroring working, and a permanent mirror in place, it would also be nice to set up a workflow to publish a live version of every branch, for previewing. We could define a permanent mirror with full site functionality (e.g., aclanthology.org), and with live branches at, say, aclanthology.org/dev/{branch_name}.
tadaa: http://aclanthology.lst.uni-saarland.de/anthology/
attachments are not on the mirror atm, but that is purely due to miscalculated disk space and can be fixed. Will post the code soon.
Two observations:
Re the mirror for dev: sure, that is even easier because the papers do not need to be mirrored (and that is what the work was mostly about)
What I forgot: please click around and test whether anything is broken.
The mirroring seems to work—but what about changing the prefix? Using both a sub-domain and a top-level directory is redundant. Would this build at the root level instead of under /anthology?
I think we are (finally) done with this, after merging #1124
As the anthology is currently not reachable (.htaccess bug?), I want to renew this topic. There are already two issues regarding this (#22 and #28), but both are more about sharing the old rails application than sharing the data.
It would be great to be able to create a mirror by rsync-ing (or similar) the underlying data that is not in this repository. IIRC, the code change needed to host a mirror under a different URL should be minimal. 35 GB of data seems small enough to be easy to host.
TODO: