FWIW, the old Anthology included a seeding script that simply downloaded everything from the aclweb.org server via HTTP. That's bound to be slower and less efficient than syncing some other way, but it would be pretty simple to recreate this functionality.
Yes, but with that approach, one has to either re-download the whole anthology or implement some book-keeping as to what is new and what isn't (and a list of all downloadable files is needed). Updates to existing files are not propagated, deletions are not noted, and so on.
A read-only rsync server (probably restricted to certain public keys) would be less hacky and more efficient, provided a server on which an rsync daemon can run is available, of course.
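For illustration, a mirror pull could then be a single rsync invocation (a sketch only; the host name, module, and key path below are placeholders, not an existing service):

    # pull everything and delete local files that were removed upstream; host/module are hypothetical
    rsync -avz --delete rsync://anthology-mirror.example.org/anthology/ ./anthology-files/
    # or, restricted to a dedicated ssh key as suggested above
    rsync -avz --delete -e "ssh -i ~/.ssh/anthology_mirror" mirror@example.org:anthology/ ./anthology-files/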
I don't currently have plans for working on it; I just wanted to have an issue for it because the downtime hindered my research :-)
More later, but tagging #156, since mirrors won’t work until we get rid of absolute URLs.
Two random points:

1. The Anthology links to papers without the .pdf extension (see #179 for a reason). A mirror would therefore need to also perform the same redirecting (but that configuration is not publicly available afaik). Alternatively, the generator should get a switch to link to the files including .pdf, to make rewrite rules unnecessary for mirrors.
2. The server could provide a file with checksums of all hosted files (generated with e.g. find . | xargs sha512sum). A simple ~5-liner on the client side could then check the locally available files and download the missing ones (see the sketch below). No additional infrastructure on the server is needed, and an integrity check is included.

The .htaccess file with the rewrite rules is in the repo. It has www.aclweb.org hard-coded; I wonder if that could be generalized to work for mirrors. If not, a mirror setup script could just do a replacement.
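A minimal sketch of such a client-side check (this assumes a sha512sum-style manifest named checksums.sha512 whose paths match the paths served under the anthology URL; both the file name and that layout are assumptions here, not the actual setup):

    # re-download every file that is missing locally or fails its checksum
    # (manifest name and base URL are assumptions, see above)
    sha512sum -c checksums.sha512 2>/dev/null \
      | awk -F': ' '$2 != "OK" {print $1}' \
      | while read -r f; do
          curl --create-dirs -o "$f" "https://www.aclweb.org/anthology/$f"
        done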
I like the on-demand downloading. I can generate the SHA1 hashes later.
Ah, I overlooked it somehow.
Another question regarding mirrors: Are the additional resources (slides, data, code etc.) also mirrored and would that be a problem with licenses? The ACL papers and the COLING ones are licensed under a CC license, no problem there. But I am not sure about the rest.
Hi Arne, all:
We collected those resources in the past without asking for any specific license. I would think going forward we can make the licensing explicit to CC BY 4.0, but that would just be my opinion. - M
Discussion from #333:
@mjpost
I've posted a file with checksums here [14 MB].
@akoehn
I can write the script & create a pull request later today; I am currently on a train with limited bandwidth. Adding the checksums file to the repository seems like a good idea to me.
We could probably save space by encoding the checksums using binary; I don't know whether that is worth the (little) added complexity.
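For a rough sense of the possible saving: a hex-encoded sha512 digest is 128 characters, while the same 64 bytes come to 88 characters in, for example, base64. A quick check (the file name is a placeholder):

    sha512sum paper.pdf | cut -d' ' -f1                           # 128 hex characters
    sha512sum paper.pdf | cut -d' ' -f1 | xxd -r -p | base64 -w0  # 88 base64 characters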
This is a great idea to add these—perhaps as a "checksum" attr on <url>? Thanks for volunteering to do this.
One request, though—can you wait till tomorrow? I need to finish #317, which is very close. It will be a minor pain to have to rebase once more if you push this in first.
@mjpost the file contains .bib files. I assume they should not be mirrored?
No worries, I will not push anything. If we use checksum attrs, the code will be completely different from what I was originally going to implement. I like the approach, as it makes sure every element is mirrorable (and we don't accidentally expose files such as .bib).
As the sha512sum file you posted is quite big, maybe we should settle on a shorter checksum? It is only for the client to check integrity; we do not really need resilience against attacks. sha224 would save us ~3.5 MB in the XML files and is still much more secure than we need.
Ah, yes, the bibs are originals that aren't used. Please do ignore.
I guess I just mean, please be prepared to run this again once I have pushed in the nested format. It shouldn't be that difficult to update but may be easier if you have it in mind that you'll have to.
Yes, a shorter checksum would be great.
I wrote a short bash script that downloads the data. It does not reproduce the file structure on the ACL server, but recreates the structure as it is served (i.e. the /pdf/ directory is not used).
http://arne.chark.eu/tmp/mirror-papers.sh
I am not committing it anywhere, as a real solution based on the XML will probably not share any code with it.
@davidweichiang You can try this to mirror the anthology pdfs. It should be fairly self explanatory.
Thanks. It's better than nothing, right? So should we add it to the repo until someone writes another?
Updated issue to reflect current state. Closes #348.
I'm guessing that both the script above as well as the file of hashes are outdated. It would be great for the hashes to be autogenerated and for the script to become part of the repository.
I could do this fairly quickly. Are we agreed that an optional sha224 attribute on <url> is the best approach?
Yes, it seems to be the safe & future-proof thing to do. The only downside: sha224 is fairly big and would add ~3mb of data. I think that is okay but if we only want a single checksum, crc32 would only add about 400kb.
The schema should be changed so that the attribute is required on relative URIs:
element url {
  (xsd:anyURI { pattern = "/.*" },
   attribute hash { xsd:string { minLength = "56" maxLength = "56" pattern = "[0-9a-f]*" } })
  | xsd:anyURI { pattern = "https?://.*" }
}?  # same for revisions etc.
Completely untested, of course :-)
I thought you suggested sha224 because it's a shorter hash, but it seems longer to me. Using this site to hash "Arne":
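The same comparison with the standard command-line tools (a sketch; the awk just counts the hex characters of each digest):

    printf 'Arne' | md5sum    | awk '{print length($1)}'   # 32 hex characters
    printf 'Arne' | sha224sum | awk '{print length($1)}'   # 56 hex characters
    printf 'Arne' | sha512sum | awk '{print length($1)}'   # 128 hex characters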
So my new proposal is to add an md5 attribute on <url> tags. If I can get consensus I'll do the work.
I only wrote that sha224 is shorter than sha512 :-)
The question is whether we only need a checksum (e.g. to verify that the download was not accidentally corrupted) or whether it should be a cryptographic hash (e.g. to guard against a third party trying to pass a specially crafted PDF as one of the PDFs of the anthology).
If we only need a checksum, crc32 should be sufficient and is even shorter than MD5. If we want a cryptographic hash, we should use at least sha224, as both MD5 and SHA1 have known weaknesses. Maybe just use crc32; we can still distribute the hashes out of band later on for all the paranoid people out there ;-)
I can't imagine a threat case where we need to guard against PDF replacement, though it sounds like the basis of a great entry for Bruce Schneier's (now-defunct) movie plot contests. So in that case I suggest crc32.
Issues such as #730 might be another reason to have checksums.
If @mjpost can produce an up-to-date file with CRC32 checksums, I'm happy to add them to the XML sometime this week. Unless you had already begun making a script for that @akoehn ?
Everything I did is linked here. Would love to work on it, but I unfortunately have no time for the anthology with the current child care closures :-((
If you add it, please add the relaxNG thing I posted above, with length adjusted for crc32 of course.
I can reproduce this shortly. One issue here is that in our current model, revisions overwrite the default paper (e.g., revision two produces P19-1001v2.pdf, which also overwrites P19-1001.pdf, so that we can return the latest version by default). This will complicate checksumming, since the checksums will have to be updated every time we create a revision.
There are two ways we could deal with this:
I have always been dissatisfied with overwriting the PDF in this manner (the original is saved to "v1.pdf", of course) since it creates the potential for error (see #730) and overloads the meaning of the file name.
I would overwrite the main PDF, because I would expect an unversioned link to return the latest version. Another reason is that it is otherwise very hard to programmatically obtain the current version of a paper (Zotero, ebib, ...).
I agree that not overwriting the PDF would be the cleaner solution though :-/
The cleanest solution would probably be to save every version explicitly (v1, v2, ...) and have a symlink from the unversioned file name to the latest version. I don't know.
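A sketch of that layout, using the P19-1001 example from above (the commands are illustrative, not part of any existing ingestion script):

    # every revision is stored under an explicit version number;
    # the unversioned name is just a symlink that is re-pointed on each new revision
    ln -sfn P19-1001v1.pdf P19-1001.pdf   # initially
    ln -sfn P19-1001v2.pdf P19-1001.pdf   # after ingesting revision 2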
+1 to all of what @akoehn said. I think overwriting is the most practical solution.
We could make sure to also update all the scripts for ingestion, adding revisions, etc. so that they automatically compute and add/update the checksums.
But that said, the existence of the v1 version is never made explicit anywhere in the XML file. Maybe we should change that, also so there actually is something to attach the checksum for that file to.
The checksums are here, if someone wants to take a shot at this:
Thanks @mjpost! What do we do with files that are currently missing (as per #264)? If we require the checksum in the RelaxNG schema as @akoehn suggested (and I think we should do that), the validation would fail. Should I add a dummy value (e.g. 00000000) for now, or how do we want to handle this?
Also, if there are no objections, I would go ahead and add <revision id="1" ...> entries for all papers that currently have revisions, so we can actually record the checksums of the original versions somewhere.
Sounds good to me. Maybe we can use a regexp à la [0-9a-f]|x instead and mark missing ones as xxxxxxxxxx? Then there would be no collision possibility.
I like the idea of adding a <revision id="1" ...> tag.
Should we fix #264 first? It should be easy to remove <url> lines where there is no paper being linked to, since this should be generating an error anyway.
@mjpost Can you share checksums of the IWPT files (and possibly EAMT, if they're gonna be merged soon)? I've got everything ready to add checksums when I find a minute, but I need complete coverage of course.
Here are IWPT. I think your changes will be ready before EAMT, and I'm still debugging the process, so I'll add the checksums after merging in your changes, if that's okay.
@mbollmann I just updated that file with volume-level PDFs (#810), you may wish to add those, too.
Once we have mirroring working, and a permanent mirror in place, it would also be nice to set up a workflow to publish a live version of every branch, for previewing. We could define a permanent mirror with full site functionality (e.g., aclanthology.org), and with live branches at, say, aclanthology.org/dev/{branch_name}.
tadaa: http://aclanthology.lst.uni-saarland.de/anthology/
attachments are not on the mirror atm, but that is purely due to miscalculated disk space and can be fixed. Will post the code soon.
Two observations:
Re the mirror for dev: sure, that is even easier because the papers do not need to be mirrored (and that is what the work was mostly about)
What I forgot: please click around and test whether anything is broken.
The mirroring seems to work—but what about changing the prefix? Using both a sub-domain and a top-level directory is redundant. Would this build at the root level instead of under /anthology?
I think we are (finally) done with this, after merging #1124
As the anthology is currently not reachable (.htaccess bug?), I want to renew this topic. There are already two issues regarding this (#22 and #28), but both are more about sharing the old rails application than sharing the data.
It would be great to be able to create a mirror by rsync-ing (or similar) the underlying data that is not in this repository. IIRC, the code change needed to host a mirror under a different URL should be minimal. 35 GB of data seems small enough to be easy to host.
TODO: