acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

Broken & missing links on the server #264

Open mbollmann opened 5 years ago

mbollmann commented 5 years ago

I have crosschecked a full file list from the aclweb.org server (created by @mjpost on 29.03.2019) with what would be expected after parsing the Anthology XML.

The result is a list of files that are either missing (= they should currently be linked on the website, but will 404) or unexpected (= they are on the server, but not currently linked).
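The crosscheck itself boils down to two set differences. A minimal sketch, assuming both sides have already been reduced to relative paths (the file names below are made up for illustration, and `crosscheck` is a hypothetical helper, not the actual script):

```python
def crosscheck(expected, on_server):
    """Return (missing, unexpected) as sorted lists of relative paths."""
    expected, on_server = set(expected), set(on_server)
    missing = sorted(expected - on_server)      # linked on the website, but would 404
    unexpected = sorted(on_server - expected)   # on the server, but not linked
    return missing, unexpected


# Illustrative data only; real input would come from the XML and the server listing.
expected = {"pdf/P/P18/P18-1001.pdf", "pdf/P/P18/P18-1002.pdf"}
on_server = {"pdf/P/P18/P18-1001.pdf", "pdf/P/P18/P18-1003.pdf"}

missing, unexpected = crosscheck(expected, on_server)
print("missing:", missing)
print("unexpected:", unexpected)
```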

Most recent status in this comment.

It reveals a swath of problems, for example:

What next?

Lines that stem from clear mistakes in the XML could obviously be manually fixed.

For journals that have front matter, and also for full volume PDFs, we could mark in the XML whether these files actually exist (e.g., by providing, and relying on, an explicit <file> or <url type="internal"> tag, as discussed in #156).

I can also update the gist after we commit corrections and/or I get an updated file list.

mjpost commented 5 years ago

Regarding the TACL false alarms, those were fixed by @aryamccarthy in #242 (after I made this list).

mbollmann commented 5 years ago

If you can send me an updated file list, I'll update this one. Might make sense to first figure out which problems are actual problems.

mjpost commented 5 years ago

Just did.

mbollmann commented 5 years ago

Updated the list. This and the recent corrections (including my unmerged ones in #267) remove about 54 files from the list.

davidweichiang commented 5 years ago

There are also broken external PDF links for LREC.

mjpost commented 5 years ago

Issue #31 notes missing volume and front-matter for LREC 2014.

What if we added a pdf-missing attribute to <paper> and <volume> tags? This would record the fact that the PDF is missing, and allow us to not generate the links.

It'd be nice to resolve this since it would let us close out three or four issues.

davidweichiang commented 5 years ago

I think papers should always have an explicit <url> field and the absence of it should indicate that there isn't one. As for the whole-volume PDFs...maybe they should be automatically created during the build if they are missing?

mjpost commented 5 years ago

Ah, right, #156.

Yes, I think a <url> tag is the right way to go, with relative or absolute URIs. If the value is just the Anthology ID (e.g., P18-1234), the canonical URI prefix will be prepended (which can be adjusted for mirrors). If it is absolute, it stays absolute.
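Roughly, the resolution rule could look like the sketch below. The function name and the prefix constant are assumptions for illustration; nothing here is existing Anthology code.

```python
from urllib.parse import urlparse

# Assumption: the canonical prefix; a mirror would override this value.
CANONICAL_PREFIX = "https://aclanthology.org/"


def resolve_url(value, prefix=CANONICAL_PREFIX):
    """Resolve the content of a <url> tag to a full URL."""
    if urlparse(value).scheme in ("http", "https"):
        return value           # absolute URLs stay absolute
    return prefix + value      # bare Anthology IDs get the canonical prefix


print(resolve_url("P18-1234"))
print(resolve_url("http://example.org/paper.pdf"))
print(resolve_url("P18-1234", prefix="https://mirror.example.org/anthology/"))
```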

Some papers have <paper href="...">, which we would want to consolidate.

mjpost commented 5 years ago

@mbollmann, do you want to re-run your script? I think that these should all be fixed since I did manual checks with #324. Here's a new ls-lR.gz.

I'm going to close this out in the meantime; feel free to reopen it.

mbollmann commented 5 years ago

> @mbollmann, do you want to re-run your script? I think that these should all be fixed since I did manual checks with #324. Here's a new ls-lR.gz.

This is the full list: https://pastebin.com/sP0q16wF

It contains a lot of clutter in the form of .xhtml and .ps files, and it lists all of Y18 as missing, which (I suspect) is because Y18 was added after you generated the ls-lR.

Therefore, here's a list reduced to only PDF files that are not Y18: https://pastebin.com/d6R0AVaB

There still seem to be quite a few legitimately missing files, as well as an assortment of files (mostly supplementary material) that aren't linked from the website.

mjpost commented 5 years ago

I created an index if you want it: http://cs.jhu.edu/~post/tmp/ls-lR-2019-10-24.gz

mbollmann commented 4 years ago

Can you recreate the index sometime, @mjpost?

mjpost commented 4 years ago

http://cs.jhu.edu/~post/tmp/ls-lR-2020-04-16.gz

mjpost commented 4 years ago

(Let me know if a different format would be helpful)

mbollmann commented 4 years ago

> http://cs.jhu.edu/~post/tmp/ls-lR-2020-04-16.gz

403 Forbidden. But since I need to recreate my script anyway, I believe the easiest-to-parse output would be ls -RQ1.
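For reference, a rough sketch of how that output could be parsed. It assumes GNU ls, where each directory header is printed as a quoted name followed by a colon and each entry is one quoted name per line. It does not undo backslash escapes inside quoted names, and directories show up as entries too. `parse_ls_rq1` and the sample listing are made up for illustration.

```python
import posixpath


def parse_ls_rq1(text):
    """Yield normalized paths from `ls -RQ1` output (directories included)."""
    current = "."
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('"') and line.endswith('":'):
            current = line[1:-2]    # directory header, e.g. "./pdf/P/P18":
        elif len(line) >= 2 and line.startswith('"') and line.endswith('"'):
            yield posixpath.normpath(posixpath.join(current, line[1:-1]))


# Made-up sample listing in the assumed format.
sample = '''".":
"pdf"

"./pdf":
"P"

"./pdf/P":
"P18"

"./pdf/P/P18":
"P18-1001.pdf"
"P18-1002.pdf"
'''

paths = list(parse_ls_rq1(sample))
pdfs = [p for p in paths if p.endswith(".pdf")]
print(pdfs)
```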

mjpost commented 4 years ago

http://cs.jhu.edu/~post/tmp/ls-RQ1-2020-04-18.gz

mbollmann commented 4 years ago

Some general stats:

Here's a gist with missing & unallocated PDFs & attachments. (I left out the thumbnails for brevity, also since I don't know how you produced them and whether it's expected that they're not there for all papers.)

mjpost commented 4 years ago

> There are 50264 .png files (not in attachments or thumbnails folder) – what is their purpose?

These were probably created mistakenly. The entire img/ subdirectory is not actually on the server. I should probably add --delete to my rsync command.

> There are 36876 .bib files – those are not our auto-generated files I assume, so they appear to be relics?

Yes, these are relics that used to be placed alongside the PDFs. We should probably delete them from the server, but I am reluctant to.

> For PDF files, 20 expected PDFs are missing and 2107 PDF files are unallocated (i.e., shouldn't be linked to from anywhere). For attachments, 5 expected files are missing and 16 files are unallocated.

These should be easy fixes, right?

> For thumbnails, 4848 images are missing and 1495 are unallocated.

Unallocated is strange. The ImageMagick command failed trying to generate thumbnails of some of the images.

mbollmann commented 4 years ago

> These should be easy fixes, right?

Depends. Removing missing files from the XML is an easy fix. Unallocated revisions could also be attached to their respective paper, as could unallocated attachments.

The main question is why they are missing or not linked to their respective papers in the first place. We might not want to link material that might have been excluded for a reason?

mjpost commented 4 years ago

I haven't looked at any of these, but my guess is that any errors are inadvertent and that we should restore them. I think we should operate on this assumption absent any evidence to the contrary. If it turns out we are reverting something that was done intentionally, we can hope it will arise again and that next time, we'll document it in the XML.

mbollmann commented 4 years ago

Check out the latest commit 498ea1c for unallocated files that I could automatically add to the XML.

Here's a couple that my one-time script wanted to link up, but I removed again:

There's also a bunch of files that are in the right places and look like correct ACL IDs, but are not actually in the XML. Some of these might be retracted papers, as noted in #760. I haven't checked them yet, but I'm copy-pasting the full list here for reference.

WARNING  For pdf/C/C94/C94-2213.pdf: couldn't find element ./volume[@id='2']/paper[@id='213'] in C94.xml
WARNING  For pdf/D/D19/D19-1479.pdf: couldn't find element ./volume[@id='1']/paper[@id='479'] in D19.xml
WARNING  For pdf/J/J05/J05-4007.pdf: couldn't find element ./volume[@id='4']/paper[@id='7'] in J05.xml
WARNING  For pdf/K/K19/K19-20.pdf: couldn't find element ./volume[@id='20'] in K19.xml
WARNING  For pdf/K/K19/K19-10.pdf: couldn't find element ./volume[@id='10'] in K19.xml
WARNING  For pdf/M/M98/M98-0023.pdf: couldn't find element ./volume[@id='0'] in M98.xml
WARNING  For pdf/N/N19/N19-1338.pdf: couldn't find element ./volume[@id='1']/paper[@id='338'] in N19.xml
WARNING  For pdf/P/P98/P98-2133.pdf: couldn't find element ./volume[@id='2']/paper[@id='133'] in P98.xml
WARNING  For pdf/P/P99/P99-1084.pdf: couldn't find element ./volume[@id='1']/paper[@id='84'] in P99.xml
WARNING  For pdf/P/P08/P08-30.pdf: couldn't find element ./volume[@id='30'] in P08.xml
WARNING  For pdf/P/P08/P08-40.pdf: couldn't find element ./volume[@id='40'] in P08.xml
WARNING  For pdf/Q/Q18/Q18-1250.pdf: couldn't find element ./volume[@id='1']/paper[@id='250'] in Q18.xml
WARNING  For pdf/Q/Q18/Q18-1248.pdf: couldn't find element ./volume[@id='1']/paper[@id='248'] in Q18.xml
WARNING  For pdf/Q/Q18/Q18-1241.pdf: couldn't find element ./volume[@id='1']/paper[@id='241'] in Q18.xml
WARNING  For pdf/Q/Q18/Q18-1240.pdf: couldn't find element ./volume[@id='1']/paper[@id='240'] in Q18.xml
WARNING  For pdf/Q/Q18/Q18-1247.pdf: couldn't find element ./volume[@id='1']/paper[@id='247'] in Q18.xml
WARNING  For pdf/Q/Q18/Q18-1249.pdf: couldn't find element ./volume[@id='1']/paper[@id='249'] in Q18.xml
WARNING  For pdf/Q/Q18/Q18-1242.pdf: couldn't find element ./volume[@id='1']/paper[@id='242'] in Q18.xml
WARNING  For pdf/S/S98/S98-2.pdf: couldn't find element ./volume[@id='2'] in S98.xml
WARNING  For pdf/U/U16/U16-1024.pdf: couldn't find element ./volume[@id='1']/paper[@id='24'] in U16.xml
WARNING  For pdf/W/W00/W00-1222.pdf: couldn't find element ./volume[@id='12']/paper[@id='22'] in W00.xml
WARNING  For pdf/W/W00/W00-0805.pdf: couldn't find element ./volume[@id='8']/paper[@id='5'] in W00.xml
WARNING  For pdf/W/W03/W03-2913.pdf: couldn't find element ./volume[@id='29']/paper[@id='13'] in W03.xml
WARNING  For pdf/W/W13/W13-3734.pdf: couldn't find element ./volume[@id='37']/paper[@id='34'] in W13.xml
WARNING  For pdf/W/W14/W14-0157.pdf: couldn't find element ./volume[@id='1']/paper[@id='57'] in W14.xml
WARNING  For pdf/W/W14/W14-0159.pdf: couldn't find element ./volume[@id='1']/paper[@id='59'] in W14.xml
WARNING  For pdf/W/W14/W14-5510.pdf: couldn't find element ./volume[@id='55']/paper[@id='10'] in W14.xml
WARNING  For pdf/W/W14/W14-0158.pdf: couldn't find element ./volume[@id='1']/paper[@id='58'] in W14.xml
WARNING  For pdf/W/W14/W14-0156.pdf: couldn't find element ./volume[@id='1']/paper[@id='56'] in W14.xml
WARNING  For pdf/W/W15/W15-4947.pdf: couldn't find element ./volume[@id='49']/paper[@id='47'] in W15.xml
WARNING  For pdf/W/W15/W15-5714.pdf: couldn't find element ./volume[@id='57']/paper[@id='14'] in W15.xml
WARNING  For pdf/W/W16/W16-3709.pdf: couldn't find element ./volume[@id='37']/paper[@id='9'] in W16.xml
WARNING  For pdf/W/W16/W16-3108.pdf: couldn't find element ./volume[@id='31']/paper[@id='8'] in W16.xml
WARNING  For pdf/W/W16/W16-6407.pdf: couldn't find element ./volume[@id='64']/paper[@id='7'] in W16.xml
WARNING  For pdf/W/W18/W18-5823.pdf: couldn't find element ./volume[@id='58']/paper[@id='23'] in W18.xml
WARNING  For pdf/W/W18/W18-3014.pdf: couldn't find element ./volume[@id='30']/paper[@id='14'] in W18.xml
WARNING  For pdf/W/W18/W18-5822.pdf: couldn't find element ./volume[@id='58']/paper[@id='22'] in W18.xml
WARNING  For pdf/W/W18/W18-5821.pdf: couldn't find element ./volume[@id='58']/paper[@id='21'] in W18.xml
WARNING  For pdf/W/W18/W18-5820.pdf: couldn't find element ./volume[@id='58']/paper[@id='20'] in W18.xml
WARNING  For pdf/W/W18/W18-591.pdf: couldn't find element ./volume[@id='591'] in W18.xml
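
For anyone reading the warnings above, the mapping from file name to XML lookup can be reconstructed roughly as follows. This is a sketch inferred from the log lines themselves, not the actual script: four-digit IDs split into volume and paper (two and two digits for W collections, one and three otherwise), and shorter IDs are treated as whole volumes. `xpath_for` is a hypothetical name.

```python
import re


def xpath_for(pdf_path):
    """Map a PDF path such as pdf/C/C94/C94-2213.pdf to (XML file, XPath)."""
    m = re.fullmatch(r"(?:.*/)?([A-Z]\d{2})-(\d+)\.pdf", pdf_path)
    if m is None:
        raise ValueError(f"unrecognized path: {pdf_path}")
    collection, number = m.groups()
    if len(number) != 4:
        # Short IDs (e.g. K19-20) appear to refer to a whole volume.
        return f"{collection}.xml", f"./volume[@id='{int(number)}']"
    # W collections use two-digit volumes; the others use one digit.
    split = 2 if collection.startswith("W") else 1
    volume, paper = int(number[:split]), int(number[split:])
    return f"{collection}.xml", f"./volume[@id='{volume}']/paper[@id='{paper}']"


print(xpath_for("pdf/C/C94/C94-2213.pdf"))
print(xpath_for("pdf/W/W00/W00-1222.pdf"))
print(xpath_for("pdf/K/K19/K19-20.pdf"))
```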

akoehn commented 3 years ago

Do we want to recheck here? All files that are referenced in the XML are correct as of now (see #598).

This issue might still have some PDFs that should be in the XML but are not?