Open mbollmann opened 5 years ago
Regarding the TACL false alarms, those were fixed by @aryamccarthy in #242 (after I made this list).
If you can send me an updated file list, I'll update this one. Might make sense to first figure out which problems are actual problems.
Just did.
Updated the list. This and the recent corrections (including my unmerged ones in #267) remove about 54 files from the list.
There are also broken external PDF links for LREC.
Issue #31 notes missing volume and front-matter for LREC 2014.
What if we added a pdf-missing attribute to <paper> and <volume> tags? This would record the fact that the PDF is missing, and allow us to not generate the links.
It'd be nice to resolve this since it would let us close out three or four issues.
I think papers should always have an explicit <url> field and the absence of it should indicate that there isn't one. As for the whole-volume PDFs...maybe they should be automatically created during the build if they are missing?
Ah, right, #156.
Yes, I think a <url> tag is the right way to go, with relative or absolute URIs. If the value is just the Anthology ID (e.g., P18-1234), the canonical URI prefix will be prepended (which can be adjusted for mirrors). If it is absolute, it stays absolute.
Some papers have <paper href="...">, which we would want to consolidate.
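The resolution rule proposed above could look roughly like this in Python. This is only a sketch: the function name `resolve_url` and the prefix `https://example.org/anthology/` are placeholders made up for illustration, not the Anthology's actual code or canonical URL.

```python
from urllib.parse import urlparse

# Hypothetical placeholder; a mirror would substitute its own prefix.
CANONICAL_PREFIX = "https://example.org/anthology/"


def resolve_url(value, prefix=CANONICAL_PREFIX):
    """Resolve the text of a <url> tag to a full URI.

    Absolute URIs (anything with a scheme such as http/https) are
    returned unchanged; a bare Anthology ID gets the prefix prepended.
    """
    if urlparse(value).scheme:
        return value
    return prefix + value
```

A mirror would then only need to pass a different `prefix`, while externally hosted papers keep their absolute links.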
@mbollmann, do you want to re-run your script? I think that these should all be fixed since I did manual checks with #324. Here's a new ls-lR.gz.
I'm going to close this out in the meantime; feel free to reopen it.
This is the full list: https://pastebin.com/sP0q16wF
It contains a lot of clutter in the form of .xhtml and .ps files, and it lists all of Y18 as missing, which is (I suspect) because it was added after you generated the ls-lR.
Therefore, here's a list reduced to only PDF files that are not Y18: https://pastebin.com/d6R0AVaB
There still seem to be quite a few legitimately missing files, as well as an assortment of files (mostly supplementary material) that aren't linked from the website.
I created an index if you want it: http://cs.jhu.edu/~post/tmp/ls-lR-2019-10-24.gz
Can you recreate the index some time @mjpost?
(Let me know if a different format would be helpful)
403 Forbidden. But since I need to recreate my script anyway, I believe the easiest-to-parse output would be ls -RQ1.
Some general stats:
There are 50264 .png files (not in attachments or thumbnails folder) – what is their purpose?
There are 36876 .bib files – those are not our auto-generated files I assume, so they appear to be relics?
Here's a gist with missing & unallocated PDFs & attachments. (I left out the thumbnails for brevity, also since I don't know how you produced them and whether it's expected that they're not there for all papers.)
There are 50264 .png files (not in attachments or thumbnails folder) – what is their purpose?
These were probably created mistakenly. The entire img/ subdirectory is not actually on the server. I should probably add --delete to my rsync command.
There are 36876 .bib files – those are not our auto-generated files I assume, so they appear to be relics?
Yes, these are relics that used to be put in place alongside the PDFs. We should probably delete them from the server, but I am reluctant to.
For PDF files, 20 expected PDFs are missing and 2107 PDF files are unallocated (i.e., shouldn't be linked to from anywhere). For attachments, 5 expected files are missing and 16 files are unallocated.
These should be easy fixes, right?
For thumbnails, 4848 images are missing and 1495 are unallocated.
Unallocated is strange. The ImageMagick command failed trying to generate thumbnails of some of the images.
These should be easy fixes, right?
Depends. Removing missing files from the XML is an easy fix. Unallocated revisions could also be attached to their respective paper, as could unallocated attachments.
The main questions are why they are missing/not linked to their respective paper in the first place. We might not want to link material that might have been excluded for a reason?
I haven't looked at any of these, but my guess is that any errors are inadvertent and that we should restore them. I think we should operate on this assumption absent any evidence to the contrary. If it turns out we are reverting something that was done intentionally, we can hope it will arise again and that next time, we'll document it in the XML.
Check out the latest commit 498ea1c for unallocated files that I could automatically add to the XML.
Here are a couple that my one-time script wanted to link up, but that I removed again:
L08-1001.pdf, L08-1002.pdf, L08-1003.pdf – these (and only these from L08) exist on the server, but are actually hosted (and linked to) externally
W14-3321.Dataset.zip – there already is W14-3321.Datasets.zip
W19-0423v1.pdf – no v2 exists
W19-36*.pdf – while PDFs for all of the papers seem to exist on the server (according to the CRC32 file Matt posted), none of them are linked and trying to access them manually actually gives 404s. Is that intentional?
W19-5030v5.pdf, W19-5030v6.pdf – these exist, but there's no v3 or v4, so I'm not sure what's going on here.
There's also a bunch of files that are in the right places and look like correct ACL IDs, but are not actually in the XML. Some of these might be retracted papers as noted in #760; I didn't check them yet, but I'm copy-pasting the full list here for reference.
WARNING For pdf/C/C94/C94-2213.pdf: couldn't find element ./volume[@id='2']/paper[@id='213'] in C94.xml
WARNING For pdf/D/D19/D19-1479.pdf: couldn't find element ./volume[@id='1']/paper[@id='479'] in D19.xml
WARNING For pdf/J/J05/J05-4007.pdf: couldn't find element ./volume[@id='4']/paper[@id='7'] in J05.xml
WARNING For pdf/K/K19/K19-20.pdf: couldn't find element ./volume[@id='20'] in K19.xml
WARNING For pdf/K/K19/K19-10.pdf: couldn't find element ./volume[@id='10'] in K19.xml
WARNING For pdf/M/M98/M98-0023.pdf: couldn't find element ./volume[@id='0'] in M98.xml
WARNING For pdf/N/N19/N19-1338.pdf: couldn't find element ./volume[@id='1']/paper[@id='338'] in N19.xml
WARNING For pdf/P/P98/P98-2133.pdf: couldn't find element ./volume[@id='2']/paper[@id='133'] in P98.xml
WARNING For pdf/P/P99/P99-1084.pdf: couldn't find element ./volume[@id='1']/paper[@id='84'] in P99.xml
WARNING For pdf/P/P08/P08-30.pdf: couldn't find element ./volume[@id='30'] in P08.xml
WARNING For pdf/P/P08/P08-40.pdf: couldn't find element ./volume[@id='40'] in P08.xml
WARNING For pdf/Q/Q18/Q18-1250.pdf: couldn't find element ./volume[@id='1']/paper[@id='250'] in Q18.xml
WARNING For pdf/Q/Q18/Q18-1248.pdf: couldn't find element ./volume[@id='1']/paper[@id='248'] in Q18.xml
WARNING For pdf/Q/Q18/Q18-1241.pdf: couldn't find element ./volume[@id='1']/paper[@id='241'] in Q18.xml
WARNING For pdf/Q/Q18/Q18-1240.pdf: couldn't find element ./volume[@id='1']/paper[@id='240'] in Q18.xml
WARNING For pdf/Q/Q18/Q18-1247.pdf: couldn't find element ./volume[@id='1']/paper[@id='247'] in Q18.xml
WARNING For pdf/Q/Q18/Q18-1249.pdf: couldn't find element ./volume[@id='1']/paper[@id='249'] in Q18.xml
WARNING For pdf/Q/Q18/Q18-1242.pdf: couldn't find element ./volume[@id='1']/paper[@id='242'] in Q18.xml
WARNING For pdf/S/S98/S98-2.pdf: couldn't find element ./volume[@id='2'] in S98.xml
WARNING For pdf/U/U16/U16-1024.pdf: couldn't find element ./volume[@id='1']/paper[@id='24'] in U16.xml
WARNING For pdf/W/W00/W00-1222.pdf: couldn't find element ./volume[@id='12']/paper[@id='22'] in W00.xml
WARNING For pdf/W/W00/W00-0805.pdf: couldn't find element ./volume[@id='8']/paper[@id='5'] in W00.xml
WARNING For pdf/W/W03/W03-2913.pdf: couldn't find element ./volume[@id='29']/paper[@id='13'] in W03.xml
WARNING For pdf/W/W13/W13-3734.pdf: couldn't find element ./volume[@id='37']/paper[@id='34'] in W13.xml
WARNING For pdf/W/W14/W14-0157.pdf: couldn't find element ./volume[@id='1']/paper[@id='57'] in W14.xml
WARNING For pdf/W/W14/W14-0159.pdf: couldn't find element ./volume[@id='1']/paper[@id='59'] in W14.xml
WARNING For pdf/W/W14/W14-5510.pdf: couldn't find element ./volume[@id='55']/paper[@id='10'] in W14.xml
WARNING For pdf/W/W14/W14-0158.pdf: couldn't find element ./volume[@id='1']/paper[@id='58'] in W14.xml
WARNING For pdf/W/W14/W14-0156.pdf: couldn't find element ./volume[@id='1']/paper[@id='56'] in W14.xml
WARNING For pdf/W/W15/W15-4947.pdf: couldn't find element ./volume[@id='49']/paper[@id='47'] in W15.xml
WARNING For pdf/W/W15/W15-5714.pdf: couldn't find element ./volume[@id='57']/paper[@id='14'] in W15.xml
WARNING For pdf/W/W16/W16-3709.pdf: couldn't find element ./volume[@id='37']/paper[@id='9'] in W16.xml
WARNING For pdf/W/W16/W16-3108.pdf: couldn't find element ./volume[@id='31']/paper[@id='8'] in W16.xml
WARNING For pdf/W/W16/W16-6407.pdf: couldn't find element ./volume[@id='64']/paper[@id='7'] in W16.xml
WARNING For pdf/W/W18/W18-5823.pdf: couldn't find element ./volume[@id='58']/paper[@id='23'] in W18.xml
WARNING For pdf/W/W18/W18-3014.pdf: couldn't find element ./volume[@id='30']/paper[@id='14'] in W18.xml
WARNING For pdf/W/W18/W18-5822.pdf: couldn't find element ./volume[@id='58']/paper[@id='22'] in W18.xml
WARNING For pdf/W/W18/W18-5821.pdf: couldn't find element ./volume[@id='58']/paper[@id='21'] in W18.xml
WARNING For pdf/W/W18/W18-5820.pdf: couldn't find element ./volume[@id='58']/paper[@id='20'] in W18.xml
WARNING For pdf/W/W18/W18-591.pdf: couldn't find element ./volume[@id='591'] in W18.xml
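For anyone wanting to reproduce these lookups, here is a rough Python sketch of how a PDF path might be mapped to the XML file and XPath shown in the warnings. The splitting rule is inferred from the log above, not taken from the actual script: it assumes that only W collections use a two-digit volume, that other collections use a one-digit volume, and that IDs with fewer than four digits refer to whole volumes. The name `xpath_for_pdf` is made up.

```python
import re
from pathlib import PurePosixPath

# Old-style Anthology ID: one letter, two-digit year, dash, digits.
ID_PATTERN = re.compile(r"^([A-Z]\d{2})-(\d+)$")


def xpath_for_pdf(path):
    """Return (xml_file, xpath) for a path like 'pdf/W/W00/W00-1222.pdf'.

    Raises ValueError for names that are not plain IDs, such as
    revisions (W19-0423v1.pdf) or attachments (P16-1070.Notes.pdf).
    """
    stem = PurePosixPath(path).stem
    match = ID_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"not a plain Anthology ID: {stem}")
    collection, number = match.groups()
    xml_file = f"{collection}.xml"
    if len(number) < 4:
        # Assumption: short IDs such as K19-20 denote a whole volume.
        return xml_file, f"./volume[@id='{int(number)}']"
    # Assumption: W collections split 2+2 digits, all others 1+3.
    split = 2 if collection.startswith("W") else 1
    volume, paper = int(number[:split]), int(number[split:])
    return xml_file, f"./volume[@id='{volume}']/paper[@id='{paper}']"
```

The returned XPath could then be passed to `ElementTree.find` on the parsed XML file; a `None` result corresponds to one of the warnings above.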
Do we want to recheck here? All files that are referenced in the XML are correct as of now (see #598).
This issue might still have some PDFs that should be in the XML but are not?
I have crosschecked a full file list from the aclweb.org server (created by @mjpost on 29.03.2019) with what would be expected after parsing the Anthology XML.
The result is a list of files that are either missing (= they should currently be linked on the website, but will 404) or unexpected (= they are on the server, but not currently linked).
Most recent status in this comment.
It reveals a swath of problems, for example:
Journals that have front matter (as discussed in #181) will show up as "unexpected", e.g.:
Unexpected: J98-1000.pdf
Attachments that appear to have wrong names in the XML, e.g.:
Unexpected: P16-1070.Notes.pdf Missing: P16-1070.Notes.zip
Something weird going on with EACL 1997; papers are listed twice, once as E97-* and once as P97-* (probably a joint meeting?), with the E97-* files not actually existing on the server.
Some of them are also false alarms, e.g., a bunch of TACL papers show up as missing, such as:
But the URLs for them actually work: Q18-1006, Q18-1034, Q18-1035. The same applies to (many of?) the seemingly missing revisions & errata. Maybe there's some redirection magic going on on the server to places that are not included in the file list I've got?
Potentially many more.
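The crosscheck itself boils down to two set differences. A minimal Python sketch, assuming both the server listing and the names expected from the XML have already been reduced to plain file names (the function name `crosscheck` is made up for illustration and is not the script used for the gist):

```python
def crosscheck(server_files, expected_files):
    """Compare files on the server against files expected from the XML.

    Returns (missing, unexpected):
      missing    = expected from the XML, but not on the server (would 404)
      unexpected = on the server, but not referenced by the XML
    """
    server = set(server_files)
    expected = set(expected_files)
    missing = sorted(expected - server)
    unexpected = sorted(server - expected)
    return missing, unexpected


# The attachment example from above: the XML names a .zip,
# while the server holds a .pdf.
missing, unexpected = crosscheck(
    server_files=["P16-1070.pdf", "P16-1070.Notes.pdf"],
    expected_files=["P16-1070.pdf", "P16-1070.Notes.zip"],
)
```

Here `missing` ends up as `["P16-1070.Notes.zip"]` and `unexpected` as `["P16-1070.Notes.pdf"]`, matching the pattern reported for wrongly named attachments.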
What next?
Lines that stem from clear mistakes in the XML could obviously be manually fixed.
For journals that have front matter, and also for full volume PDFs, we could mark in the XML if these files actually exist or not (e.g., by providing, and relying on, an explicit <file> or <url type="internal"> tag, as discussed in #156).
I can also update the gist after we commit corrections and/or I get an updated file list.