Closed by freelawbot 6 years ago
Comment by johnhawkinson Thursday Sep 12, 2013 at 02:18 GMT
Same problem today in Floyd v. City of New York (stop and frisk case) in ca2, 13-3088, docket 44.
9/11/13 9:55:24.676 PM [0x0-0x35035].org.mozilla.firefox: RECAP: Exception
9/11/13 9:55:24.676 PM [0x0-0x35035].org.mozilla.firefox: RECAP: After getting META
9/11/13 9:55:24.676 PM [0x0-0x35035].org.mozilla.firefox: RECAP: After name
9/11/13 9:55:24.676 PM [0x0-0x35035].org.mozilla.firefox: RECAP: Url: /cmecf/servlet/TransportRoom?servlet=ShowDoc&dls_id=00202736667&caseId=20140&dktType=dktPublic
9/11/13 9:55:24.676 PM [0x0-0x35035].org.mozilla.firefox: RECAP: Name: 00202736667.pdf
9/11/13 9:55:24.676 PM [0x0-0x35035].org.mozilla.firefox: RECAP: List of stuff:
9/11/13 9:55:24.676 PM [0x0-0x35035].org.mozilla.firefox: RECAP: {"mimetype":"application/pdf","court":"ca2","name":"00202736667.pdf","url":"/cmecf/servlet/TransportRoom?servlet=ShowDoc&dls_id=00202736667&caseId=20140&dktType=dktPublic"}
...
9/11/13 9:55:26.276 PM [0x0-0x35035].org.mozilla.firefox: RECAP: Posting file: 00202736667.pdf
The document doesn't show up on archive.org, though the docket.html updated: http://ia601004.us.archive.org/8/items/gov.uscourts.ca2.13-3088/
Comment by johnhawkinson Tuesday Oct 01, 2013 at 15:50 GMT
Oh, I think part of the problem relates to docket entries with attachments. gov.uscourts.ca2.13-3088.114.0.pdf worked just fine, but then none of the 3 subdocs of 115 were correctly renamed or uploaded.
Issues #47 and #44 are both duplicates of this one, so I'll be closing them shortly.
The problem as I currently understand it is:
None as a case number. Where we get None, the backend explodes and says:
Truncated incorrect DOUBLE value: 'None'
Where we get the case number, it says:
Truncated incorrect DOUBLE value: '15-5075'
In both cases, this crashes the IA uploader, and, fun fact, no joke, no item with a case number that's alphabetically after these values will get uploaded.
So, this is all rather bad.
The solutions here are:
[ ] The extensions need fixing so they get the correct values. On the backend, this problem has identified (through failure) the following problematic scrapers:
gov.uscourts.cafc.None.88.0.pdf
gov.uscourts.ca5.None.docket.xml
gov.uscourts.ca2.None.docket.xml
gov.uscourts.cafc.None.docket.xml
gov.uscourts.ca7.None.docket.xml
gov.uscourts.ca9.None.1.3.pdf
These will each need review.
[ ] On the backend, uploads_bucketlock.casenum needs to be a varchar(30), as it's defined in its Django model.
I believe this is the biggest issue with RECAP right now.
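To make the review of those failing uploads a little more concrete, here is a minimal sketch that scans archive filenames and reports which courts produced a None case number. It assumes the layout gov.uscourts.<court>.<casenum>.<rest> inferred from the names listed above; the function name and regular expression are hypothetical and are not taken from the RECAP codebase.

```python
import re

# Assumed layout, inferred from the filenames in this thread:
#   gov.uscourts.<court>.<casenum>.<rest>
FILENAME_RE = re.compile(
    r"^gov\.uscourts\.(?P<court>[a-z0-9]+)\.(?P<casenum>[^.]+)\.(?P<rest>.+)$"
)


def courts_with_missing_casenum(filenames):
    """Return the sorted court codes whose filenames carry 'None' as the case number."""
    courts = set()
    for name in filenames:
        match = FILENAME_RE.match(name)
        if match and match.group("casenum") == "None":
            courts.add(match.group("court"))
    return sorted(courts)


if __name__ == "__main__":
    names = [
        "gov.uscourts.cafc.None.88.0.pdf",
        "gov.uscourts.ca5.None.docket.xml",
        "gov.uscourts.ca2.13-3088.114.0.pdf",  # a well-formed name, for contrast
        "gov.uscourts.ca9.None.1.3.pdf",
    ]
    print(courts_with_missing_casenum(names))
```

A check like this only identifies which courts are affected; it says nothing about why the extension failed to find a case number for them.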
cc: @johnhawkinson, @carlmalamud
@johnhawkinson, you'll be pleased to learn that your file naming issue is resolved in freelawproject/recap-firefox@d50b6b3cb66ba1de7138ae23ec7898b79673f2cc
Remainder of this issue is still at large though. I shall press on.
@harlanyu, @dkapadia, @sjschultze, I put a bunch more time into this today and I think I found the issue, but I have a question for you guys.
It appears that certain versions of PACER have changed the POST data that is sent when you request an appellate docket sheet.
On old versions (like CA6), you'd have a GET request like:
https://ecf.ca6.uscourts.gov/cmecf/servlet/TransportRoom?
servlet=CaseSummary.jsp&
caseNum=15-1019&
incOrigDkt=Y&
incDktEntries=Y
And we could easily say that the case number was the value of caseNum. Great.
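As an illustration of the old-style lookup, the sketch below pulls caseNum out of the query string with Python's standard urllib.parse. The function name is hypothetical, and this is not the extension's actual code (the extension is JavaScript); it only shows that a URL without the parameter yields no case number at all.

```python
from urllib.parse import parse_qs, urlsplit


def case_number_from_url(url):
    """Return the caseNum query parameter, or None when the URL does not carry one."""
    query = parse_qs(urlsplit(url).query)
    values = query.get("caseNum")
    return values[0] if values else None


if __name__ == "__main__":
    old_style = (
        "https://ecf.ca6.uscourts.gov/cmecf/servlet/TransportRoom?"
        "servlet=CaseSummary.jsp&caseNum=15-1019&incOrigDkt=Y&incDktEntries=Y"
    )
    no_query = "https://ecf.ca9.uscourts.gov/n/beam/servlet/TransportRoom"
    print(case_number_from_url(old_style))  # 15-1019
    print(case_number_from_url(no_query))   # None
```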
In the new version (like CA9), the GET request is:
https://ecf.ca9.uscourts.gov/n/beam/servlet/TransportRoom
Not too helpful. But we collect the following from the POST request:
Content-Type: application/x-www-form-urlencoded
Content-Length: 196
servlet=CaseSummary.jsp&
caseId=267130&
fullDocketReport=Y&
incOrigDkt=Y&
incPrior=Y&
incAssoc=Y&
incPtyAty=Y&
incCaption=long&
incDktEntries=Y&
dateFrom=&
dateTo=&
incPdfMulti=Y&
actionType=Run+Docket+Report
There's a parameter in there for caseId, which we could start using for the case number, but I'm not sure if that's what we want to do since the value is clearly not the same as the docket number (which in this case is 15-80056).
Do you guys have insight?
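For reference, a minimal sketch of reading caseId out of a form-encoded POST body like the one above, again using urllib.parse. The function name is hypothetical. Note that what comes back is PACER's internal case id (267130 here), not the docket number (15-80056), which is exactly the open question.

```python
from urllib.parse import parse_qs

# The form fields from the CA9 request quoted above, joined into one body.
POST_BODY = (
    "servlet=CaseSummary.jsp&caseId=267130&fullDocketReport=Y&incOrigDkt=Y"
    "&incPrior=Y&incAssoc=Y&incPtyAty=Y&incCaption=long&incDktEntries=Y"
    "&dateFrom=&dateTo=&incPdfMulti=Y&actionType=Run+Docket+Report"
)


def pacer_case_id_from_post(body):
    """Return the caseId form field (PACER's internal id), or None if absent."""
    # keep_blank_values keeps empty fields such as dateFrom= in the result.
    fields = parse_qs(body, keep_blank_values=True)
    values = fields.get("caseId")
    return values[0] if values else None


if __name__ == "__main__":
    print(pacer_case_id_from_post(POST_BODY))  # 267130
```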
I'll just say, ca9 has moved to CM/ECF NextGen, so it's not surprising things are different there.
Also, I've found it MUCH BETTER to have urls like
http://archive.org/download/gov.uscourts.ca2.13-3088/
than to have
http://archive.org/download/gov.uscourts.nysd.320470
especially because of two bugs:
(1) The RECAP server search engine is broken and you can't rely on it to search for a docket number and get back the archive.org URL. Instead you have to go to CM/ECF and do a query and run a docket report and check the URL for the [R] icon links.
(2) Oftentimes a single case will return multiple case id numbers and that means the docket report on archive.org is broken into two parts, with no way to figure out which is which. For instance:
http://archive.org/download/gov.uscourts.mad.160895
http://archive.org/download/gov.uscourts.mad.160894
I think there is another open issue about this problem. But it really makes the RECAP docket...less than optimally useful. If those were instead
http://archive.org/download/gov.uscourts.mad.1:14-cr-10143
it would be a much better experience.
Really, using an internal identifier as user-facing is a data management mistake. It's super-hard to fix (flag day! compatibility!) with district court RECAP, but please let's not "fix" the appellate CM/ECF to have the same problem.
I vote for the official docket number.
I also like hyphens if the court likes hyphens.
Well, I agree with the last two comments that, ideally, we'd use real docket numbers in our archive.org URLs because it has always been a pain to find the PACER case id as @johnhawkinson explained. However, what @mlissner seems to be telling us is that the post data, by itself, is NOT giving us the docket number, but it is giving us the caseid#. So, it'd be a good bit easier to use data they are providing than to hunt-and-peck around trying to find the data we'd prefer.
CL also puts "internal" docid #s into its urls and I've always hated it. Just ask Mike if I favor "predictable URLs" and he can show you a couple hundred pages of emails about the topic. But the problem with docket numbers as they are used by our federal courts is that only if we use the form "1:14-cr-10143" are they actually unique (and I wouldn't be surprised to learn of collisions even when adding the first number and the cr/cv/bk designations.) So, if the courts cannot be relied upon to use unique identifiers, then we cannot adopt their almost-unique-identifiers where a unique identifier is required. I think if we can find a reliable way to retrieve the FULL docket numbers, with preceding colon-separated digit and with letter codes, then I'd be willing to try to use those until we learn for sure that they aren't unique. I really really hope they're unique, but the courts have never failed to let me down. Also, I don't know whether we can reliably retrieve them. That'll be @mlissner's call.
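As a rough sketch of what working with FULL docket numbers might look like, the snippet below parses the 1:14-cr-10143 form and builds an item name in the gov.uscourts.<court>.<docket> style proposed earlier in the thread. The regular expression, the field names (prefix, year, case_type, sequence), and both function names are assumptions made for illustration. Whether every court's docket numbers fit this pattern, and whether they are unique, is precisely what remains unverified.

```python
import re

# Assumed shape of a full district docket number, e.g. "1:14-cr-10143":
#   <prefix>:<two-digit year>-<case type letters>-<sequence>
FULL_DOCKET_RE = re.compile(
    r"^(?P<prefix>\d+):(?P<year>\d{2})-(?P<case_type>[a-z]+)-(?P<sequence>\d+)$"
)


def parse_full_docket_number(text):
    """Return the parts of a full docket number as a dict, or None if it is not in that form."""
    match = FULL_DOCKET_RE.match(text.strip().lower())
    return match.groupdict() if match else None


def item_name(court, docket_number):
    """Build a gov.uscourts.<court>.<docket> name, refusing short (ambiguous) numbers."""
    if parse_full_docket_number(docket_number) is None:
        return None
    return "gov.uscourts.{}.{}".format(court, docket_number.strip().lower())


if __name__ == "__main__":
    print(parse_full_docket_number("1:14-cr-10143"))
    print(item_name("mad", "1:14-cr-10143"))
    print(item_name("mad", "14-10143"))  # short form is rejected
```

Rejecting the short form rather than guessing the missing parts keeps near-duplicate docket numbers from silently mapping to the same item.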
Not the breakthrough on this issue we're looking for, but I've converted the casenum field to a varchar so it conforms with the model.
Closing this monster bug. Hopefully we'll get this resolved as a by-product of adding appellate court support in #83.
Issue by johnhawkinson Tuesday Aug 06, 2013 at 16:23 GMT Originally opened as https://github.com/freelawproject/recap-server/issues/38
In USA v. Auernheimer (weev) 13-1816 at ca3, i.e. http://ia601700.us.archive.org/17/items/gov.uscourts.ca3.13-1816/gov.uscourts.ca3.13-1816.docket.html, I just tried to download document 003011347514 0 ECF FILER: Response filed by Appellant Andrew Auernheimer to Motion to Accept Noncompliant filing, Motion stay request. Certificate of Service dated 08/05/2013. (HMF) from yesterday.
I ended up with ca3-Tra0sportRoom?servlet=ShowDoc&dls_id=003011347514&caseId=87236&dktType=dktPublic.pdf in my filesystem, and I don't know if anything was successfully uploaded to archive.org (looks like not), though docket metadata made it (unsurprisingly). The debugging output was not fatal-looking:
8/6/13 12:18:07.546 PM [0x0-0x17e77e6].org.mozilla.firefox: RECAP: Url: /cmecf/servlet/TransportRoom?servlet=ShowDoc&dls_id=003011347514&caseId=87236&dktType=dktPublic
8/6/13 12:18:07.546 PM [0x0-0x17e77e6].org.mozilla.firefox: RECAP: Name: 003011347514.pdf
8/6/13 12:18:07.546 PM [0x0-0x17e77e6].org.mozilla.firefox: RECAP: List of stuff:
...
8/6/13 12:18:12.243 PM [0x0-0x17e77e6].org.mozilla.firefox: RECAP: Posting file: 003011347514.pdf
8/6/13 12:18:21.506 PM [0x0-0x17e77e6].org.mozilla.firefox: RECAP: RECAP File Upload - PDF uploaded to the public archive.
8/6/13 12:18:21.509 PM [0x0-0x17e77e6].org.mozilla.firefox: RECAP: [object Object]