Open grossir opened 4 months ago
Thanks @grossir -
Another approach for non-obvious gaps is using the `search_citation` table.
We have 8 citation types. On some of these types, such as `NEUTRAL`, the page may actually be a serial number. For a given reporter and volume, we would expect the page to be a sequence 1, 2, 3, 4... That way, it would be easy to detect missing citations and, through some clever joining, get the date range to backscrape the source
For example, we have 2018 CO 1 and 2018 CO 3, but we are missing 2018 CO 2
Sometimes these gaps may be false positives, because a single opinion may be citable through different systems/reporters (parallel citation?). I will sample at least one opinion of each gap to see if that's the case
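The sequence check described above can be sketched in Python. This is only a minimal illustration, not the query actually run against `search_citation`: the function name `find_missing_pages` and the `(reporter, volume, page)` tuple layout are assumptions made for the example.

```python
from collections import defaultdict


def find_missing_pages(citations):
    """Return {(reporter, volume): [missing page numbers]}.

    `citations` is an iterable of (reporter, volume, page) tuples, where
    `page` is assumed to be a serial number starting at 1 (as happens with
    some NEUTRAL citations). This layout is hypothetical, not the real schema.
    """
    pages_by_group = defaultdict(set)
    for reporter, volume, page in citations:
        pages_by_group[(reporter, volume)].add(int(page))

    missing = {}
    for group, pages in pages_by_group.items():
        # Everything from 1 up to the highest page we have should exist.
        expected = set(range(1, max(pages) + 1))
        gaps = sorted(expected - pages)
        if gaps:
            missing[group] = gaps
    return missing


# The example from this thread: we have 2018 CO 1 and 2018 CO 3.
sample = [("CO", "2018", "1"), ("CO", "2018", "3")]
print(find_missing_pages(sample))  # {('CO', '2018'): [2]}
```

Note that this only finds holes below the highest serial number we already hold, so opinions missing from the end of a year would go unnoticed, and parallel citations can still produce false positives.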
Citation types for reference:
FEDERAL = 1
STATE = 2
STATE_REGIONAL = 3
SPECIALTY = 4
SCOTUS_EARLY = 5
LEXIS = 6
WEST = 7
NEUTRAL = 8
CITATION_TYPES = (
    (FEDERAL, "A federal reporter citation (e.g. 5 F. 55)"),
    (
        STATE,
        "A citation in a state-based reporter (e.g. Alabama Reports)",
    ),
    (
        STATE_REGIONAL,
        "A citation in a regional reporter (e.g. Atlantic Reporter)",
    ),
    (
        SPECIALTY,
        "A citation in a specialty reporter (e.g. Lawyers' Edition)",
    ),
    (
        SCOTUS_EARLY,
        "A citation in an early SCOTUS reporter (e.g. 5 Black. 55)",
    ),
    (LEXIS, "A citation in the Lexis system (e.g. 5 LEXIS 55)"),
    (WEST, "A citation in the WestLaw system (e.g. 5 WL 55)"),
    (NEUTRAL, "A vendor neutral citation (e.g. 2013 FL 1)"),
)
`NEUTRAL` citation
Analysis of these did point to a gap, probably because the source we scrape publishes only a subset of opinions
Neb. Ct. App.
This reporter has type `NEUTRAL`, but the "page" value is actually a page. See source
The source allows filtering by year decided and ordering by neutral citation, so it is easy to inspect what we are missing
We have "missing" pages. However, after sampling the gaps, we do have the opinions on CourtListener. We are just missing the neutral citations
These comments, @grossir, are really exciting. The fact that we're not only tracking such gaps at this point, but also fixing them, is an exciting milestone. Thank you!
To keep track of it, we have no data for Missouri Supreme Court (`mo`) since May 2022. Hopefully it can be solved once freelawproject/courtlistener#3996 is completed
I think there are 3 big classes of gaps:
0 gap: when we have 0 documents for a time period, and have a regular count before and after. We would be able to see this clearly on a graph, as a hole in a `date_filed` histogram. This may be the easiest to catch, both visually and with a DB query
subtle gap: we are missing opinions/documents, but it is not obvious that we are missing them. We have a regular amount of data for the time period where we are missing them. These gaps may be detected by comparing against source totals. These totals may be obtained from the same scraped source, or from some statistics published by the court. We may want to collect the totals for important courts, such as SCOTUS.
backscraper gap: when opinions are available for the past but we haven't run the backscrapers to get them. We would have to check every source and see how far into the past it goes, and decide if we want to collect that data. Prioritization may be done by citation volume, as a metric of importance of the opinions from that source
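For the subtle gap class, comparing our counts against source totals could look roughly like the sketch below. The function name `compare_with_source_totals`, the `tolerance` parameter, and all the numbers are made up for illustration; how the totals are actually collected depends on each source.

```python
def compare_with_source_totals(our_counts, source_totals, tolerance=0.05):
    """Return [(period, ours, expected)] for periods where we fall short.

    Both arguments are dicts mapping a period (e.g. a year) to a document
    count. `tolerance` is a hypothetical allowance for small differences,
    such as documents the source counts but we deliberately skip.
    """
    shortfalls = []
    for period, expected in sorted(source_totals.items()):
        ours = our_counts.get(period, 0)
        if expected > 0 and ours < expected * (1 - tolerance):
            shortfalls.append((period, ours, expected))
    return shortfalls


# Made-up counts: 2020 looks "regular" on its own but is far below the source.
our_counts = {2019: 95, 2020: 40, 2021: 100}
source_totals = {2019: 96, 2020: 120, 2021: 100}
print(compare_with_source_totals(our_counts, source_totals))
# [(2020, 40, 120)]
```

A relative tolerance is used here so that high-volume and low-volume courts can share one check; an absolute threshold might suit small courts better.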
0-gap
I think this is the easiest analysis, so I checked it on the dev database and compared it to the live DB with CL searches. Here we have some examples, the dates the 0-gap spans, and the number of missing documents. On my dev DB queries the gap sometimes spans more time, probably because a backscraper was run after the DB snapshot was taken
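As a rough illustration of the idea (not the actual DB query), a 0-gap check over a court's `date_filed` values can be sketched as follows. The function name `find_zero_gaps` and the `min_gap_days` default are assumptions; the sample dates are placed around the `arkctapp` gap listed below.

```python
from datetime import date


def find_zero_gaps(dates_filed, min_gap_days=90):
    """Return (start, end, days) for every stretch with no filings.

    `dates_filed` is an iterable of datetime.date values for one court.
    `min_gap_days` is a hypothetical threshold; a sensible value depends on
    how often the court publishes and on its term calendar.
    """
    ordered = sorted(set(dates_filed))
    gaps = []
    # Compare each filing date with the next one in chronological order.
    for previous, current in zip(ordered, ordered[1:]):
        days = (current - previous).days
        if days > min_gap_days:
            gaps.append((previous, current, days))
    return gaps


filings = [
    date(2019, 6, 4),
    date(2019, 6, 5),
    date(2022, 2, 9),
    date(2022, 2, 10),
]
for start, end, days in find_zero_gaps(filings):
    print(f"0 documents between {start} and {end} ({days} days)")
```

Gaps found this way still need to be checked against the source, since court recesses and sources with no data for the period produce the same pattern.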
alaska
Bug in the scraper is the cause of the gaps. See #937
arkctapp
From June 5th, 2019 to February 9th, 2022, we have 0 documents. Missing around 1100 docs: 586 docs in 2020, 511 in 2021, and some more from 2019/2022 (Source)
ark
From June 21st, 2019 to May 13th, 2020 we have 0 documents. Around 300 missing documents from second half 2019 and first half 2020
Between September 11, 2020 and March 03, 2021 we have 0 documents. We are missing around 100 documents
From March 19th, 2021 to December 1st, 2021 we have 0 documents. Around 130 missing documents
colo
0 documents between June 5th, 2021 and January 30, 2022. I have identified some missing opinions, but they are not many
Between June 21, 2022 and January 09, 2023 we have 0 documents. There are documents for this time period on the source
coloctapp
Between September 29, 2021 and February 02, 2022 we have 0 documents. We are missing documents, but must go into PDFs to get them now
conn
Between May 25, 2022 and January 09, 2023 we have 0 documents. We are missing documents because the scraper looks for the string `Published in the Law Journal`, but in that year the format looks like `Published in the Connecticut Law Journal`
fla
Between September 06, 2019 and December 18, 2019 we have 0 documents. We are missing 35 documents
illappct
Between October 22, 2019 and May 21, 2020 we have 0 documents. We are missing around 1600 documents (152 per page, 11 full pages)
Between May 29, 2021 and November 15, 2021 we have 0 documents. We are missing around 1350 documents
ky
Between December 28th, 2017 and May 16th, 2020, there are 6 documents in CL. There should be 626 according to the source
I notice that the interface I found is different from the endpoints used in the scraper. This endpoint contains the opinion type, too (for example `AFFIRMING IN PART AND REVERSING IN PART AND REMANDING`, `AFFIRMING AND REMANDING`, etc.)
mass
This source doesn't list old opinions, one has to go to external providers (recommended by the state source itself) to find them.
Between September 30th, 2021 and February 6th, 2023 we have 0 documents. According to FindLaw there are 261 Supreme Court opinions
massappct
This source doesn't list old opinions, one has to go to external providers to find them.
Between May 24th, 2021 and February 7th, 2023 we only have 8 opinions. According to FindLaw, there are 562.
md
0 opinions for 2020. Missing 143 documents
mdctspecapp
Between November 1st, 2019 and January 27, 2021, we have 0 documents. Missing 114 documents from 2020, and around 30 from 2019.
miss
Between October 25, 2019 and February 12, 2020 we have 0 documents. We are missing documents, but they are difficult to count. For example, a search for `MARSHA R. HINTON AND THOMAS F. HINTON v. SPORTSMAN’S GUIDE, INC` will not find results
ny
Between April 27, 2018 and February 13, 2019 we have 1 document. We are missing 92 documents
Between June 16, 2023 and October 18, 2023 we have 0 documents. We are missing 2 documents
Between June 04, 2021 and October 06, 2021 we have 0 documents. We are missing 5 documents
nyappterm
Between June 15th, 2020 and February 2nd, 2023 we have 5 documents in CL. From the source this amounts to more than 900 documents: 399 from Appellate Term, 1st Dept, and more than 500 from 2nd Dept.
nysupct
This applies to the `nysupct_commercial` scraper. The current `nysupct` scraper was recently implemented as part of `nytrial`. We have 34 documents from June 11th, 2020 to April 25th, 2023. These documents have a bunch of duplicates. There should be 246 for that time period, according to the source.
nm
Between 2019-04-11 and 2021-03-04 we have 6 opinions. Should have 80
Between 2021-04-13 and 2022-01-09 we have 0 opinions, should have 30.
nmctapp
Between June 7th, 2019 and April 1st, 2022 we have 3 documents in CL. From the source, we are missing 1265 documents
nc
Between September 25, 2020 and February 05, 2021 we have 0 documents. We are missing around 50 published opinions from late 2020.
ncctapp
From November 4, 2020 to January 1st, 2022 we have 0 documents. There is data in the source for this time period
pa
Between May 28, 2021 and November 16, 2021 we have 0 documents. There are 1261 documents in that time period, some of which are not opinions
pasuperct
Between May 28, 2021 and November 17, 2021 we have 0 documents. We are missing around 1504 "Non Precedential" documents and 164 Precedential opinions (source)
pacommwct
Between May 28, 2021 and August 23, 2021 we have 0 documents. We are missing 49 precedential opinions
sd
We have a gap from August 30th, 2019 to January 4th, 2023, which should amount to a little over 240 missing documents (totals are not so easy to count here)
tex
Between May 05, 2020 and August 20, 2020 we have 2 documents. We are missing 47 documents
texapp
Between May 08, 2020 and July 29, 2020 we have 0 documents. We are missing 3006 documents spread among the 14 texapp courts
texcrimapp
Between May 07, 2020 and August 18, 2020 we have 0 documents. We are missing 46 documents
This is not strictly a 0-gap, we have some documents, but given the volume of opinions this court has, a 0-gap candidate with data seemed strange to me: Between April 27, 2017 and July 25, 2017 we have 18 documents. For that time period, there are 207 opinions available.
texag
Gap from March 11, 2019 to February 10, 2023, which amounts to 199 missing documents (52 in 2019, 68 in 2020, 51 in 2021, 28 in 2022)
vtsuperct
Between June 05, 2020 and March 31st, 2022, we have 4 documents. We are missing ~175 documents from civil, criminal, family and environmental courts.
Between November 16th, 2017 and October 25th, 2019 we have 3 documents. We are missing around 375 documents from civil, criminal, family and environmental courts. Note that these are not only opinions, but orders and decisions too. The scraper does not filter them
Between March 23, 2017 and August 18, 2017 we have 0 documents.
Between January 17, 2020 and May 15, 2020 we have 0 documents.
wyo
We have a gap in opinions from August 28th, 2019 to May 17th 2022. The source reports 445 opinions for that time period.
afcca
Between February 1st, 2021 and February 1st 2023, we have 0 documents in CL. We are missing more than 250 documents from 2021 and 2022 (198)
bap9
Between July 07, 2021 and January 24, 2022 we have 0 documents. We are missing documents, which can be found filtering by document type (no filter view will not go that far back in time)
bap10
We have 0 documents between July 21st, 2021 and January 23rd, 2023. From the source, we are missing around 10 documents
olc
We have 0 documents between January 16th, 2020 and July 6th, 2021. Missing 16 documents
Also missing this document for range 2013-10-21 to 2014-06-16
tax
0 documents between November 20th, 2020 and January 26th, 2022. Filtering by those dates on the source, there are more than 200 docs (I tried splitting the range in half, and there are still more than 100 in each half)
nmcca
Between October 04, 2021 and January 04, 2022 we have 0 documents. There is data in the source for this time period
Other gaps
False 0 gaps
`mspb` shows a gap from 2019-02-15 to 2022-03-18, but there seems to be no data for that period on the source.
`nyag` shows many gaps, but in the source itself the documents are sparse. It could be that it publishes few opinions, but we should double check if the link we are scraping is the proper one.
`delctapp`: Between March 19, 2020 and July 29, 2020 we have 0 documents, but there is no data for that period on the source
`ariz`: Between September 16, 2019 and January 23, 2020 we have 0 documents. There is no data in the source for that period
`indtc` has 0 documents between May 24, 2018 and September 24, 2018, but there is no data in the source for that period
`la`: Between June 26, 2019 and October 22, 2019 we have 0 documents. No opinions on the site for that time period.
`ri`: Between June 27, 2017 and October 20, 2017 we have 0 documents, a 4 month gap. This coincides with the court terms; there are no opinions for this time period
`iowa` seems to have gaps, but there are no documents in the following periods. Must be related to court terms: Between June 30, 2021 and October 15, 2021 we have 0 documents. Between June 30, 2022 and October 14, 2022 we have 0 documents. Between June 30, 2023 and October 13, 2023 we have 0 documents.
date_approximate
`bia`, `texag` are using `date_filed_is_approximate`, so opinions are clustered into July of each year. Difficult to assess if there is a gap
query
This is the query I am using, which has some steps but I think is readable and runs in reasonable time
I will write up the rest of this next week, but at first look I see 46 distinct sources with 0-gaps wider than 365 days on the dev database. Some of them have multiple big gaps