freelawproject / juriscraper

An API to scrape American court websites for metadata.
https://free.law/juriscraper/
BSD 2-Clause "Simplified" License

Begin Gap Analysis #929

Open grossir opened 4 months ago

grossir commented 4 months ago

I think there are 3 big classes of gaps:

0-gap

I think this is the easiest analysis, so I checked it on the dev database and compared it to the live DB with CL searches. Below are some examples, with the dates the 0-gap spans and the number of missing documents. In my dev DB queries the gap sometimes spans more time, probably because a backscraper was run after the DB snapshot was taken.

alaska

Bug in the scraper is the cause of the gaps. See #937

arkctapp

From June 5th, 2019 to February 9th, 2022, we have 0 documents. Missing around 1100 docs: 586 docs in 2020, 511 in 2021, and some more from 2019/2022 (Source)

ark

From June 21st, 2019 to May 13th, 2020 we have 0 documents. Around 300 missing documents from second half 2019 and first half 2020

Between September 11, 2020 and March 03, 2021 we have 0 documents. We are missing around 100 documents

From March 19th, 2021 to December 1st, 2021 we have 0 documents. Around 130 missing documents

colo

0 documents between June 5th, 2021 and January 30, 2022. I have identified some missing opinions, but they are not many

Between June 21, 2022 and January 09, 2023 we have 0 documents. There are documents for this time period on the source

coloctapp

Between September 29, 2021 and February 02, 2022 we have 0 documents. We are missing documents, but we now have to go into the PDFs to get them

conn

Between May 25, 2022 and January 09, 2023 we have 0 documents. We are missing documents because the scraper looks for the string "Published in the Law Journal", but in that year the format is "Published in the Connecticut Law Journal"
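One way to tolerate both header variants is to make the state name optional in the match. This is a minimal sketch, assuming a regular expression is an acceptable fix; the pattern name and the sample strings are hypothetical and it is not the actual conn scraper code:

```python
import re

# Hypothetical pattern: "Connecticut " is optional, so both header
# variants mentioned above are accepted. Not taken from the scraper.
LAW_JOURNAL_RE = re.compile(r"Published in the (?:Connecticut )?Law Journal")

samples = [
    "Published in the Law Journal",
    "Published in the Connecticut Law Journal",
    "Published in another journal",
]

# True where the header would be recognized by the pattern
matches = [bool(LAW_JOURNAL_RE.search(s)) for s in samples]
print(matches)
```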

fla

Between September 06, 2019 and December 18, 2019 we have 0 documents. We are missing 35 documents

illappct

Between October 22, 2019 and May 21, 2020 we have 0 documents. We are missing around 1600 documents (152 per page, 11 full pages)

Between May 29, 2021 and November 15, 2021 we have 0 documents. We are missing around 1350 documents

ky

Between December 28th, 2017 and May 16th, 2020, there are 6 documents in CL. There should be 626 according to the source

I notice that the interface I found is different from the endpoints used in the scraper. This endpoint contains the opinion type, too (For example AFFIRMING IN PART AND REVERSING IN PART AND REMANDING, AFFIRMING AND REMANDING, etc)

mass

This source doesn't list old opinions, one has to go to external providers (recommended by the state source itself) to find them.

Between September 30th, 2021 and February 6th, 2023 we have 0 documents. According to FindLaw there are 261 Supreme Court opinions

massappct

This source doesn't list old opinions, one has to go to external providers to find them.

Between May 24th, 2021 and February 7th, 2023 we only have 8 opinions. According to FindLaw, there are 562.

md

0 opinions for 2020. Missing 143 documents

mdctspecapp

Between November 1st, 2019 and January 27, 2021, we have 0 documents. Missing 114 documents from 2020, and around 30 from 2019.

miss

Between October 25, 2019 and February 12, 2020 we have 0 documents. We are missing documents, but they are difficult to count. For example, a search for MARSHA R. HINTON AND THOMAS F. HINTON v. SPORTSMAN’S GUIDE, INC returns no results

ny

Between April 27, 2018 and February 13, 2019 we have 1 document. We are missing 92 documents

Between June 16, 2023 and October 18, 2023 we have 0 documents. We are missing 2 documents

Between June 04, 2021 and October 06, 2021 we have 0 documents. We are missing 5 documents

nyappterm

Between June 15th, 2020 and February 2nd, 2023 we have 5 documents in CL. From the source this amounts to more than 900 documents: 399 from Appellate Term, 1st Dept, and more than 500 from 2nd Dept.

nysupct

This applies to nysupct_commercial scraper. The current nysupct scraper was recently implemented as part of nytrial. We have 34 documents from June 11th, 2020 to April 25th, 2023. These documents have a bunch of duplicates. There should be 246 for that time period, according to the source.

nm

Between 2019-04-11 and 2021-03-04 we have 6 opinions. Should have 80

Between 2021-04-13 and 2022-01-09 we have 0 opinions, should have 30.

nmctapp

Between June 7th, 2019 and April 1st, 2022 we have 3 documents in CL. From the source, we are missing 1265 documents

nc

Between September 25, 2020 and February 05, 2021 we have 0 documents. We are missing around 50 published opinions from late 2020.

ncctapp

From November 4, 2020 to January 1st, 2022 we have 0 documents. There is data in the source for this time period

pa

Between May 28, 2021 and November 16, 2021 we have 0 documents. There are 1261 documents in that time period, some of which are not opinions

pasuperct

Between May 28, 2021 and November 17, 2021 we have 0 documents. We are missing around 1504 "Non Precedential" documents and 164 Precedential opinions (source)

pacommwct

Between May 28, 2021 and August 23, 2021 we have 0 documents. We are missing 49 precedential opinions

sd

We have a gap from August 30th, 2019 to January 4th, 2023, which should amount to a little over 240 missing documents (totals are not so easy to count here)

tex

Between May 05, 2020 and August 20, 2020 we have 2 documents. We are missing 47 documents

texapp

Between May 08, 2020 and July 29, 2020 we have 0 documents. We are missing 3006 documents spread among the 14 texapp courts

texcrimapp

Between May 07, 2020 and August 18, 2020 we have 0 documents. We are missing 46 documents

This is not strictly a 0-gap, we have some documents, but given the volume of opinions this court has, a 0-gap candidate with data seemed strange to me: Between April 27, 2017 and July 25, 2017 we have 18 documents. For that time period, there are 207 opinions available.

texag

Gap from March 11, 2019 to February 10, 2023, which amounts to 199 missing documents (52 in 2019, 68 in 2020, 51 in 2021, 28 in 2022)

vtsuperct

Between June 05, 2020 and March 31st, 2022, we have 4 documents. We are missing ~175 documents from civil, criminal, family and environmental courts.

Between November 16th, 2017 and October 25th, 2019 we have 3 documents. We are missing around 375 documents from civil, criminal, family and environmental courts. Note that these are not only opinions, but orders and decisions too. The scraper does not filter them

Between March 23, 2017 and August 18, 2017 we have 0 documents.

Between January 17, 2020 and May 15, 2020 we have 0 documents.

wyo

We have a gap in opinions from August 28th, 2019 to May 17th, 2022. The source reports 445 opinions for that time period.


afcca

Between February 1st, 2021 and February 1st 2023, we have 0 documents in CL. We are missing more than 250 documents from 2021 and 2022 (198)

bap9

Between July 07, 2021 and January 24, 2022 we have 0 documents. We are missing documents, which can be found filtering by document type (no filter view will not go that far back in time)

bap10

We have 0 documents between July 21st, 2021 and January 23rd, 2023. From the source, we are missing around 10 documents

olc

We have 0 documents between January 16th, 2020 and July 6th, 2021. Missing 16 documents

Also missing this document for range 2013-10-21 to 2014-06-16

tax

0 documents between November 20th, 2020 and January 26th, 2022. Filtering by those dates on the source, there are more than 200 docs (tried splitting the range in half, there is still more than 100 in each half)

nmcca

Between October 04, 2021 and January 04, 2022 we have 0 documents. There is data in the source for this time period

Other gaps

False 0 gaps

mspb shows a gap from 2019-02-15 to 2022-03-18, but there seems to be no data for that period on the source.

nyag shows many gaps, but in the source itself the documents are sparse. It could be that it publishes few opinions, but we should double check whether the link we are scraping is the proper one.

delctapp: Between March 19, 2020 and July 29, 2020 we have 0 documents, but there is no data for that period on the source

ariz: Between September 16, 2019 and January 23, 2020 we have 0 documents. There is no data in the source for that period

indtc: Between May 24, 2018 and September 24, 2018 we have 0 documents, but there is no data in the source for that period

la: Between June 26, 2019 and October 22, 2019 we have 0 documents. There are no opinions on the site for that time period.

ri: Between June 27, 2017 and October 20, 2017 we have 0 documents, a 4 month gap. This coincides with the court terms, there are no opinions for this time period

iowa seems to have gaps, but there are no documents in the following periods. Must be related to court terms: Between June 30, 2021 and October 15, 2021 we have 0 documents. Between June 30, 2022 and October 14, 2022 we have 0 documents. Between June 30, 2023 and October 13, 2023 we have 0 documents.

date_approximate

bia and texag use date_filed_is_approximate, so opinions are clustered into July of each year. This makes it difficult to assess whether there is a gap

query

This is the query I am using, which has some steps but I think is readable and runs in reasonable time

WITH 
-- Number of opinions per court and filing date
court_day_counts as (
  SELECT
      court_id,
      date_filed,
      count(*) as n
  FROM
      search_opinion
  INNER JOIN
      (
      -- Only clusters filed in 2011 or later
      SELECT id as cluster_id, docket_id, date_filed
      FROM search_opinioncluster
      WHERE date_part('year', date_filed) > 2010
      ) cluster
      USING (cluster_id)
  INNER JOIN
      (
          -- Only dockets whose source code is in this list
          SELECT id as docket_id, court_id, source, case_name 
          FROM search_docket
          WHERE source IN (2,3,6,7,10,11,14,15,18,19,22,23,26,27,30,31,66,82)
      ) docket
      USING (docket_id)
  GROUP BY 
    court_id, date_filed
  ORDER BY
    court_id, date_filed
),

-- Pair each filing date with the previous filing date of the same court
court_lag_counts as (
  SELECT
    court_id,
    date_filed,
    lag(date_filed, 1) OVER (PARTITION BY court_id ORDER BY date_filed) as prev_date,
    n,
    lag(n, 1) OVER (PARTITION BY court_id ORDER BY date_filed)  as prev_n
  FROM
    court_day_counts
)

-- Keep only consecutive filing dates that are more than 3 days apart
SELECT
  court_id,
  date_filed,
  prev_date,
  (date_filed - prev_date) AS days,
  n,
  prev_n
FROM
  court_lag_counts
WHERE 
  (date_filed - prev_date) > 3
;
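For readers who prefer to prototype outside the database, the same lag-and-difference idea can be sketched in Python. This is only an illustration under stated assumptions: the function name `find_gaps`, its `min_days` parameter and the sample rows are hypothetical (the ark dates are borrowed from the list above), and it works on in-memory `(court_id, date_filed)` pairs rather than on the CL tables:

```python
from collections import defaultdict
from datetime import date


def find_gaps(rows, min_days=3):
    """Return (court_id, prev_date, date_filed, days) for gaps wider than min_days.

    `rows` is an iterable of (court_id, date_filed) pairs, one per opinion.
    """
    dates_by_court = defaultdict(set)
    for court_id, date_filed in rows:
        dates_by_court[court_id].add(date_filed)

    gaps = []
    for court_id in sorted(dates_by_court):
        ordered = sorted(dates_by_court[court_id])
        # Pair each filing date with the previous one, like lag() in the query
        for prev_date, date_filed in zip(ordered, ordered[1:]):
            days = (date_filed - prev_date).days
            if days > min_days:
                gaps.append((court_id, prev_date, date_filed, days))
    return gaps


# Made-up sample rows, for illustration only
rows = [
    ("ark", date(2019, 6, 20)),
    ("ark", date(2019, 6, 21)),
    ("ark", date(2020, 5, 13)),
    ("fla", date(2019, 9, 5)),
    ("fla", date(2019, 9, 6)),
]
gaps = find_gaps(rows)
for gap in gaps:
    print(gap)
```

Raising `min_days` (for example to 365) narrows the output to the widest gaps only.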

I will write up the rest of this next week, but at first look, I see 46 distinct sources with 0-gaps wider than 365 days on the dev database. Some of them have multiple big gaps

flooie commented 4 months ago

Thanks @grossir -

grossir commented 2 months ago

Another approach for non-obvious gaps is using the search_citation table.

We have 8 citation types. On some of these types, such as NEUTRAL, the page may actually be a serial number. For a given reporter and volume, we would expect the page to be a sequence 1,2,3,4... That way, it would be easy to detect missing citations, and through some clever joining, get the date range to back scrape the source

For example, we have 2018 CO 1 and 2018 CO 3, but we are missing 2018 CO 2
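The sequence check can be sketched as follows. This is a rough illustration, assuming citations are available as `(volume, reporter, page)` tuples and that the page really is a serial number; the function name `missing_serials` and the sample data are hypothetical:

```python
from collections import defaultdict


def missing_serials(citations):
    """Map (reporter, volume) to the sorted serial numbers absent from 1..max.

    `citations` is an iterable of (volume, reporter, page) tuples where
    `page` is assumed to be a serial number, as with some NEUTRAL citations.
    """
    seen = defaultdict(set)
    for volume, reporter, page in citations:
        seen[(reporter, volume)].add(int(page))

    missing = {}
    for key, pages in seen.items():
        # Expected sequence is 1..highest page we hold for this volume
        absent = sorted(set(range(1, max(pages) + 1)) - pages)
        if absent:
            missing[key] = absent
    return missing


# Sample mirroring the 2018 CO example, plus one volume without gaps
citations = [("2018", "CO", "1"), ("2018", "CO", "3"), ("2019", "CO", "1")]
result = missing_serials(citations)
print(result)
```

Note that this only finds holes below the highest serial number we already have; citations missing from the end of a volume would need a count from the source.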

Sometimes these gaps may be false positives, because a single opinion may be citable through different systems/reporters (parallel citation?). I will sample at least one opinion of each gap to see if that's the case

Citation types for reference:

    FEDERAL = 1
    STATE = 2
    STATE_REGIONAL = 3
    SPECIALTY = 4
    SCOTUS_EARLY = 5
    LEXIS = 6
    WEST = 7
    NEUTRAL = 8

    CITATION_TYPES = (
        (FEDERAL, "A federal reporter citation (e.g. 5 F. 55)"),
        (
            STATE,
            "A citation in a state-based reporter (e.g. Alabama Reports)",
        ),
        (
            STATE_REGIONAL,
            "A citation in a regional reporter (e.g. Atlantic Reporter)",
        ),
        (
            SPECIALTY,
            "A citation in a specialty reporter (e.g. Lawyers' Edition)",
        ),
        (
            SCOTUS_EARLY,
            "A citation in an early SCOTUS reporter (e.g. 5 Black. 55)",
        ),
        (LEXIS, "A citation in the Lexis system (e.g. 5 LEXIS 55)"),
        (WEST, "A citation in the WestLaw system (e.g. 5 WL 55)"),
        (NEUTRAL, "A vendor neutral citation (e.g. 2013 FL 1)"),
    )

NEUTRAL citation

Gaps

CO

Analysis of these did point to a gap, probably because the source we scrape publishes only a subset of opinions

False positives

Neb. Ct. App.

This reporter has type NEUTRAL but the "page" value is actually a page. See source

Ohio

The source allows filtering by year decided and ordering by neutral citation, so it is easy to inspect what we are missing

Volume: 1992

We have "missing" pages. However, after sampling the gaps, we do have the opinions on Courtlistener. We are just missing the neutral citations

grossir commented 3 weeks ago

New gap for dcd due to the time it took us to fix a change in the page. From April 1st, 2024 to May 8th, 2024 we have 0 opinions, but they do exist on the source

Will need to implement a backscraper

grossir commented 3 weeks ago

New gap for nd due to the time it took us to fix a change in the page (#1026). From April 5th, 2024 to May 29th, 2024 we have 0 opinions. There is data in the source

Will need to implement a backscraper

mlissner commented 3 weeks ago

These comments, @grossir are really exciting. The fact that we're not only tracking such gaps at this point, but also fixing them, is an exciting milestone. Thank you!

grossir commented 1 week ago

To keep track of it: we have no data for the Missouri Supreme Court (mo) since May 2022. Hopefully it can be solved once freelawproject/courtlistener#3996 is completed

grossir commented 5 days ago

There is a gap for njtaxct unpublished opinions, which are scraped by njtaxct_u.py. There are 284 documents before November 4th, 2022, which is the date filed of the oldest opinion we have for that court and precedential status.

We will need to implement a backscraper to fill this gap