freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Existing `Docket.date_argued` is sometimes set to None #4150

Open grossir opened 5 days ago

grossir commented 5 days ago

When processing audio files for transcription, we noticed a bunch missing the local_path_mp3 attribute. They did have the local_path_original_file. However, when trying to create the processed file local_path_mp3 from the saved local_path_original_file, the processing would fail because there was no Docket.date_argued to build the upload path.

An example audio and its docket

However, how can these records have a local_path_original_file when both paths are created with the same function, make_upload_path? https://github.com/freelawproject/courtlistener/blob/28c12c4a570921e376809d1cb62c59f43eb96d71/cl/audio/models.py#L105-L118

This implies that Docket.date_argued existed, but was then deleted. This may happen when:

  1. The oral argument is scraped, and has a Docket.date_argued
  2. The local_path_mp3 fails to be created near that time (probably a doctor problem)
  3. An opinion is scraped for the same Docket, and sets Docket.date_argued to None

Both scrapers use scrapers.utils.update_or_create_docket, which will set date_argued to None if it is not passed https://github.com/freelawproject/courtlistener/blob/28c12c4a570921e376809d1cb62c59f43eb96d71/cl/scrapers/utils.py#L288-L341
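A minimal sketch of the suspected overwrite, and of one possible guard against it. This is not the actual update_or_create_docket code: the docket is modeled as a plain dict, and both helper names are hypothetical, used only to illustrate the difference between assigning every field and skipping fields whose new value is None.

```python
from datetime import date


def update_fields_overwriting(docket: dict, **fields) -> dict:
    """Mimic the suspected behavior: assign every field, even when it is None."""
    for name, value in fields.items():
        docket[name] = value
    return docket


def update_fields_preserving(docket: dict, **fields) -> dict:
    """One possible fix: leave the existing value alone when the new one is None."""
    for name, value in fields.items():
        if value is not None:
            docket[name] = value
    return docket


# Step 1: the oral argument scraper stores a docket with an argument date.
docket = {"case_name": "Doe v. Roe", "date_argued": date(2023, 5, 1)}

# Step 3: the opinion scraper has no argument date, so it passes None.
overwritten = update_fields_overwriting(
    dict(docket), case_name="Doe v. Roe", date_argued=None
)
preserved = update_fields_preserving(
    dict(docket), case_name="Doe v. Roe", date_argued=None
)
```

With the first helper the stored date is lost; with the second it survives the opinion scrape. Whether skipping None is the right rule for every docket field is a separate design question.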

Sentry Issue: COURTLISTENER-7GR

AttributeError: 'NoneType' object has no attribute 'year'
(1 additional frame(s) were not displayed)
...
  File "cl/scrapers/tasks.py", line 373, in process_audio_file
    audio_obj.local_path_mp3.save(file_name, cf, save=False)
  File "cl/lib/model_helpers.py", line 219, in make_upload_path
    d.year,

mlissner commented 5 days ago

Ah ha. So we've been wiping out data in that function for years?

grossir commented 5 days ago

We have. 6053 rows affected

select count(*)
from audio_audio aa
inner join search_docket sd on sd.id = aa.docket_id
where sd.date_argued is null and aa.local_path_original_file <> ''

I guess there are not so many because the oral argument is usually scraped after the opinion, probably because it takes some time to process / publish the oral argument?

Also, the good thing is that the date can be reconstructed from the local_path_original_file, and recovered for the dockets
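A rough sketch of how that recovery could look, assuming the stored path contains a /YYYY/MM/DD/ segment produced by make_upload_path (the exact path layout is an assumption here, and date_from_upload_path is a hypothetical helper, not something in the codebase):

```python
import re
from datetime import date
from typing import Optional

# Assumed layout: "<prefix>/YYYY/MM/DD/<filename>"; adjust if the real paths differ.
DATE_IN_PATH = re.compile(r"(?:^|/)(\d{4})/(\d{2})/(\d{2})/")


def date_from_upload_path(path: str) -> Optional[date]:
    """Return the date embedded in an upload path, or None if none can be parsed."""
    match = DATE_IN_PATH.search(path)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # The digits matched the pattern but do not form a real calendar date.
        return None
```

Rows where this returns None would be the ones that cannot be fixed this way and would need another source for the date.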

mlissner commented 4 days ago

Usually oral args are before opinions, but yeah, I guess we should clean up all of that.

How many items won't be fixable this way?

Thanks for tracking this down!