Open grossir opened 5 days ago
Ah ha. So we've been wiping out data in that function for years?
We have. 6053 rows are affected:
select count(*)
from audio_audio aa
inner join search_docket sd on sd.id = aa.docket_id
where sd.date_argued is null and local_path_original_file <> ''
I guess there are not that many because the oral argument is usually scraped after the opinion, probably because oral arguments take some time to process and publish?
Also, the good thing is that the date can be reconstructed from the local_path_original_file and recovered for the dockets.
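A minimal sketch of that reconstruction, assuming the stored path embeds the date as `<prefix>/<YYYY>/<MM>/<DD>/<filename>`. That layout is an assumption about what make_upload_path produces and should be checked against the real paths. The helper name `date_from_upload_path` is hypothetical and not part of the codebase.

```python
import re
from datetime import date
from typing import Optional

# Assumed layout (not confirmed in this thread): <prefix>/<YYYY>/<MM>/<DD>/<filename>
_DATE_IN_PATH = re.compile(r"(?:^|/)(\d{4})/(\d{2})/(\d{2})/")


def date_from_upload_path(path: str) -> Optional[date]:
    """Return the date embedded in an upload path, or None if none is found.

    Hypothetical helper for illustration only.
    """
    match = _DATE_IN_PATH.search(path)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # The digits matched the pattern but do not form a real calendar date.
        return None
```

Rows for which this returns None would be the ones that cannot be fixed this way.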
Usually oral args are before opinions, but yeah, I guess we should clean up all of that.
How many items won't be fixable this way?
Thanks for tracking this down!
When processing audio files for transcription, we noticed a bunch missing the local_path_mp3 attribute. They did have the local_path_original_file. However, when trying to get the processed file local_path_mp3 using the saved local_path_original_file, the processing would fail because there was no Docket.date_argued to create the upload path.

An example audio and its docket
However, how can these records have a local_path_original_file, when to create both we use the same function make_upload_path? https://github.com/freelawproject/courtlistener/blob/28c12c4a570921e376809d1cb62c59f43eb96d71/cl/audio/models.py#L105-L118

This implies that Docket.date_argued existed, but was then deleted. This may happen when:
local_path_mp3 fails to be created near that time (probably a doctor problem)
Both scrapers use scrapers.utils.update_or_create_docket, which will set date_argued to None if it is not passed https://github.com/freelawproject/courtlistener/blob/28c12c4a570921e376809d1cb62c59f43eb96d71/cl/scrapers/utils.py#L288-L341

Sentry Issue: COURTLISTENER-7GR
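
The overwrite described above for update_or_create_docket can be illustrated with a small, self-contained sketch. It is not the real function: `DocketStub`, `update_docket_overwriting`, and `update_docket_preserving` are hypothetical names, and the second function shows only one possible fix, which is to skip the assignment when no value is passed.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DocketStub:
    """Hypothetical stand-in for a docket row; not the real Docket model."""

    docket_number: str
    date_argued: Optional[date] = None


def update_docket_overwriting(
    docket: DocketStub, date_argued: Optional[date] = None
) -> DocketStub:
    # Mirrors the reported behavior: a caller that omits date_argued
    # replaces the stored value with None.
    docket.date_argued = date_argued
    return docket


def update_docket_preserving(
    docket: DocketStub, date_argued: Optional[date] = None
) -> DocketStub:
    # One possible fix (an assumption, not the project's chosen fix):
    # only assign when the caller actually supplied a value.
    if date_argued is not None:
        docket.date_argued = date_argued
    return docket
```

With the first version, an opinion scraper that updates a docket without passing date_argued erases a date previously stored by the oral argument scraper. With the second, the stored date survives.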