freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Store iquery crawl HTMLs #4231

Closed albertisfu closed 1 month ago

albertisfu commented 1 month ago

We're going to crawl all PACER iquery pages, so before running the daemon we need to save the HTML of each page. That way the pages can be reprocessed once https://github.com/freelawproject/courtlistener/issues/2185 is solved.

mlissner commented 1 month ago

Thanks. @quevon24, you just did this for the free docs crawler. How much time do you think it'd take to add it to the iquery crawler?

quevon24 commented 1 month ago

It shouldn't take that long. We just need to add a new upload type and an object to relate it to the file (it uses a generic foreign key).
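
To illustrate the idea, here is a rough, framework-free sketch of a new upload type plus a record that ties the saved HTML file to an arbitrary object. Every name in it (`UploadType`, `GenericRef`, `StoredHtml`, `store_iquery_html`, the enum values, and the file path layout) is hypothetical and not taken from the CourtListener codebase. The `(content_type, object_id)` pair only mimics what a generic foreign key does, and a plain dict stands in for real file storage.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import IntEnum


class UploadType(IntEnum):
    # Hypothetical values; the real project defines its own upload types.
    FREE_OPINIONS_REPORT = 1
    IQUERY_PAGE = 2  # the new type proposed in this issue


@dataclass(frozen=True)
class GenericRef:
    # Stand-in for a generic foreign key: a type label plus a primary key,
    # so one file record can point at any kind of object.
    content_type: str
    object_id: int


@dataclass(frozen=True)
class StoredHtml:
    upload_type: UploadType
    filepath: str
    target: GenericRef


def store_iquery_html(html: str, target: GenericRef, storage: dict[str, str]) -> StoredHtml:
    # Save the raw HTML first so it can be reprocessed later, then
    # return the record that links the saved file to its object.
    filepath = f"iquery/{target.content_type}/{target.object_id}.html"
    storage[filepath] = html
    return StoredHtml(UploadType.IQUERY_PAGE, filepath, target)


# Usage: a dict plays the role of the file store (an assumption).
storage: dict[str, str] = {}
record = store_iquery_html(
    "<html>case summary</html>", GenericRef("docket", 42), storage
)
```

Because the record keeps both the upload type and the generic reference, a later reprocessing job could select only the iquery pages and look up the object each one belongs to.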

mlissner commented 1 month ago

Cool. I'm torn about whether you or Alberto are better to take it on. It's a distraction for both of you, really. You just did it, but iquery is his wheelhouse. Anybody have a preference?

albertisfu commented 1 month ago

Sure, I could do it, and Kevin could review it, if that's OK.

mlissner commented 1 month ago

Works for me.