Decide how many pages of cases we want to scrape

MahinRahman8901 / c10-Court-Transcript

A data pipeline to automate the enhancement, discoverability and analysis of real courtroom documents.

Decide how many pages of cases we want to scrape #38

Open ErvinRex opened 2 months ago

ErvinRex commented 2 months ago

Task Description

We need a backlog of cases to visualise patterns in judge verdicts, in addition to the ~2+ new High Court cases each day.
The more cases we process, the more it will cost when using GPT-AI.
We will start with a few pages for now, but will later want a larger historical dataset.

User Stories

Stakeholders investigating judge verdict patterns will need a historical dataset to get a clear picture of the way a judge rules.

Relevant Files [If Available]

ayeshaa63 commented 2 months ago

I think we could start with the first 5 pages - that would give us 50 cases to work with as a starting point.

ErvinRex commented 2 months ago

Now that we know it does not take long to scrape and run the ETL pipeline, we could viably go through ~50 pages to get a large enough dataset to be used in our dashboard.

The daily run will only cover about 2-3 cases in the High Court for now. We can look later at how adding more courts affects the run time.
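
To make the trade-off between backlog size and GPT cost easier to compare, here is a minimal sketch of the arithmetic. It assumes 10 cases per page (inferred from "5 pages ... 50 cases" above). The function name `estimate_backlog` and the `cost_per_case` value are hypothetical placeholders, not figures measured from our pipeline.

```python
# Assumption: 10 cases per listing page, inferred from "5 pages -> 50 cases".
CASES_PER_PAGE = 10


def estimate_backlog(pages: int, cost_per_case: float,
                     cases_per_page: int = CASES_PER_PAGE) -> tuple[int, float]:
    """Return (number of cases, estimated GPT cost) for a one-off backlog scrape."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    cases = pages * cases_per_page
    return cases, cases * cost_per_case


if __name__ == "__main__":
    # cost_per_case is a placeholder; replace it with a real per-case figure.
    for pages in (5, 50):
        cases, cost = estimate_backlog(pages, cost_per_case=0.5)
        print(f"{pages} pages -> {cases} cases, estimated cost {cost:.2f}")
```

Swapping in a measured per-case cost would let us compare the 5-page and ~50-page options directly before committing to the larger scrape.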