1jamesthompson1 / TAIC-report-summary

Using LLM technologies to analyze transport accident investigation reports
GNU General Public License v3.0

Find out where reports are being dropped off in the data pipeline and for what reason #286

Open 1jamesthompson1 opened 3 days ago

1jamesthompson1 commented 3 days ago

State

The engine starts by scraping the agencies' websites for investigation IDs. At the end of the pipeline there is a single table containing all of the report documents (safety issues, recommendations, paragraphs and whole text).

Problem

Many reports are dropped along the pipeline for various reasons. Some causes are outside our control (a corrupted PDF file, etc.); others are cases the engine can't handle (safety issues with agency IDs that have no matching report_ids).

Solution

By going through each step of the engine and comparing the previous table to the processed one, I can see how many reports are dropped off at each stage.

From that I can also infer, at least partly, why the reports are dropped off.
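
A rough sketch of the stage-by-stage comparison described above, using plain Python sets. The stage names, the report IDs and the `dropped_per_stage` helper are all hypothetical and only illustrate the idea; they are not taken from the engine's code, and the real tables would need their report ID column extracted first.

```python
from typing import Collection


def dropped_per_stage(stages: list[tuple[str, Collection[str]]]) -> list[dict]:
    """Compare each stage's report IDs with those of the stage before it.

    `stages` is an ordered list of (stage_name, report_ids) pairs.
    Returns one summary row per transition between consecutive stages.
    """
    summary = []
    for (prev_name, prev_ids), (name, ids) in zip(stages, stages[1:]):
        prev_set, cur_set = set(prev_ids), set(ids)
        summary.append(
            {
                "previous_stage": prev_name,
                "stage": name,
                "reports_in": len(prev_set),
                "reports_out": len(cur_set),
                # IDs present before this stage but missing after it
                "dropped": sorted(prev_set - cur_set),
            }
        )
    return summary


# Hypothetical stage outputs (made-up names and IDs, for illustration only)
stages = [
    ("scraped_ids", ["2019_201", "2019_202", "2020_101", "2020_102"]),
    ("downloaded_pdfs", ["2019_201", "2019_202", "2020_101"]),
    ("parsed_text", ["2019_201", "2020_101"]),
]

for row in dropped_per_stage(stages):
    print(
        f"{row['previous_stage']} -> {row['stage']}: "
        f"{len(row['dropped'])} dropped {row['dropped']}"
    )
```

Keeping the list of dropped IDs, rather than only a count, makes it possible to look up individual reports afterwards and work out why each one was lost.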

1jamesthompson1 commented 2 days ago

This has been partly addressed by 117e6db66aef8bf90d5273f223222be83c20ea03