State
The engine starts by scraping the agencies' websites for investigation IDs. At the end of the pipeline there is a single table containing all of the report documents (safety issues, recommendations, paragraphs and full text).
Problem
Many reports are dropped along the pipeline, for various reasons. Some causes are outside the engine's control (a corrupted PDF file, etc.); others are cases the engine can't handle (safety issues whose agency IDs have no matching report_ids).
Solution
By going through each step of the engine and comparing the previous table to the processed one, I can see how many reports are dropped at each stage.
From which reports go missing at which step, I can also infer, at least roughly, why they are dropped.
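
A minimal sketch of this stage-by-stage comparison, assuming each stage's table can be reduced to a set of report IDs. The stage names, the IDs and the function drop_off_by_stage are made up for illustration and are not the engine's actual code.

```python
from typing import Iterable, NamedTuple


class StageDrop(NamedTuple):
    stage: str          # name of the pipeline stage
    kept: int           # number of report IDs present after this stage
    dropped: list[str]  # IDs present in the previous stage but missing here


def drop_off_by_stage(stages: list[tuple[str, Iterable[str]]]) -> list[StageDrop]:
    """Compare each stage's report IDs with those of the stage before it."""
    results = []
    previous_ids = None
    for name, ids in stages:
        current_ids = set(ids)
        if previous_ids is None:
            dropped = []  # the first stage has nothing to compare against
        else:
            dropped = sorted(previous_ids - current_ids)
        results.append(StageDrop(name, len(current_ids), dropped))
        previous_ids = current_ids
    return results


# Made-up example data: stage names and report IDs are illustrative only.
example_stages = [
    ("scraped_ids", ["r1", "r2", "r3", "r4"]),
    ("parsed_pdfs", ["r1", "r2", "r4"]),
    ("final_table", ["r1", "r4"]),
]

for row in drop_off_by_stage(example_stages):
    print(f"{row.stage}: kept {row.kept}, dropped {row.dropped}")
```

Listing the dropped IDs per stage, rather than only counting them, makes it possible to look up individual reports and check whether the cause was something like a corrupted PDF or an unmatched ID.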