Find out where reports are being dropped in the data pipeline, and for what reason

State

The engine starts by scraping the agencies' websites for investigation IDs. At the end of the pipeline there is a single table with all the report documents (safety issues, recommendations, paragraphs and whole text).

Problem

Many reports are dropped along the pipeline, for various reasons. Some causes are outside our control (the PDF file is corrupted, etc.); others are cases the engine can't handle (safety issues with agency IDs that have no matching report_ids).

Solution

By going through each step of the engine and comparing each processed table with the table from the previous step, I can see how many reports are dropped at each stage.

From there I can also infer, at least roughly, why the reports are dropped.
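
A minimal sketch of that stage-by-stage comparison, assuming each stage's table can be reduced to a collection of report IDs. The stage names (`scraped_ids`, `downloaded_pdfs`, `final_table`) and the IDs are hypothetical placeholders, not the engine's real table names or data.

```python
def dropped_per_stage(stages):
    """Compare each stage with the one before it.

    `stages` is an ordered list of (stage_name, report_ids) pairs.
    Returns one dict per transition, listing the IDs that were present
    in the previous stage but are missing from the next one.
    """
    summary = []
    for (prev_name, prev_ids), (name, ids) in zip(stages, stages[1:]):
        dropped = set(prev_ids) - set(ids)
        summary.append(
            {
                "from": prev_name,
                "to": name,
                "dropped_count": len(dropped),
                "dropped_ids": sorted(dropped),
            }
        )
    return summary


# Hypothetical stages and IDs, for illustration only.
stages = [
    ("scraped_ids", ["A-001", "A-002", "A-003", "A-004"]),
    ("downloaded_pdfs", ["A-001", "A-002", "A-004"]),
    ("final_table", ["A-001", "A-004"]),
]

summary = dropped_per_stage(stages)
for row in summary:
    print(f"{row['from']} -> {row['to']}: {row['dropped_count']} dropped {row['dropped_ids']}")
```

Keeping the dropped IDs, not just the counts, makes it possible to inspect individual reports afterwards and work out why each one went missing.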

1jamesthompson1 / TAIC-report-summary