Rehaul of engine - Githubissues

1jamesthompson1 commented 2 weeks ago

State

The engine is supposed to be a pipeline that takes the PDF reports and outputs useful datasets.

Problem

The work conducted in the notebooks is well out of step with the engine. Development has gone fast as a result, but the pipeline is broken and does not work end to end as it should.

Solution

I need to fix it so that it can complete the whole pipeline:

1. Get all report PDFs and parse them into text files.
2. Extract the safety issues, report sections, etc.
3. Embed the various datasets and load/update the database which the webapp uses.

These three steps should result in a few dataframe files, stored as pickles or something similar.
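The three steps above can be sketched as a chain of functions whose outputs are written to pickle files. This is only an illustrative sketch, not the engine's actual code: every function name here (`parse_reports`, `extract_safety_issues`, `embed`, `run_pipeline`) is hypothetical, plain lists of dicts stand in for dataframes, the "Safety issue:" line prefix is an invented extraction rule, and the "embedding" is a toy placeholder rather than a real model.

```python
import pickle
from pathlib import Path

PREFIX = "Safety issue:"  # invented marker, used only for this sketch


def parse_reports(pdf_texts):
    """Step 1 stand-in: assume each PDF has already been converted to text."""
    return [{"report_id": rid, "text": text} for rid, text in pdf_texts.items()]


def extract_safety_issues(reports):
    """Step 2 stand-in: treat lines starting with PREFIX as safety issues."""
    rows = []
    for report in reports:
        for line in report["text"].splitlines():
            if line.startswith(PREFIX):
                rows.append({
                    "report_id": report["report_id"],
                    "safety_issue": line[len(PREFIX):].strip(),
                })
    return rows


def embed(rows):
    """Step 3 stand-in: a toy vector (character count, word count)."""
    return [
        {**row, "embedding": [len(row["safety_issue"]),
                              len(row["safety_issue"].split())]}
        for row in rows
    ]


def run_pipeline(pdf_texts, output_dir):
    """Run all three steps and pickle each intermediate dataset."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    reports = parse_reports(pdf_texts)
    issues = extract_safety_issues(reports)
    embedded = embed(issues)

    outputs = {
        "parsed_reports": reports,
        "safety_issues": issues,
        "embeddings": embedded,
    }
    paths = {}
    for name, data in outputs.items():
        path = output_dir / f"{name}.pkl"
        with path.open("wb") as f:
            pickle.dump(data, f)
        paths[name] = path
    return paths
```

Writing one file per stage means a later stage can be rerun from the previous stage's pickle without repeating the earlier work.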

Then, from there, two more small parts need to be added. However, as they involve the deployment, they might be left until another issue, #172, is closer to being solved.

Step 0: get all the current data from the databases so that we don't constantly repeat the same work.
Step n+1: upload the newly calculated datasets to the database.
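As a rough illustration of how step 0 and step n+1 could bracket the pipeline, the sketch below downloads the known report ids, processes only the missing ones, and uploads the new results. It assumes a plain dict standing in for the database, and the names `select_new_reports` and `run_incremental` are hypothetical; the real download and upload depend on the deployment decision.

```python
def select_new_reports(all_report_ids, existing_ids):
    """Return the report ids that have no stored result yet, in input order."""
    existing = set(existing_ids)
    return [rid for rid in all_report_ids if rid not in existing]


def run_incremental(all_reports, database, process):
    """Process only the reports missing from `database`, then store the results.

    `all_reports` maps report id -> raw text, `database` is a dict standing in
    for the real database, and `process` is whatever the pipeline does to one
    report.
    """
    # Step 0: find out what has already been computed.
    existing_ids = set(database)

    # Steps 1..n: run the pipeline on the new reports only.
    to_process = select_new_reports(all_reports, existing_ids)
    new_rows = {rid: process(all_reports[rid]) for rid in to_process}

    # Step n+1: upload the newly calculated results.
    database.update(new_rows)
    return to_process
```

For example, with `database = {"r1": "done"}` and two reports `r1` and `r2`, only `r2` is passed to `process`, and the existing entry for `r1` is left untouched.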

Lastly, it is worth noting that this is also a chance to refactor and make the experience of running the engine smoother, with better logs.

Related issues

#153
#59
#61

1jamesthompson1 commented 2 weeks ago

Given the changing capabilities of the project, I have purged a lot of older modules to keep the space fresh.

Furthermore, the structure of the project should really change.

Currently it is Gather_Wrangle -> Extract_Analyze.

Really it should be gather -> extract -> analyze.

1jamesthompson1 commented 2 weeks ago

The updating of the previous modules has been completed. They now all follow #153. However, not all follow #61.

Now an embedding class needs to be created.
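One possible shape for such a class is sketched below, purely as an assumption about what it might look like. The `Embedder` class, its `embed_rows` method, and `toy_embedding` are all hypothetical names; `toy_embedding` is a deterministic hash-based placeholder, not a real embedding model, and a real implementation would pass in a call to whichever model the project settles on.

```python
import hashlib
from typing import Callable, Dict, List, Sequence


def toy_embedding(text: str, dim: int = 8) -> List[float]:
    """Placeholder vector from a SHA-256 digest; carries no semantic meaning."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


class Embedder:
    """Adds an embedding column to rows of an extracted dataset."""

    def __init__(self, embed_fn: Callable[[str], List[float]] = toy_embedding):
        # The embedding function is injected so the model can be swapped later.
        self.embed_fn = embed_fn

    def embed_rows(self, rows: Sequence[Dict], text_key: str) -> List[Dict]:
        """Return copies of `rows` with a new '<text_key>_embedding' field."""
        return [
            {**row, f"{text_key}_embedding": self.embed_fn(row[text_key])}
            for row in rows
        ]
```

Keeping the model behind an injected function means the same class could serve safety issues, report sections, or any other text column without changes.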

The last step of database download and upload should wait until what deployment looks like has been decided.

1jamesthompson1 / TAIC-report-summary

Rehaul of engine #174
