1jamesthompson1 / TAIC-report-summary

Using LLM technologies to analyze transport accident investigation reports
https://taic-viewer-72e8675c1c03.herokuapp.com/
GNU General Public License v3.0

Rehaul of engine #174

Closed 1jamesthompson1 closed 1 week ago

1jamesthompson1 commented 2 weeks ago

State

The engine is supposed to be a pipeline that takes the PDF reports and outputs useful datasets.

Problem

The work conducted in the notebooks is well out of sync with the engine. Development has gone fast, but as a result the pipeline is broken and does not work end to end as it should.

Solution

I need to fix it so that it can complete the whole pipeline:

  1. Get all report pdfs and parse into text files
  2. Extract the safety issues, report sections etc.
  3. Embed the various datasets and load/update the database which the webapp uses.

These three steps should result in a few dataframe files, stored as pickles or something similar.
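As a rough illustration of the three steps above, each stage could be a function that writes its output as a pickle, so a later run can pick up from the cached files. This is only a sketch: the function names (`parse_reports`, `extract_sections`, `embed_records`, `run_pipeline`), the file names and the record fields are hypothetical and not taken from the engine, and plain lists of dicts stand in for dataframes to keep it dependency-free.

```python
import logging
import pickle
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("engine")


def parse_reports(_):
    # Placeholder for step 1: the real stage would read report PDFs and return their text.
    return [{"report_id": "example_1", "text": "Safety issue: example text."}]


def extract_sections(reports):
    # Placeholder for step 2: the real stage would pull out safety issues, sections, etc.
    return [{"report_id": r["report_id"], "safety_issue": r["text"]} for r in reports]


def embed_records(records):
    # Placeholder for step 3: the real stage would call an embedding model.
    return [{**r, "embedding": [float(len(r["safety_issue"]))]} for r in records]


# Hypothetical stage names; each one becomes "<name>.pkl" in the output directory.
STAGES = [
    ("parsed_reports", parse_reports),
    ("extracted", extract_sections),
    ("embedded", embed_records),
]


def run_pipeline(output_dir, refresh=False):
    """Run each stage in order, caching every stage's output as a pickle."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    data = None
    for name, stage in STAGES:
        path = output_dir / f"{name}.pkl"
        if path.exists() and not refresh:
            log.info("Loading cached output of %s from %s", name, path)
            data = pickle.loads(path.read_bytes())
        else:
            log.info("Running stage %s", name)
            data = stage(data)
            path.write_bytes(pickle.dumps(data))
    return data
```

Calling `run_pipeline("output")` twice would run the stages once and load the pickles the second time. Passing `refresh=True` recomputes everything, which is needed whenever an earlier stage changes, since this sketch does not detect stale caches.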

Then, from there, two more small parts need to be added. However, as they involve the deployment, they might be left until another issue, #172, is closer to being solved.

  - Step 0: get all the current data from the databases so that we don't constantly repeat the same work.
  - Step n+1: upload the newly calculated datasets to the database.
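One way to picture step 0 and step n+1 together: look up which report IDs are already stored, process only the missing ones, then upload just the new rows. The sketch below assumes a hypothetical report ID key and uses an in-memory dict as a stand-in for the database, since the real database interface depends on what is decided in #172.

```python
def select_new_reports(available_ids, stored_ids):
    """Return the report IDs that still need processing, preserving order."""
    stored = set(stored_ids)
    return [rid for rid in available_ids if rid not in stored]


def run_incremental(available_ids, database, process):
    """Process only reports missing from `database` and store the results."""
    # Step 0: find out what has already been computed.
    todo = select_new_reports(available_ids, database.keys())
    # Steps 1..n: run the pipeline on the remaining reports only.
    new_rows = {rid: process(rid) for rid in todo}
    # Step n+1: upload the newly calculated rows.
    database.update(new_rows)
    return todo
```

With this shape, rerunning the engine on an unchanged set of reports does no work at all, which is the point of step 0.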

Lastly, it is worth noting that this is also a chance to refactor and make running the engine smoother, with better logs.

Related issues

1jamesthompson1 commented 2 weeks ago

Given the changing capabilities of the project, I have purged a lot of older modules to keep the space fresh.

Furthermore, the structure of the project should really change.

Currently it is Gather_Wrangle -> Extract_Analyze.

Really it should be gather -> extract -> analyze.

1jamesthompson1 commented 2 weeks ago

The updating of the previous modules has been completed. They now all follow #153. However, not all follow #61.

Now an embedding class needs to be created.
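A minimal sketch of what such a class might look like, assuming it wraps some embedding function and caches vectors so the same text is never embedded twice. The names `Embedder` and `toy_embed` are hypothetical, and `toy_embed` is only a stand-in for a real model call, not the project's actual embedding method.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


class Embedder:
    """Wraps an embedding function and caches vectors by text."""

    def __init__(self, embed_fn: Callable[[Sequence[str]], List[Vector]]):
        self._embed_fn = embed_fn
        self._cache: Dict[str, Vector] = {}

    def embed(self, texts: Sequence[str]) -> List[Vector]:
        # Only send texts to the model that have not been embedded before,
        # dropping duplicates while keeping their first-seen order.
        missing = [t for t in dict.fromkeys(texts) if t not in self._cache]
        if missing:
            for text, vector in zip(missing, self._embed_fn(missing)):
                self._cache[text] = vector
        return [self._cache[t] for t in texts]


def toy_embed(texts):
    # Stand-in for a real model call: (character count, word count) as a 2-d vector.
    return [[float(len(t)), float(len(t.split()))] for t in texts]
```

Passing the embedding function in, rather than hard-coding a provider, would let the same class be used for safety issues, report sections and any other dataset that needs embedding.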

The last step of database download and upload should wait until what deployment looks like has been decided.