Web scraping of other agencies

1jamesthompson1 / TAIC-report-summary

Using LLM technologies to analyze transport accident investigation reports

https://taic-document-searcher-cfdkgxgnc3bxgbeg.australiaeast-01.azurewebsites.net/

GNU General Public License v3.0


Web scraping of other agencies #253

Open 1jamesthompson1 opened 2 weeks ago

1jamesthompson1 commented 2 weeks ago

Currently the web scraping works for TAIC only. It uses a template and loops through each candidate webpage, checking whether it has a report PDF and related information.

This technique can be extended to both #254 and #252. A new class could be built that takes the template along with the logic for scraping a report webpage for the report PDF and its information.
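As a rough illustration of that idea, here is a minimal sketch of a base class that holds the template and the page-scraping logic, with one subclass per agency. Every name here (`AgencyScraper`, `ExampleAgencyScraper`, `url_template`, the `example.org` URL and the `YYYY-NNN` ID shape) is hypothetical and not taken from the repository; the "first link ending in .pdf" rule is also just an assumption about how a report page might look.

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser
from typing import Iterator, List, Optional


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


class AgencyScraper(ABC):
    """Hypothetical base class: one subclass per agency."""

    # URL template with a {report_id} placeholder (assumed shape).
    url_template: str

    @abstractmethod
    def candidate_ids(self) -> Iterator[str]:
        """Yield report IDs worth trying for this agency."""

    def report_url(self, report_id: str) -> str:
        return self.url_template.format(report_id=report_id)

    def find_report_pdf(self, html: str) -> Optional[str]:
        """Return the first link ending in .pdf, or None if there is none.

        Subclasses can override this when an agency's page needs
        different handling.
        """
        parser = _LinkCollector()
        parser.feed(html)
        for href in parser.hrefs:
            if href.lower().endswith(".pdf"):
                return href
        return None


class ExampleAgencyScraper(AgencyScraper):
    """Placeholder agency with a made-up URL and ID scheme."""

    url_template = "https://example.org/reports/{report_id}"

    def __init__(self, years, max_number: int) -> None:
        self.years = list(years)
        self.max_number = max_number

    def candidate_ids(self) -> Iterator[str]:
        for year in self.years:
            for number in range(1, self.max_number + 1):
                yield f"{year}-{number:03d}"
```

With this shape, supporting another agency would mostly mean writing a new subclass with its own template and `candidate_ids`, and overriding `find_report_pdf` only where the page layout differs.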

1jamesthompson1 commented 2 weeks ago

ATSB will be straightforward, as they use the same naming structure as TAIC. The report pages are therefore predictable, which means minimal wasted page loads.

However, TSB's naming structure is a bit more complex, which could result in quite a large search space with lots of wasted webpage loading.
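To make the difference in search space concrete, here is a small sketch that counts candidate IDs under two made-up naming schemes. The ID shapes, field names and value ranges are assumptions chosen for illustration only; they are not the agencies' actual schemes.

```python
from itertools import product
from typing import Iterable, List


def sequential_ids(years: Iterable[int], max_number: int) -> List[str]:
    """Simple scheme (assumed): year plus a running number."""
    numbers = range(1, max_number + 1)
    return [f"{year}-{number:03d}" for year, number in product(years, numbers)]


def compound_ids(
    modes: Iterable[str],
    years: Iterable[int],
    regions: Iterable[str],
    max_number: int,
) -> List[str]:
    """More complex scheme (assumed): mode, 2-digit year, region, number."""
    numbers = range(1, max_number + 1)
    return [
        f"{mode}{year % 100:02d}{region}{number:04d}"
        for mode, year, region, number in product(modes, years, regions, numbers)
    ]


# Same years and same running-number range in both cases.
simple = sequential_ids(range(2020, 2023), 50)
complex_ = compound_ids("AMRP", range(2020, 2023), "ABCDEF", 50)
print(len(simple), len(complex_))  # 150 vs 3600 candidate pages
```

Each extra field multiplies the number of pages to try, and most of those combinations will not correspond to a real report, which is where the wasted page loads come from.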

1jamesthompson1 commented 2 days ago

I am currently at the point where I have both working in theory, except for two problems:

The ATSB website doesn't load with a plain HTTP request, so I might need to use Selenium to simulate a full browser, or something similar.
TSB has such a wide search space of report IDs that some simple multithreading might speed it up considerably, or maybe I need to look into doing a Selenium search for the IDs.
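For the multithreading option, one possible shape is a thread pool that probes many candidate IDs at once, since the work is mostly waiting on the network. This is only a sketch: `probe_ids` and `fake_fetch` are hypothetical names, and the fetch function is passed in so that a real HTTP call (or a Selenium-based one) could be swapped in without changing the pool logic.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def probe_ids(
    candidate_ids: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    max_workers: int = 8,
) -> Dict[str, str]:
    """Fetch every candidate ID concurrently and keep only the hits.

    `fetch` should return the page HTML for a real report,
    or None when the ID does not exist.
    """
    ids = list(candidate_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `ids`.
        pages = pool.map(fetch, ids)
        return {rid: page for rid, page in zip(ids, pages) if page is not None}


def fake_fetch(report_id: str) -> Optional[str]:
    """Stand-in for a real request: pretends only IDs ending in '2' exist."""
    return "<html>report</html>" if report_id.endswith("2") else None


found = probe_ids(["A1", "A2", "B2", "C3"], fake_fetch, max_workers=2)
print(sorted(found))  # ['A2', 'B2']
```

A modest `max_workers` is probably wise in practice, so the scraper does not send too many requests to the agency's site at once.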