CorrelAidxNL / BankTrack

Collaboration between BankTrack and CorrelAid Netherlands
GNU General Public License v3.0

Find HTML page linking to the pdf #21

Open fdabl opened 2 years ago

fdabl commented 2 years ago

The goal is to find the HTML page that links to a given pdf found on BankTrack's website. One approach could be to search (e.g. with Google) for the pdf name, take the first n results, and check each of those pages for a link to the pdf. Another approach would be to scrape the entire bank website. Note, however, that BankTrack sometimes does not save the “correct” pdf name; keywords could be used to create good heuristics for matching.

It is best to start with the first approach, or a similarly simple one. This may be a tough one!
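
As a rough illustration of the "check each candidate page for a link to the pdf" step, here is a minimal sketch using only the Python standard library. It assumes the candidate pages have already been fetched (the search and download steps are left out), and all names (`LinkCollector`, `pdf_links`, `best_pdf_match`) as well as the `0.8` similarity threshold are hypothetical choices, not part of the project. Because the stored pdf name may not be the "correct" one, it compares filenames fuzzily rather than exactly; a keyword-based heuristic could replace that comparison.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser
from posixpath import basename
from urllib.parse import unquote, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def pdf_links(html, page_url):
    """Return absolute URLs of all links on the page whose path ends in .pdf."""
    collector = LinkCollector()
    collector.feed(html)
    absolute = (urljoin(page_url, href) for href in collector.hrefs)
    return [u for u in absolute if urlparse(u).path.lower().endswith(".pdf")]


def best_pdf_match(html, page_url, pdf_name, threshold=0.8):
    """Return (url, score) for the pdf link most similar to pdf_name, or None.

    threshold is an assumed cut-off on difflib's similarity ratio (0 to 1).
    """
    target = pdf_name.lower()
    best = None
    for url in pdf_links(html, page_url):
        name = unquote(basename(urlparse(url).path)).lower()
        score = SequenceMatcher(None, target, name).ratio()
        if score >= threshold and (best is None or score > best[1]):
            best = (url, score)
    return best
```

A candidate page would then count as a hit when `best_pdf_match(page_html, page_url, stored_pdf_name)` returns something other than `None`; running this over the first n search results and keeping the highest score is one way to pick the linking page.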