CorrelAidxNL / BankTrack

Collaboration between BankTrack and CorrelAid Netherlands
GNU General Public License v3.0

Bootstrap starting dataset for scraper #10

Open andrewsutjahjo opened 2 years ago

andrewsutjahjo commented 2 years ago

We need seed data for the scraper + diff-er before we can start running any part of our pipeline.

This story takes BankTrack URLs, filenames, BankTrack's document data, and our internal metadata structure (#9) as input, and outputs a populated starting {data_structure} object/instance that other people can use.
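
As a rough illustration of what the bootstrap step could look like, here is a minimal Python sketch. Everything in it is an assumption: the real {data_structure} depends on #9, so `SeedRecord`, its field names, `build_seed`, `to_json`, and the choice of JSON as output are hypothetical placeholders, and the example URL is made up.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass
class SeedRecord:
    """Hypothetical stand-in for the metadata structure to be defined in #9."""
    url: str
    filename: str
    document_data: Dict[str, Any] = field(default_factory=dict)
    metadata: Dict[str, Any] = field(default_factory=dict)


def build_seed(rows: Iterable[Dict[str, Any]]) -> List[SeedRecord]:
    """Turn raw input rows into seed records, skipping blank and duplicate URLs."""
    seen = set()
    records = []
    for row in rows:
        url = str(row.get("url", "")).strip()
        if not url or url in seen:
            continue  # nothing to scrape, or already seeded
        seen.add(url)
        records.append(
            SeedRecord(
                url=url,
                filename=row.get("filename", ""),
                document_data=dict(row.get("document_data", {})),
                metadata={"status": "seed"},  # placeholder internal metadata
            )
        )
    return records


def to_json(records: List[SeedRecord]) -> str:
    """Serialize the seed set; JSON is only one of the candidate formats."""
    return json.dumps([asdict(r) for r in records], indent=2, sort_keys=True)


if __name__ == "__main__":
    rows = [
        {"url": "https://example.org/report.pdf", "filename": "report.pdf"},
        {"url": "https://example.org/report.pdf", "filename": "duplicate.pdf"},
    ]
    print(to_json(build_seed(rows)))
```

Keeping serialization in its own function means the output format can be swapped (CSV, Parquet, a database writer) once #9 settles the storage question, without touching the record-building logic.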

SPIKE FOR THIS:

Depends on #9 to determine whether this is JSON, a flat file (Parquet?), CSV, a graph database, or a qbit stored archive.