MediaUncovered / NewsAnalysis

use word embeddings to uncover bias in newspapers

data should be loaded in an efficient and dependable way #16

Closed todorus closed 6 years ago

todorus commented 6 years ago

Download: Data should be downloaded from the database onto local disk. The download should be able to resume in case of connection failure.

Load: Data should be loadable from local disk.

Both actions should happen in a streaming manner, to be able to handle big volumes of data

todorus commented 6 years ago

A CSV file would be a good candidate for the data format, as it can easily be appended to and is well supported by Python.

Resuming downloads can be achieved by ordering the database query by id. This way the last line of the CSV file will always contain the row with the highest id that was successfully downloaded.
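A minimal sketch of that resume idea, not the project's actual code. It assumes a SQLite connection, a hypothetical table `articles` with columns `id` and `text`, and the id stored in the first CSV column. The function names `last_downloaded_id` and `download` are made up for illustration.

```python
import csv
import os
import sqlite3


def last_downloaded_id(csv_path):
    """Return the id in the last row of the CSV, or 0 if the file is missing or empty."""
    if not os.path.exists(csv_path):
        return 0
    last_id = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Stream through the file row by row; only the last id is kept in memory.
        for row in csv.reader(f):
            if row:
                last_id = int(row[0])
    return last_id


def download(conn, csv_path, batch_size=1000):
    """Append all rows with an id above the last saved id, in id order."""
    start_id = last_downloaded_id(csv_path)
    # Hypothetical table and column names; adjust to the real schema.
    cursor = conn.execute(
        "SELECT id, text FROM articles WHERE id > ? ORDER BY id", (start_id,)
    )
    written = 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        while True:
            # Fetch in batches so the full result set never sits in memory.
            rows = cursor.fetchmany(batch_size)
            if not rows:
                break
            writer.writerows(rows)
            written += len(rows)
    return written
```

Calling `download` again after a dropped connection would continue from the highest id already on disk. This sketch does not handle a half-written last line, and it rescans the whole file to find the last id, so a real implementation would probably want to deal with both.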

todorus commented 6 years ago

@Tilana you said the loading from disk part should be done using a generator, right?

Tilana commented 6 years ago

Either an iterator object or a txt file with one sentence in each line would work
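One way the iterator option could look, as a rough sketch. It assumes the CSV layout from earlier in the thread with the text in the second column, and it uses naive lowercasing and whitespace splitting as a stand-in for real tokenization. The class name `SentenceStream` is hypothetical.

```python
import csv


class SentenceStream:
    """Iterable that re-reads the file on every pass, yielding one tokenized row at a time."""

    def __init__(self, csv_path, text_column=1):
        self.csv_path = csv_path
        self.text_column = text_column  # assumed position of the text field

    def __iter__(self):
        # Opening the file here (not in __init__) makes the object restartable,
        # so code that needs several passes over the data can loop over it again.
        with open(self.csv_path, newline="", encoding="utf-8") as f:
            for row in csv.reader(f):
                if len(row) > self.text_column:
                    yield row[self.text_column].lower().split()
```

A plain generator function would be exhausted after one pass, which is why the sketch wraps the generator in a class with `__iter__`.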

Tilana commented 6 years ago

@todorus in Python, pandas is great for storing data from a database. From there the data can easily be written to a CSV file. To save and load data, a pickle file might be handy, as it should be faster than a CSV file, and it's also possible to create an iterator object from it.
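To illustrate the "iterator from a pickle file" part, here is a small sketch that uses only the standard-library `pickle` module instead of pandas. The helper names `dump_records` and `iter_records` are hypothetical, and whether this is actually faster than CSV for this data would have to be measured.

```python
import pickle


def dump_records(records, pickle_path):
    """Write each record as its own pickle frame so the file can be read back lazily."""
    with open(pickle_path, "wb") as f:
        for record in records:
            pickle.dump(record, f, protocol=pickle.HIGHEST_PROTOCOL)


def iter_records(pickle_path):
    """Yield records one by one until the end of the file is reached."""
    with open(pickle_path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                # pickle.load raises EOFError once no more frames are left.
                return
```

Pickling the records one at a time, instead of pickling one big list, is what keeps loading streamable. Pickle files should only be loaded from trusted sources.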