Leibniz-HBI / dabapush

Database pusher for social media data (starting with Twitter) – pre-alpha version
https://pypi.org/project/dabapush/
MIT License

Split file output by variable. #9


pekasen commented 2 years ago

So, the task at hand is to reduce all of the .ndjson files on the smo-dev-server to just one file per Facebook/Instagram account and collection. Therefore we must implement a change in dabapush: at the writing stage, e.g. in an NDJSON writer, we must know some metadata for the record we are writing. Right now there is no way to transmit this metadata.

Thus, step one should be the implementation of a class that holds a record's payload together with its metadata.
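A minimal sketch of what such a class could look like (the `Record` name and its fields are placeholders, not existing dabapush API):

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Record:
    """Wraps a single payload together with the metadata
    accumulated along the pipeline."""

    payload: Dict[str, Any]
    metadata: Dict[str, Any] = field(default_factory=dict)
```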

Step two would be the modification of the file-based writer to use e.g. a variable in the above-mentioned metadata to distribute the output of the pipeline to different files.

E.g. all of the tweets of account1 go into a file account1.ndjson, all of account2 into account2.ndjson, and so forth.
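A rough sketch of how the file-based writer could do this split (function name and signature are hypothetical, reusing the `Record` class sketched above):

```python
import json
from pathlib import Path


def write_split(records, out_dir: Path, split_key: str = "account"):
    """Append each record's payload to an NDJSON file named after
    the record's value for `split_key`, e.g. split_key="account"
    yields account1.ndjson, account2.ndjson, ..."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for record in records:
        target = out_dir / f"{record.metadata[split_key]}.ndjson"
        with target.open("a", encoding="utf-8") as file:
            file.write(json.dumps(record.payload) + "\n")
```

Opening the target file once per record keeps the sketch short; a real writer would probably cache open file handles per split value.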

Step three, again at the reading stage, we cannot emit simple Dicts anymore with just the record inside, we'll actually need to write the metadata we need into the above mentioned class. E.g. collection information on the file's path like parsing information from the path. factli stores it's files with results/${list_id}/${user_id}.ndjson, thus, we can get valied information about the list and and the user from the path.