Leibniz-HBI / dabapush

Database pusher for social media data (starting with Twitter) – pre-alpha version
https://pypi.org/project/dabapush/
MIT License

Split file output by variable. #9


pekasen commented 2 years ago

So, the task at hand is to reduce all of the .ndjson files on the smo-dev-server to just one file per Facebook/Instagram account and collection. Therefore we must implement a change in dabapush: at the writing stage, e.g. in an NDJSON writer, we must know some metadata for the record we are writing. Right now there is no way to transmit this metadata.

Thus, step one should be the implementation of a class that holds a record's payload together with its metadata.
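A minimal sketch of what such a class could look like (the `Record` name and its fields are placeholders, not existing dabapush API):

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Record:
    """Wraps a single payload together with the metadata
    accumulated along the pipeline."""

    payload: Dict[str, Any]
    metadata: Dict[str, Any] = field(default_factory=dict)
```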

Step two would be the modification of the file-based writer to use e.g. a variable in the above-mentioned metadata to distribute the output of the pipeline to different files.

E.g. all of the tweets of account1 go into a file account1.ndjson, all of account2 into account2.ndjson, and so forth.
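A rough sketch of how the file-based writer could do this split (function name and signature are hypothetical, reusing the `Record` class sketched above):

```python
import json
from pathlib import Path


def write_split(records, out_dir: Path, split_key: str = "account"):
    """Append each record's payload to an NDJSON file named after
    the record's value for `split_key`, e.g. split_key="account"
    yields account1.ndjson, account2.ndjson, ..."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for record in records:
        target = out_dir / f"{record.metadata[split_key]}.ndjson"
        with target.open("a", encoding="utf-8") as file:
            file.write(json.dumps(record.payload) + "\n")
```

Opening the target file once per record keeps the sketch short; a real writer would probably cache open file handles per split value.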

Step three, again at the reading stage, we cannot emit simple Dicts anymore with just the record inside, we'll actually need to write the metadata we need into the above mentioned class. E.g. collection information on the file's path like parsing information from the path. factli stores it's files with results/${list_id}/${user_id}.ndjson, thus, we can get valied information about the list and and the user from the path.