Leibniz-HBI / dabapush

Database pusher for social media data (Twitter to begin with) – pre-alpha version
https://pypi.org/project/dabapush/
MIT License

Archival Pipelines #3

Open pekasen opened 2 years ago

pekasen commented 2 years ago

As of yet, dabapush initializes pipelines solely by the readers' and writers' names: a call like dabapush run default looks for a reader named 'default' and a writer named 'default'. The reader extracts all records, according to its programming, from the specified path and glob pattern and passes these records to the writer.
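For illustration, a minimal sketch of that name-based wiring; the class names and configuration keys here are invented stand-ins, not dabapush's actual internals:

```python
from pathlib import Path
from typing import Iterator

# Hypothetical stand-ins for the reader/writer roles as they work today;
# names and config keys are illustrative only.

class NDJSONReader:
    """Discovers files via path/glob *and* extracts records -- both jobs in one class."""

    def __init__(self, path: str, pattern: str) -> None:
        self.path, self.pattern = path, pattern

    def read(self) -> Iterator[str]:
        for file in Path(self.path).glob(self.pattern):
            yield from file.read_text().splitlines()

class PrintWriter:
    def write(self, record: str) -> None:
        print(record)

def run(name: str, config: dict) -> None:
    """Look up a reader and a writer under the same name and wire them together."""
    reader = NDJSONReader(**config["readers"][name])
    writer = PrintWriter(**config["writers"][name])
    for record in reader.read():
        writer.write(record)

run("default", {
    "readers": {"default": {"path": "data", "pattern": "*.ndjson"}},
    "writers": {"default": {}},
})
```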

This hinders archival pipelines: in an archival pipeline we want to depend on the outcome of another pipeline, e.g. we want to archive all the files that have been successfully read by dabapush. Therefore, the input to this pipeline would not be a path/glob-pattern pair but rather the files logged by the already finished pipeline.

Giving the reader that functionality seems a bit spaghetti-like: it would overload the class with responsibilities unrelated to its actual job, which is reading and processing files into records that the writer objects can process further.

The cleanest solution would be to extend the pipelines with a third object type, e.g. named Attacher. It would take over the responsibility of discovering and opening files for the reader, and through inheritance we could design multiple, different Attachers: one that reads files from disk by means of a path and glob pattern, one that reads the log and filters for files from specific, already finished pipelines, or even ones that read remote files from S3 or SFTP.
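A rough sketch of what such a hierarchy could look like; all names are hypothetical and the log-based variant is only a stub to show the shape of the inheritance:

```python
import abc
from pathlib import Path
from typing import IO, Iterator

class Attacher(abc.ABC):
    """Decides which files a pipeline operates on and opens them for the reader."""

    @abc.abstractmethod
    def attach(self) -> Iterator[IO]:
        ...

class GlobAttacher(Attacher):
    """Discovers files on disk via a path and glob pattern."""

    def __init__(self, path: str, pattern: str) -> None:
        self.path, self.pattern = path, pattern

    def attach(self) -> Iterator[IO]:
        for file in Path(self.path).glob(self.pattern):
            yield file.open("r")

class LogAttacher(Attacher):
    """Re-opens files that an already finished pipeline logged as successfully read."""

    def __init__(self, log_path: str, pipeline: str) -> None:
        self.log_path, self.pipeline = log_path, pipeline

    def attach(self) -> Iterator[IO]:
        # assumes a simple tab-separated log of "pipeline name <TAB> file path" entries
        for line in Path(self.log_path).read_text().splitlines():
            name, path = line.split("\t", 1)
            if name == self.pipeline:
                yield Path(path).open("rb")
```

An S3 or SFTP attacher would follow the same pattern, only with a remote client instead of pathlib.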

Thus, a pipeline would include at least three objects: an attacher that decides which files to open, a reader that extracts meaningful records from these files, and a writer that persists these records. Initializing these three-piece pipelines can still be achieved by name only, so no changes to the structure of the configuration file format are necessary, although some fields must move from the reader configuration to an attacher configuration.
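As a hypothetical before/after of that move, written as plain Python dicts since the actual configuration keys may differ:

```python
# Before: file discovery lives in the reader configuration.
config_before = {
    "readers": {"default": {"type": "ndjson", "path": "data/", "pattern": "*.ndjson"}},
    "writers": {"default": {"type": "csv"}},
}

# After: path/pattern move into a new attacher section; the reader only parses records.
config_after = {
    "attachers": {"default": {"type": "glob", "path": "data/", "pattern": "*.ndjson"}},
    "readers": {"default": {"type": "ndjson"}},
    "writers": {"default": {"type": "csv"}},
}
```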

In summary of the new pipeline features:

FlxVctr commented 2 years ago

Couldn't an archiver be part of the generic writer class and simply switched on/off at instance creation (archiving=True/False)?

Edit: aah, I get it, you need all the information about the raw data from the reader. Right.

FlxVctr commented 2 years ago

Another idea: why not have a 'Pipeline' class that contains reader and writer and therefore all necessary information? This could then have a property indicating whether it's archiving or not. It gets the info about what to archive from the writer and how to archive from the reader.
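A loose sketch of that idea, with invented Reader/Writer interfaces; none of these names come from dabapush itself, and where the archiving details come from is deliberately left open:

```python
from typing import Iterable, Protocol

# Hypothetical interfaces; dabapush's real reader/writer classes may differ.
class Reader(Protocol):
    def read(self) -> Iterable[dict]: ...

class Writer(Protocol):
    def write(self, record: dict) -> None: ...

class Pipeline:
    """Bundles a reader and a writer, plus a flag that says whether to archive."""

    def __init__(self, reader: Reader, writer: Writer, archiving: bool = False) -> None:
        self.reader = reader
        self.writer = writer
        self.archiving = archiving

    def run(self) -> None:
        for record in self.reader.read():
            self.writer.write(record)
        if self.archiving:
            self.archive()

    def archive(self) -> None:
        # Placeholder: collect the consumed raw files (from reader and/or writer,
        # as discussed above) and move them to an archive location.
        pass
```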

FlxVctr commented 2 years ago

But I think I am a bit lost. A basic architecture diagram of how it works now and how it's supposed to work in your proposal would be helpful.