Leibniz-HBI / dabapush

Database pusher for social media data (Twitter to begin with) – pre-alpha version
https://pypi.org/project/dabapush/
MIT License

Archival Pipelines #3

Open pekasen opened 2 years ago

pekasen commented 2 years ago

As of yet, dabapush initializes pipelines solely by the readers' and writers' names: a call like dabapush run default looks for a reader named 'default' and a writer named 'default'. The reader extracts all records, according to its programming, from the specified path and glob pattern and passes these records to the writer.
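For illustration, a minimal sketch of that name-based wiring; the class names and configuration keys here are invented stand-ins, not dabapush's actual internals:

```python
from pathlib import Path
from typing import Iterator

# Hypothetical stand-ins for the reader/writer roles as they work today;
# names and config keys are illustrative only.

class NDJSONReader:
    """Discovers files via path/glob *and* extracts records -- both jobs in one class."""

    def __init__(self, path: str, pattern: str) -> None:
        self.path, self.pattern = path, pattern

    def read(self) -> Iterator[str]:
        for file in Path(self.path).glob(self.pattern):
            yield from file.read_text().splitlines()

class PrintWriter:
    def write(self, record: str) -> None:
        print(record)

def run(name: str, config: dict) -> None:
    """Look up a reader and a writer under the same name and wire them together."""
    reader = NDJSONReader(**config["readers"][name])
    writer = PrintWriter(**config["writers"][name])
    for record in reader.read():
        writer.write(record)

run("default", {
    "readers": {"default": {"path": "data", "pattern": "*.ndjson"}},
    "writers": {"default": {}},
})
```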

This hinders archival pipelines: in an archival pipeline we want to depend on the outcome of another pipeline, e.g. we want to archive all the files that have been successfully read by dabapush. Therefore, the input to this pipeline would not be a path/glob-pattern pair but rather the files logged by the already finished pipeline.

Giving the reader that functionality seems a bit spaghetti-like: it would overload the class with responsibilities unrelated to its actual job, which is reading and processing files into records that the writer objects can process further.

The cleanest solution would be to extend the pipelines with a third object type, e.g. named Attacher. It would take over the responsibility of discovering and opening files for the reader, and through inheritance we could design multiple, different Attachers: one that reads files from disk by means of a path and glob pattern, one that reads the log and filters for files from specific, already finished pipelines, or even ones that read remote files from S3 or SFTP.
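A rough sketch of what such a hierarchy could look like; all names are hypothetical and the log-based variant is only a stub to show the shape of the inheritance:

```python
import abc
from pathlib import Path
from typing import IO, Iterator

class Attacher(abc.ABC):
    """Decides which files a pipeline operates on and opens them for the reader."""

    @abc.abstractmethod
    def attach(self) -> Iterator[IO]:
        ...

class GlobAttacher(Attacher):
    """Discovers files on disk via a path and glob pattern."""

    def __init__(self, path: str, pattern: str) -> None:
        self.path, self.pattern = path, pattern

    def attach(self) -> Iterator[IO]:
        for file in Path(self.path).glob(self.pattern):
            yield file.open("r")

class LogAttacher(Attacher):
    """Re-opens files that an already finished pipeline logged as successfully read."""

    def __init__(self, log_path: str, pipeline: str) -> None:
        self.log_path, self.pipeline = log_path, pipeline

    def attach(self) -> Iterator[IO]:
        # assumes a simple tab-separated log of "pipeline name <TAB> file path" entries
        for line in Path(self.log_path).read_text().splitlines():
            name, path = line.split("\t", 1)
            if name == self.pipeline:
                yield Path(path).open("rb")
```

An S3 or SFTP attacher would follow the same pattern, only with a remote client instead of pathlib.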

Thus, a pipeline would include at least three objects: an attacher that decides which files to open, a reader that extracts meaningful records from these files, and a writer that persists these records. Initializing these three-piece pipelines can still be achieved by name only, so no changes to the structure of the configuration file format are necessary, although some fields must move from the reader configuration to an attacher configuration.
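As a hypothetical before/after of that move, written as plain Python dicts since the actual configuration keys may differ:

```python
# Before: file discovery lives in the reader configuration.
config_before = {
    "readers": {"default": {"type": "ndjson", "path": "data/", "pattern": "*.ndjson"}},
    "writers": {"default": {"type": "csv"}},
}

# After: path/pattern move into a new attacher section; the reader only parses records.
config_after = {
    "attachers": {"default": {"type": "glob", "path": "data/", "pattern": "*.ndjson"}},
    "readers": {"default": {"type": "ndjson"}},
    "writers": {"default": {"type": "csv"}},
}
```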

In summary of the new pipeline features:

FlxVctr commented 2 years ago

Couldn't an archiver be part of the generic writer class and simply switched on/off at instance creation (archiving=True/False)?

Edit: aah, I get it, you need all the information about the raw data from the reader. Right.

FlxVctr commented 2 years ago

Another idea: why not have a 'Pipeline' class that contains reader and writer and therefore all necessary information? This could then have a property indicating whether it's archiving or not. It gets the info about what to archive from the writer and how to archive from the reader.
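A loose sketch of that idea, with invented Reader/Writer interfaces; none of these names come from dabapush itself, and where the archiving details come from is deliberately left open:

```python
from typing import Iterable, Protocol

# Hypothetical interfaces; dabapush's real reader/writer classes may differ.
class Reader(Protocol):
    def read(self) -> Iterable[dict]: ...

class Writer(Protocol):
    def write(self, record: dict) -> None: ...

class Pipeline:
    """Bundles a reader and a writer, plus a flag that says whether to archive."""

    def __init__(self, reader: Reader, writer: Writer, archiving: bool = False) -> None:
        self.reader = reader
        self.writer = writer
        self.archiving = archiving

    def run(self) -> None:
        for record in self.reader.read():
            self.writer.write(record)
        if self.archiving:
            self.archive()

    def archive(self) -> None:
        # Placeholder: collect the consumed raw files (from reader and/or writer,
        # as discussed above) and move them to an archive location.
        pass
```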

FlxVctr commented 2 years ago

But I think I am a bit lost. A basic architecture diagram of how it works now and how it's supposed to work in your proposal would be helpful.