Closed VincentVerelst closed 2 months ago
Abstracting it in a separate class is the way to go. I am however wondering if we should keep the abstraction where the full dataframe is always loaded, or if the 'storage tracker' can offer more granular functionality?
For instance, methods like (please improve my bad naming):

- get_jobs_by_status(max_jobs, status: List[str]): returns jobs for the given status list
- update_job_metadata(job_row): writes updated job state back to the database
- commit(): persists modifications to file or database
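The granular interface proposed above could be sketched roughly as follows. This is only an illustration of the idea, not the actual API: the class and method names (`JobDatabaseInterface`, `InMemoryJobDatabase`) are placeholders, and plain dicts stand in for dataframe rows to keep the sketch self-contained.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class JobDatabaseInterface(ABC):
    """Hypothetical granular storage interface (names are placeholders)."""

    @abstractmethod
    def get_jobs_by_status(self, max_jobs: int, status: List[str]) -> List[Dict]:
        """Return up to max_jobs job rows whose status is in the given list."""

    @abstractmethod
    def update_job_metadata(self, job_row: Dict) -> None:
        """Write an updated job row back to the store."""

    @abstractmethod
    def commit(self) -> None:
        """Persist pending modifications to file or database."""


class InMemoryJobDatabase(JobDatabaseInterface):
    """Trivial in-memory implementation, keyed by a 'job_id' field."""

    def __init__(self, rows: List[Dict]):
        self._rows = {r["job_id"]: dict(r) for r in rows}
        self.commits = 0

    def get_jobs_by_status(self, max_jobs: int, status: List[str]) -> List[Dict]:
        hits = [r for r in self._rows.values() if r["status"] in status]
        return hits[:max_jobs]

    def update_job_metadata(self, job_row: Dict) -> None:
        self._rows[job_row["job_id"]] = dict(job_row)

    def commit(self) -> None:
        # A real implementation would write to CSV/Parquet/DB here.
        self.commits += 1
```

The point of this shape is that the job manager never needs the full job table in memory: it asks for just the rows it can act on, pushes updates back row by row, and batches persistence behind commit().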
> I am however wondering if we should keep the abstraction where the full dataframe is always loaded,
Indeed, but that involves a bigger refactor (or maybe even a full rewrite) of the multi job manager, I guess. In this first PR we want to take a first step by abstracting out just the storage aspect, without fiddling too much with the rest.
@soxofaan yes, fine to keep this one limited and then move on to the bigger story. I mainly wanted to flag it to avoid that we put a lot of work into polishing this one, only to replace it with something else afterwards.
@soxofaan, as discussed earlier, I've abstracted away the read/write helper for CSV and Parquet files and allowed the user to supply a custom read/write helper (with an eye toward adding support for URLs, for which a user might need to provide a custom request method). An interim review from you would be helpful before I continue with URL support.
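The pluggable read/write helper described above could look something like this. Everything here is a sketch under assumptions: `ReadWriteHelper`, `CsvHelper`, and `UrlHelper` are illustrative names (not the merged API), the stdlib csv module stands in for pandas to keep the example self-contained, and `request_fn` is a hypothetical user-supplied callable for the URL case.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, List, Protocol


class ReadWriteHelper(Protocol):
    """Illustrative shape of a pluggable read/write helper."""

    def read(self, target) -> List[Dict]: ...
    def write(self, rows: List[Dict], target) -> None: ...


class CsvHelper:
    """Default helper: plain CSV on a local path."""

    def read(self, target: Path) -> List[Dict]:
        with open(target, newline="") as f:
            return list(csv.DictReader(f))

    def write(self, rows: List[Dict], target: Path) -> None:
        with open(target, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)


class UrlHelper:
    """Hypothetical custom helper: the user injects their own request
    function (e.g. one that adds auth headers) to fetch CSV from a URL."""

    def __init__(self, request_fn: Callable[[str, str], str]):
        self._request = request_fn  # e.g. (method, url) -> response body

    def read(self, target: str) -> List[Dict]:
        body = self._request("GET", target)
        return list(csv.DictReader(body.splitlines()))

    def write(self, rows: List[Dict], target: str) -> None:
        raise NotImplementedError("depends on the backing service")
```

The storage class only talks to the `read`/`write` pair, so swapping CSV for Parquet (or for a URL-backed helper) does not touch the job-manager logic.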
FYI: I already pushed some tweaks of my own to address my latest notes
@soxofaan, for issue #571 I created a JobTrackerStorage to manage everything job-tracker related in the MultiBackendJobManager. For now it only supports storing as CSV to a PosixPath, so no extra functionality has been added. Would be great to have your feedback on this already, before adding functionality for storing as Parquet and storing to a URL as well.
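The CSV-to-PosixPath storage described above (load the full job table, hand it to the job manager, persist it back) could be sketched like this. This is not the actual class from the PR: the name `JobTrackerStorage` matches the comment, but the `load`/`persist` methods are illustrative, and the stdlib csv module stands in for pandas so the sketch stays self-contained.

```python
import csv
from pathlib import Path
from typing import Dict, List


class JobTrackerStorage:
    """Illustrative sketch: full-table CSV storage on a local path."""

    def __init__(self, path: Path):
        self._path = Path(path)

    def load(self) -> List[Dict]:
        """Read the whole job table; empty list if nothing stored yet."""
        if not self._path.exists():
            return []
        with open(self._path, newline="") as f:
            return list(csv.DictReader(f))

    def persist(self, rows: List[Dict]) -> None:
        """Write the whole job table back to disk."""
        with open(self._path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
```

This keeps the "full dataframe is always loaded" behaviour of the current job manager intact while isolating the storage aspect, which is exactly the limited first step agreed on above.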