Open mdietze opened 1 year ago
met.process is such a Swiss Army knife that I agree it would be useful as a standalone, cloud-compatible tool (a reprise of the browndog functionality?). From that perspective, I would be in favor of separating the database portions from the general steps (download, standardize, extract, gapfill, convert), with a wrapper that receives the updates that need to be made to the BETY inputs and file path records. That may make it easier to debug too.
@mdietze Would you be able to share the "steps to reproduce", i.e. a way to configure PEcAn so that it launches a workflow that could be made more distributed, and the "acceptance criteria" for a first deliverable for this issue?
This issue is stale because it has been open 365 days with no activity.
Description
Currently, the processing of input data for the PEcAn workflow is done sequentially at run time. To move toward a more cloud-based workflow that is asynchronous, distributed, and event-driven, I propose that we start with met.process as an initial test case.
Proposed Solution
Put either just met.process, or all of do.conversions, within its own container with its own message queue. The message queue would need to pass in the relevant portion of the settings: which met data source, what site (name, lat, lon) or vector of sites, what date range, which model's file format is the target, etc.
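To make the message contents concrete, here is a minimal sketch of what such a payload could look like when serialized as JSON. All class and field names (`MetRequest`, `Site`, `start_date`, `end_date`, `model_format`) are hypothetical and not taken from PEcAn, and the values in the usage note are placeholders. The sketch only covers encoding and decoding, not the queue itself.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class Site:
    # Hypothetical site description: name plus coordinates.
    name: str
    lat: float
    lon: float


@dataclass
class MetRequest:
    # Hypothetical subset of the settings a met worker would need.
    source: str          # which met data source to use
    sites: List[Site]    # a single site or a vector of sites
    start_date: str      # ISO 8601 date, assumed format
    end_date: str        # ISO 8601 date, assumed format
    model_format: str    # target model's file format


def encode_request(req: MetRequest) -> str:
    """Serialize a request to a JSON string suitable for a queue message body."""
    return json.dumps(asdict(req), sort_keys=True)


def decode_request(body: str) -> MetRequest:
    """Rebuild a request from a JSON message body."""
    raw = json.loads(body)
    sites = [Site(**site) for site in raw.pop("sites")]
    return MetRequest(sites=sites, **raw)
```

A worker would call `decode_request` on each incoming message body, and the workflow side would call `encode_request` before publishing. For example, `encode_request(MetRequest("ExampleSource", [Site("site_a", 45.0, -90.0)], "2004-01-01", "2004-12-31", "ExampleModel"))` yields a JSON string that round-trips through `decode_request`.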
Some issues to consider:
met.process currently does a lot of talking to the BETY database. Do we want to continue to support this?
If so, do we want to give the met.process containers access to BETY or do we want to have a single bit of code (e.g. in the workflow container) responsible for all database i/o? Do we want to use this task as an excuse/opportunity to reduce the dependence on BETY by logging less in the database?
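One way to picture the "single bit of code responsible for all database i/o" option is sketched below: workers return plain records describing the files they produced, and one writer object applies them. This is an illustration under assumptions, not PEcAn's design. `InputRecord`, `run_conversion`, and `RecordWriter` are hypothetical names, and the write callback stands in for whatever would actually talk to BETY.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class InputRecord:
    # Hypothetical description of one produced file, to be logged later.
    site_name: str
    file_path: str
    format_name: str


def run_conversion(site_name: str, out_dir: str) -> List[InputRecord]:
    """Stand-in for a met processing step: does no database I/O,
    it only reports which files it (would have) produced."""
    return [InputRecord(site_name, f"{out_dir}/{site_name}.nc", "CF")]


class RecordWriter:
    """The only component that touches the database in this sketch."""

    def __init__(self, write: Callable[[InputRecord], None]) -> None:
        # `write` is an injected callback; in practice it would wrap the
        # real database insert, which is not shown here.
        self._write = write

    def apply(self, records: List[InputRecord]) -> int:
        """Persist every record and return how many were written."""
        for record in records:
            self._write(record)
        return len(records)
```

With this split, the met containers would need no database credentials, and the writer could be tested with an in-memory list, e.g. `RecordWriter(saved.append).apply(run_conversion("site_a", "/tmp/met"))`.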
Database communication is currently managed by convert.inputs, which already has the option to run a conversion step locally or on an HPC. This would in some ways be the "easiest" place to insert a RabbitMQ + Docker option, but it might require either putting each met operator (there are dozens) in its own container or creating a general container that holds all of them (meaning that the message would also need to say which operator to apply). The latter seems easier to implement and maintain but gives us less granular control over scaling.
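The "general container" option amounts to dispatching on an operator name carried in the message. A minimal sketch of that pattern follows. The registry, the `operator` decorator, the message keys (`"operator"`, `"params"`), and the two toy operators are all hypothetical and only illustrate the routing, not any real met operator.

```python
from typing import Callable, Dict

# Registry mapping operator names to functions; filled in by the decorator.
OPERATORS: Dict[str, Callable[[dict], dict]] = {}


def operator(name: str):
    """Register a function under the name that messages will refer to."""
    def register(func: Callable[[dict], dict]) -> Callable[[dict], dict]:
        OPERATORS[name] = func
        return func
    return register


@operator("download")
def download(params: dict) -> dict:
    # Placeholder: a real operator would fetch data here.
    return {"step": "download", "source": params["source"]}


@operator("gapfill")
def gapfill(params: dict) -> dict:
    # Placeholder: a real operator would fill gaps in the met record here.
    return {"step": "gapfill", "source": params["source"]}


def handle_message(message: dict) -> dict:
    """Route one decoded queue message to the operator it names."""
    name = message["operator"]
    if name not in OPERATORS:
        raise ValueError(f"unknown operator: {name}")
    return OPERATORS[name](message.get("params", {}))
```

The trade-off noted above shows up directly: one image and one queue consumer serve every operator, but all operators scale together because they share the same container.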
In general, met.process has the following steps
Relevant bits of code to look at: