ML4GW / pinto

Job environment management and execution tool

Pass outputs from components of Pipeline #17

Open EthanMarx opened 2 years ago

EthanMarx commented 2 years ago

Is it currently (or in general) possible for pinto to pass outputs from one component of a pipeline to the next component?

Using the BBHNet data generation pipeline as an example: in the background generation we currently pass a start, stop, and minimum_length. The script then searches between start and stop for the first continuous science segment of at least minimum_length. It would be nice to then pass the start and stop times of this segment to the glitch generation script.
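To make the search concrete, here is a minimal sketch of the "first continuous segment of minimum_length" logic. The function name and the hard-coded segment list are hypothetical; the real script would obtain science segments from a data source rather than a literal list.

```python
from typing import Iterable, Optional, Tuple

Segment = Tuple[float, float]


def first_long_segment(
    segments: Iterable[Segment],
    start: float,
    stop: float,
    minimum_length: float,
) -> Optional[Segment]:
    """Return the first segment, clipped to [start, stop], that is long enough."""
    for seg_start, seg_stop in sorted(segments):
        # Clip each science segment to the requested search window
        lo = max(seg_start, start)
        hi = min(seg_stop, stop)
        if hi - lo >= minimum_length:
            return lo, hi
    return None


# Hypothetical science segments, purely for illustration
science = [(0, 50), (80, 400), (500, 2000)]
found = first_long_segment(science, start=0, stop=1000, minimum_length=200)
```

The returned pair is exactly the value that the glitch generation script would want to receive.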

In this scenario we can of course determine the segment a priori and pass it manually to both scripts, but it's something to think about down the line.

alecgunny commented 2 years ago

This is an interesting premise, and something I've thought about as well. In principle the most straightforward way would be to use some file in a shared output directory, but for use cases like the one you describe that's certainly not the most elegant way of doing it. Probably worth raising as a feature request in the Pinto repo and we can discuss more about what the syntax might look like there.
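For reference, the shared-file approach might look roughly like the sketch below. The file name and both function names are hypothetical stand-ins for the two components; pinto itself defines no such convention.

```python
import json
import tempfile
from pathlib import Path
from typing import Tuple

# Hypothetical file name agreed on by both components
HANDOFF_FILE = "segment.json"


def generate_background(outdir: Path, start: float, stop: float) -> None:
    """Stand-in for the first component: record the segment it settled on."""
    payload = {"start": start, "stop": stop}
    (outdir / HANDOFF_FILE).write_text(json.dumps(payload))


def generate_glitches(outdir: Path) -> Tuple[float, float]:
    """Stand-in for the second component: read the segment back."""
    payload = json.loads((outdir / HANDOFF_FILE).read_text())
    return payload["start"], payload["stop"]


# A temporary directory plays the role of the shared output directory
with tempfile.TemporaryDirectory() as tmp:
    outdir = Path(tmp)
    generate_background(outdir, 1262304018.0, 1262308018.0)
    segment = generate_glitches(outdir)
```

This works without any change to pinto, at the cost of both scripts having to agree on the file name and format.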

alecgunny commented 2 years ago

I am now realizing that this issue is already filed in the Pinto repo and not BBHNet like I thought because I was being lazy and not paying attention. This is a great idea, curious to hear your thoughts on what an implementation might look like.

The thing to keep in mind is that pinto doesn't actually call these functions under the hood. Instead, it leverages the Python interfaces of Poetry and Conda's command-line tools, which essentially launch scripts as subprocesses. So any variables from those scripts never get passed back to pinto, and in fact never even exist in the same process. The only insight pinto has into what's going on in them is through their stdout and stderr streams, so the scripts themselves would need to be equipped with some mechanism to log information for the next component, in a form that pinto would know to look for.
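A minimal sketch of what a stdout-based mechanism could look like, using plain `subprocess` rather than the Poetry or Conda interfaces. The marker string and the inline child script are assumptions made up for illustration, not anything pinto currently recognizes.

```python
import json
import subprocess
import sys

# Hypothetical tag that marks a stdout line as carrying outputs
MARKER = "PIPELINE_OUTPUT::"

# Stand-in for a component script: ordinary logging plus one tagged line
CHILD_SCRIPT = """
import json
print("searching for a science segment...")
print("PIPELINE_OUTPUT::" + json.dumps({"start": 100, "stop": 200}))
"""


def run_and_collect(script: str) -> dict:
    """Run a script in a subprocess and parse any tagged lines from its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
        check=True,
    )
    outputs = {}
    for line in result.stdout.splitlines():
        # Untagged lines are ordinary log output and are ignored
        if line.startswith(MARKER):
            outputs.update(json.loads(line[len(MARKER):]))
    return outputs


outputs = run_and_collect(CHILD_SCRIPT)
```

The parent only ever sees text, so anything passed this way has to be serializable, here as JSON.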

One other mechanism for passing information between components might be environment variables, since you can reference these in the pinto config, but each component's config arguments (including the referenced environment variables) don't get set until the component is run. So in principle, the previous component could set the environment variable to be whatever it likes. That said, I'm not sure (in fact I'm doubtful) that setting os.environ propagates up to a calling process, so there might need to be a more complex mechanism in place here.
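A quick check of that doubt, again with plain `subprocess` standing in for however the component is launched. The variable name is hypothetical. The child gets a copy of the parent's environment, so a change made in the child stays in the child.

```python
import os
import subprocess
import sys

# Hypothetical variable name; make sure the parent starts without it
VAR = "SEGMENT_START"
os.environ.pop(VAR, None)

# The child sets the variable in its own environment and echoes it back
child = f"import os; os.environ[{VAR!r}] = '100'; print(os.environ[{VAR!r}])"
result = subprocess.run(
    [sys.executable, "-c", child],
    capture_output=True,
    text=True,
    check=True,
)

child_value = result.stdout.strip()  # what the child saw after setting it
parent_value = os.environ.get(VAR)   # what the parent sees afterwards
```

Here `parent_value` stays `None`, so a component cannot hand a value back to its caller this way without some extra channel.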

While I'm not sure which mechanism will make communication between components possible, one way of implementing a mechanism that would abstract some of the details from each component could be to make a function decorator that wraps up typeo and implements the communication mechanism on the outputs of the function. So all the function needs to do is return the values it wants to pass forward, and the decorator will handle the rest.
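As a rough sketch of the decorator idea: the version below does not wrap typeo at all (that composition is left out), and the decorator name, marker string, and placeholder function body are all hypothetical. It assumes the communication mechanism is a tagged JSON line on stdout.

```python
import functools
import io
import json
from contextlib import redirect_stdout

# Hypothetical tag that a parent process could scan stdout for
MARKER = "PIPELINE_OUTPUT::"


def forwards_outputs(f):
    """Emit the wrapped function's return value as a tagged JSON line."""

    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        result = f(*args, **kwargs)
        if result is not None:
            print(MARKER + json.dumps(result))
        return result

    return wrapper


@forwards_outputs
def generate_background(start, stop, minimum_length):
    # Placeholder logic: pretend the first long-enough segment begins at `start`
    return {"start": start, "stop": start + minimum_length}


# Capture stdout here only to show what a parent process would see
buffer = io.StringIO()
with redirect_stdout(buffer):
    generate_background(0, 1000, 100)
emitted = buffer.getvalue().strip()
```

The function body stays unaware of the transport, so swapping stdout for a file or some other channel would only touch the decorator.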