cooperative-computing-lab / cctools

The Cooperative Computing Tools (cctools) enable large scale distributed computations to harness hundreds to thousands of machines from clusters, clouds, and grids.
http://ccl.cse.nd.edu

Allowing edits to the makeflow file "on the go" #1873

Open stemangiola opened 6 years ago

stemangiola commented 6 years ago

Hello,

While enjoying Makeflow, I find myself thinking more and more about the limitations caused by the lack of this feature.

When developing a pipeline, it is often necessary to insert a new algorithm, or a new combination of parameters for an existing algorithm, into an already executed pipeline. In my case, for example, I test an algorithm over a grid of parameters:

a = 1:100
b = 1:10
c = "mode1", "mode2", "mode3"

100 × 10 × 3 = 3000 combinations

It is pretty common, after having executed the whole pipeline, to want to also test b = 11.
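For concreteness, a minimal Python sketch of generating such a grid as Makeflow rules; the file and script names (sweep.makeflow, run_algo.py) are placeholders. Adding b = 11 means extending one range and regenerating the file:

```python
# Sketch: regenerate Makeflow rules for the parameter grid.
# sweep.makeflow and run_algo.py are hypothetical names.
import itertools

a_values = range(1, 101)
b_values = range(1, 11)  # extend to range(1, 12) to also test b = 11
c_values = ["mode1", "mode2", "mode3"]

with open("sweep.makeflow", "w") as mf:
    for a, b, c in itertools.product(a_values, b_values, c_values):
        out = f"result_a{a}_b{b}_{c}.txt"
        mf.write(f"{out}: run_algo.py\n")
        mf.write(f"\tpython run_algo.py --a {a} --b {b} --c {c} > {out}\n\n")
```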

Now, your team has already suggested producing many makeflows instead of piecing together a high number of commands in one. Although this can be more modular and robust generally speaking, in a grid of parameters like this (and in even more complex bioinformatics pipelines) deciding where to modularize is not easy, and sometimes not even feasible.

To gain momentum and adoption among researchers like me, Makeflow should, in my opinion, be robust (like GNU make) to changes in the makeflow file after full or partial execution, simply updating the existing dependencies.

Some genomics pipelines take 3 weeks to complete (not my case; mine takes 2 days), and one cannot risk having to rerun from the top when a small secondary change is required.

Makeflow is easy to use and almost a no-brainer for queuing systems. I hope you can consider this request, as I would like to stick with your system; at the moment, however, working around this hard constraint of Makeflow requires a lot of time and mental energy.

Thanks.

dthain commented 6 years ago

@stemangiola we had a long discussion about this at our team meeting this week.

There are two underlying issues here:

1 - If the goal is to re-run what "changes", then we need to be careful about the whole definition of the job, including the files, the text of the command, and possibly even the relevant environment variables. Standard make just looks at the modification times of the files, which isn't complete: for example, editing a rule's command without touching any input file leaves all modification times unchanged, so make would not notice the change.

2 - The transaction log refers to jobs by their ordinal position in the makeflow file. This means the log is no longer valid if any change is made to the contents of the makeflow file. We need a way of referring to jobs that allows for modification of the file.

To fix this, we should do the following:

1 - Change the makeflow job id from a simple integer into a string in the makeflow core.

2 - Compute the makeflow job id as a "content based identifier" by computing a consistent checksum of the relevant job text (see the sketch below).

3 - When recovering the transaction log, assume that the makeflow file may have changed, so it could contain new rules not previously mentioned in the log (or vice versa).
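To make step 2 concrete, here is a minimal sketch of a content-based identifier; the exact fields to hash (command text, sorted input/output lists, pinned environment variables) are an assumption, not a settled design:

```python
import hashlib

def job_id(command, inputs, outputs, env=None):
    """Hash the job definition into a content-based identifier."""
    h = hashlib.sha256()
    h.update(command.encode())
    for name in sorted(inputs):           # order-independent file lists
        h.update(("IN:" + name).encode())
    for name in sorted(outputs):
        h.update(("OUT:" + name).encode())
    for key, value in sorted((env or {}).items()):
        h.update(f"ENV:{key}={value}".encode())
    return h.hexdigest()
```

Any edit to the command, the file lists, or the environment yields a different id, so unchanged rules keep their log entries and edited rules are re-run.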

I think if we do this, then you should be able to change the makeflow file at will, and keep re-running things until you are satisfied. And by covering the definition of the job more broadly, you get a stronger guarantee than GNU make. A nice side effect is that the log still contains the complete record of all files created, so makeflow --clean would delete all files created by previous runs of makeflow.

dthain commented 6 years ago

@nhazekam @trshaffer would you like to add anything to that?

tshaffe1 commented 6 years ago

No, that pretty much covers it. The content-based IDs need to work with JX as well, so that changes to the workflow arguments don't cause this kind of corruption.
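To illustrate the JX concern under the same assumption, the id would be computed from the rule after JX evaluation, so it reflects the concrete command actually run rather than the template text:

```python
import hashlib

def job_id(command):
    # Simplified: hash only the fully evaluated command text.
    return hashlib.sha256(command.encode()).hexdigest()[:16]

template = "python run_algo.py --b {b}"     # stands in for a JX template
for b in (10, 11):
    print(b, job_id(template.format(b=b)))  # different arguments -> different ids
```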

nhazekam commented 6 years ago

This definitely outlines what we were concerned about. The proposed solution seems sufficient and doesn't break Makeflow's static design, as the changes happen between runs and not during a run.

stemangiola commented 6 years ago

Thanks for your response,

I think this would be a great step toward applying Makeflow to pipelining, where in big teams things (e.g., algorithms to test or compare, parameters to change, etc.) change dynamically as the discussion within the team(s) proceeds. Changing a parameter for one algorithm of a bigger pipeline should ideally be as easy as replacing 2 with 3 in a text file. (I imagine this is much less trivial than it seems.)

I think the effort of achieving that, while also improving safety compared to make, is well spent.

Just out of curiosity: yes, make only looks at modification times, and that naivety makes things quite easy for pipelining, since its behavior is predictable. I assume this could be unsafe in some cases compared to checking an MD5, but I cannot think of any case where relying exclusively on a file's modification time would be unsafe. Could you give an example?

Thanks

Thanks. I am really interested to updates on this feature.

tshaffe1 commented 6 years ago

After deciding on a hashing scheme, some future PR(s) will need to sort out how job ids are reported to the user.

Hashes are useless for debugging a Makeflow. We currently report nodeids in a number of places, which isn't great. We should probably move to line numbers for all error messages. We could continue using nodeid internally within a single invocation and use hashes in the log.

Alternatively, we could just truncate the hash to 64 bits and use it as the nodeid. This approach requires minimal changes to Makeflow's behavior, but requires that we only show the user line numbers.
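A minimal sketch of that truncation, assuming a SHA-256 digest of the job text (the hash choice is an assumption):

```python
import hashlib

def truncated_nodeid(job_text):
    # Use the first 8 bytes (64 bits) of the digest as an integer nodeid.
    digest = hashlib.sha256(job_text.encode()).digest()
    return int.from_bytes(digest[:8], "big")
```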

Nekel-Seyew commented 5 years ago

@trshaffer should this be closed and a new feature request be opened instead?

tshaffe1 commented 5 years ago

That's an option. This sort of got away from me when the focus shifted to archive mode.