cooperative-computing-lab / makeflow-examples

Example workflows for the Makeflow workflow system.

Is there a limit to the tree length? Some commands are not executed even though similar ones are #38

Open stemangiola opened 3 years ago

stemangiola commented 3 years ago

I have a makeflow file with ~17K commands. Some of them, at the root of the tree, such as:

dev/test_simulation/input__slope_0.5__foreignProp_0.8__S_30__whichChanging_1__run_2.rds:
        Rscript ~/PhD/deconvolution/ARMET/dev/test_simulation_makeflow_pipeline/create_input.R 0.5 0.8 30 1 2 dev/test_simulation/input__slope_0.5__foreignProp_0.8__S_30__whichChanging_1__run_2.rds

are not executed for some reason, while other combinations of parameters are. I don't understand why.

stemangiola commented 3 years ago

makefile_test_simulation.makeflow.makeflowlog.zip

makefile_test_simulation.zip

stemangiola commented 3 years ago

As you can see, I have a few holes in my benchmark.

[image] https://user-images.githubusercontent.com/7232890/95038823-d6401480-071a-11eb-8a41-694da25d81e7.png

The workflow hangs and does not submit any more jobs, and if I interrupt it and start again, it hangs at "starting workflow".

btovar commented 3 years ago

Stefano,

I'm going through your logs now...

Ben

btovar commented 3 years ago

Stefano, which command line are you using to run the workflows?

When you say you are changing parameters, are you also changing cores, memory, etc., or only parameters of your tasks?

stemangiola commented 3 years ago

Each block of tests is run with different resources, depending on which algorithm is tested.
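
Roughly, each block declares its resources with makeflow category variables, something like this (the category name, values, and file names below are only illustrative, not the exact ones from my makefile):

CATEGORY="armet"
CORES=4
MEMORY=8192
DISK=10000

dev/test_simulation/output_armet_example.rds: dev/test_simulation/input_example.rds
        Rscript run_armet_example.R dev/test_simulation/input_example.rds dev/test_simulation/output_armet_example.rds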

Here is the command:

makeflow -T slurm -j 100 --do-not-save-failed-output test_simulation_makeflow_pipeline/makefile_test_simulation.makeflow

btovar commented 3 years ago

Could you send me the log.out file from: makeflow -T slurm -j 100 --do-not-save-failed-output test_simulation_makeflow_pipeline/makefile_test_simulation.makeflow > log.out 2>&1

stemangiola commented 3 years ago

parsing dev/test_simulation_makeflow_pipeline/makefile_test_simulation.makeflow...
local resources: 32 cores, 193277 MB memory, 148722940 MB disk
max running remote jobs: 100
max running local jobs: 100
checking dev/test_simulation_makeflow_pipeline/makefile_test_simulation.makeflow for consistency...
dev/test_simulation_makeflow_pipeline/makefile_test_simulation.makeflow has 38880 rules.
recovering from log file dev/test_simulation_makeflow_pipeline/makefile_test_simulation.makeflowlog...
checking for old running or failed jobs...
checking files for unexpected changes... (use --skip-file-check to skip this step)
starting workflow....

and then it hangs forever.

btovar commented 3 years ago

I forgot to add the -dall debug flag, sorry about that:

makeflow -dall -T slurm -j 100 --do-not-save-failed-output test_simulation_makeflow_pipeline/makefile_test_simulation.makeflow > log.out 2>&1

stemangiola commented 3 years ago

log.zip

btovar commented 3 years ago

Stefano, could you also send me dev/test_simulation_makeflow_pipeline/makefile_test_simulation.makeflow.batchlog?

stemangiola commented 3 years ago

I don't have a batchlog; I have rerun the whole workflow. I think one of the issues (not a consistent one) is that I increased the combinations in the makefile after the workflow was completed, and some of the new benchmarks did not execute.

It is common to execute the whole workflow and then try some additional parameter combinations.

btovar commented 3 years ago

Stefano, something just occurred to me: are you re-running the makeflow in place, without a cleaning operation in between? It could be that makeflow is getting confused by a mismatch between the previous execution log and a newly modified makeflow.
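
For reference, and assuming the default clean mode, the cleaning step would be something like:

makeflow -c test_simulation_makeflow_pipeline/makefile_test_simulation.makeflow

which removes the targets recorded in the log as well as the log itself, so the next run starts from scratch.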

stemangiola commented 3 years ago

That is probably the case. But does cleaning lead to the deletion of the dependencies that are already completed? Of course, if I delete the log, everything gets deleted when makeflow is called again.

btovar commented 3 years ago

Yes, they will be deleted. A safer mode of operation in this case is to not modify the original file, but instead write the updates to differently named makeflow files. Then you can execute each update in sequence.
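
For example, something along these lines (file names are just illustrative):

makeflow -T slurm -j 100 --do-not-save-failed-output makefile_test_simulation_part1.makeflow
makeflow -T slurm -j 100 --do-not-save-failed-output makefile_test_simulation_part2.makeflow

Each file then keeps its own .makeflowlog, so re-running an already completed part should just verify its outputs and finish.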

stemangiola commented 3 years ago

I understand, but this is not always possible in a combinatorial scenario.

expand_grid(
    slope = c(-2, -1, -.5, .5, 1, 2), 
    foreign_prop = c(0, 0.5, 0.8),
    S = c(30, 60, 90),
    which_changing = 1:16,
    run = 1:5,
    method = c("ARMET", "cibersort", "llsr", "epic")
)

I can expand the parameter space here with no effort. It would be great if makeflow could update the log file with the new dependencies and just add them to the tree.

Otherwise makeflow would only be suitable for static workflows.

btovar commented 3 years ago

I think that just appending new rules may be workable, with the understanding that removing a rule, or changing a previously executed rule, will result in failure. Would that be helpful for your use case?
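
For instance, an update of this kind would leave every existing rule untouched and only append new combinations at the end, with rules of the same shape as the one you showed (the run number here is just illustrative):

dev/test_simulation/input__slope_0.5__foreignProp_0.8__S_30__whichChanging_1__run_6.rds:
        Rscript ~/PhD/deconvolution/ARMET/dev/test_simulation_makeflow_pipeline/create_input.R 0.5 0.8 30 1 6 dev/test_simulation/input__slope_0.5__foreignProp_0.8__S_30__whichChanging_1__run_6.rds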

stemangiola commented 3 years ago

Yes. Usually, when benchmarking, we want to increase the number of combinations. We don't need to delete rules, since we can ignore already executed dependencies, and we could eliminate rules on another run if needed.

The issue is that if I now add rules to an existing makefile (with a log), the only ones that execute are the new ones at the bottom; the new ones in the middle are ignored. This mixed behaviour seems more unintended than designed.

btovar commented 3 years ago

Stefano, thanks for your input! Let me discuss it with the team.