KosinskiLab / AlphaPulldownSnakemake

GNU General Public License v3.0

Add clustering and padding split version #13

Closed dingquanyu closed 2 months ago

dingquanyu commented 3 months ago

Grouping and structural modelling are split into two workflow .smk files.

maurerv commented 3 months ago

Hi Geoffrey,

I reviewed your PR and liked your initial implementation, but I had some design ideas.

My goal is to let the user decide whether to use splitting/padding by changing a single boolean in the config.yaml. To accommodate this, we should avoid complex snakemake features such as checkpoints or separate workflows.

So far, the pipeline operates at the sequence/fold level. By adding the splitting procedure, as you suggested, we would move away from that, potentially making the entire pipeline less legible. Since the splitting logic appears computationally fairly light and more of a challenge in I/O terms, I think we can justify running it on the node that snakemake runs on. Snakemake itself requires a considerable amount of resources when the number of jobs is large.

Computing the splits before computing the snakemake DAG should allow us to integrate more seamlessly with the existing rule structure_inference: https://github.com/KosinskiLab/AlphaPulldownSnakemake/blob/7edc828e5f178aa4600c6509d8f3686337adaa8f/workflow/Snakefile#L245
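To make this concrete, here is a rough sketch of what computing the clusters up front could look like in plain Python, run before any rule is declared. The function name, the `bin_size` value, and the grouping criterion (padding the total fold length up to the next multiple of `bin_size`) are all assumptions for illustration, not the logic from the PR:

```python
import math
from collections import defaultdict


def cluster_folds_by_padded_length(fold_lengths, bin_size=256):
    """Group folds whose total length pads up to the same multiple of bin_size.

    fold_lengths maps a fold name to its total sequence length.
    Both the criterion and bin_size are hypothetical placeholders.
    """
    clusters = defaultdict(list)
    # Sort so the result is deterministic, which keeps the DAG stable between runs.
    for fold, length in sorted(fold_lengths.items()):
        padded = math.ceil(length / bin_size) * bin_size
        clusters[padded].append(fold)
    return dict(clusters)


# Made-up fold names and lengths, for illustration only.
fold_lengths = {"A+B": 310, "A+C": 498, "B+C": 700, "A+D": 505}
clusters = cluster_folds_by_padded_length(fold_lengths)
print(clusters)
```

Since this is ordinary Python, it could sit at the top of the Snakefile and its result could be used when the rules are defined, with no checkpoint needed.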

The output field of the rule could be adapted to return a list of folds, containing all the folds that make up a given cluster. The input field only guides the execution flow and is not otherwise used, so we don't need to adapt it. Finally, the requested_fold parameter of the rule would need to be adapted to provide the input format for multiple folds expected by run_structure_prediction.py. Parameters such as num_desired_msas could be handled similarly, perhaps through a lookup table for the given cluster, which the rule uses in the params field and finally passes to run_structure_prediction.py.
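A possible shape for such a lookup table, sketched in plain Python. The cluster naming scheme, the comma separator used to join folds, the `completed_fold.txt` marker file, and the MSA depth values are assumptions; the real format expected by run_structure_prediction.py would need to be checked:

```python
def build_cluster_table(clusters, msa_depth_for_length):
    """Build a per-cluster lookup table that a rule's params could read from.

    clusters maps a padded length to the list of folds in that cluster.
    msa_depth_for_length is a hypothetical callback that picks num_desired_msas.
    """
    table = {}
    for padded_length, folds in clusters.items():
        name = f"cluster_{padded_length}"  # hypothetical naming scheme
        table[name] = {
            "folds": list(folds),
            # Assumed separator; adjust to what the prediction script accepts.
            "requested_fold": ",".join(folds),
            "num_desired_msas": msa_depth_for_length(padded_length),
        }
    return table


def expected_outputs(table, cluster_name, output_dir="predictions"):
    """List one (hypothetical) marker file per fold in the cluster."""
    return [
        f"{output_dir}/{fold}/completed_fold.txt"
        for fold in table[cluster_name]["folds"]
    ]


# Placeholder clusters and MSA depths, for illustration only.
clusters = {512: ["A+B", "A+C"], 768: ["B+C"]}
table = build_cluster_table(clusters, lambda n: 512 if n <= 512 else 256)
print(table["cluster_512"]["requested_fold"])
print(expected_outputs(table, "cluster_512"))
```

The rule's output field could then be filled from something like `expected_outputs`, and its params could read `requested_fold` and `num_desired_msas` from the same table entry.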

These design changes would allow us to switch splitting/padding on and off without major changes to the current pipeline. Please let me know your opinion on these design ideas and I am happy to help out with the implementation :)
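As a rough illustration of the on/off switch, the sketch below returns jobs in the same shape in both modes, so the downstream rule would not need to know which mode is active. The config key `cluster_and_pad` and the helper names are hypothetical:

```python
def plan_jobs(folds, config, cluster_fn):
    """Return a list of jobs, where each job is a list of folds.

    With the flag off, every fold is its own job, which mirrors the
    current one-fold-per-job behaviour. With the flag on, cluster_fn
    decides the grouping. The key name "cluster_and_pad" is an assumption.
    """
    if not config.get("cluster_and_pad", False):
        return [[fold] for fold in folds]
    return [list(group) for group in cluster_fn(folds)]


def pair_up(folds):
    """Toy grouping function: put consecutive folds into groups of two."""
    return [folds[i:i + 2] for i in range(0, len(folds), 2)]


folds = ["A+B", "A+C", "B+C"]
jobs_off = plan_jobs(folds, {"cluster_and_pad": False}, pair_up)
jobs_on = plan_jobs(folds, {"cluster_and_pad": True}, pair_up)
print(jobs_off)
print(jobs_on)
```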

dingquanyu commented 3 months ago

Thanks a lot, Valentin, for looking into this. I really like your idea of calculating the splits before building the DAG and merging these steps into the main Snakemake file. I agree with the option you mentioned: the output field of the rule could be adapted to return a list of folds containing all the folds that make up a given cluster.