dingquanyu closed this 2 months ago
Hi Geoffrey,
I reviewed your PR and liked your initial implementation, but I had some design ideas.
My goal is to let the user decide whether to use splitting/padding by changing a single boolean in the config.yaml. To accommodate this, we should avoid complex snakemake features such as checkpoints or separate workflows.
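To make the single-boolean idea concrete, here is a minimal sketch of how the Snakefile could branch on such a flag. The flag name `use_splitting` and the helper `folds_to_run` are my assumptions for illustration, not names from the PR:

```python
# Hypothetical sketch: the Snakefile reads one boolean from config.yaml and
# branches on it. The dict below stands in for the parsed config.yaml.
config = {"use_splitting": True}  # set to False for the unmodified pipeline

def folds_to_run(all_folds, config):
    """One grouped job per cluster when splitting is on, one job per fold otherwise."""
    if config.get("use_splitting", False):
        # Placeholder grouping: the real clusters would come from the
        # splitting/padding procedure, not from a single catch-all cluster.
        return [sorted(all_folds)]
    return [[f] for f in all_folds]
```

Because this is an ordinary conditional evaluated when the Snakefile is parsed, it avoids checkpoints and separate workflows entirely.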
So far, the pipeline operates at the sequence/fold level. By adding the splitting procedure as you suggested, we would move away from that, potentially making the entire pipeline less legible. Since the splitting logic appears computationally fairly light and is more of a challenge in I/O terms, I think we can justify running it on the node that snakemake itself runs on. Snakemake already requires a considerable amount of resources when the number of jobs is large.
Computing the splits before snakemake computes the DAG should allow us to integrate more seamlessly with the existing rule `structure_inference`: https://github.com/KosinskiLab/AlphaPulldownSnakemake/blob/7edc828e5f178aa4600c6509d8f3686337adaa8f/workflow/Snakefile#L245
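The "compute splits before the DAG" idea could look like the sketch below: a plain function executed at the top of the Snakefile, so its result is available when the rules are expanded. The names `compute_clusters` and `CLUSTERS`, and the toy grouping logic, are illustrative assumptions:

```python
# Minimal sketch of computing splits before the DAG is built. Placed at the
# top of the Snakefile, this runs at parse time, i.e. before snakemake
# computes the DAG, so rules can expand over CLUSTERS directly.

def compute_clusters(folds):
    """Group folds into clusters; the real splitting/padding logic goes here.
    The round-robin grouping below is only a stand-in."""
    clusters = {}
    for i, fold in enumerate(sorted(folds)):
        clusters.setdefault(f"cluster_{i % 2}", []).append(fold)
    return clusters

# Available to every rule definition that follows in the Snakefile:
CLUSTERS = compute_clusters(["foldA", "foldB", "foldC"])
```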
The `output` field of the rule could be adapted to return a list of folds, containing all the folds that make up a given determined cluster. The `input` field only guides the execution flow and is not used in practice, so we don't need to adapt it. Finally, the `requested_fold` parameter of the rule would need to be adapted to provide the input format for multiple folds expected by `run_structure_prediction.py`. The `num_desired_msas` etc. could be handled similarly, perhaps through a lookup table for the given cluster, which the rule uses in the `params` field and finally passes to `run_structure_prediction.py`.
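A rough sketch of the lookup-table idea, as plain helpers a rule's `params` field could call. The cluster names, the comma separator for the multi-fold format, and the `num_desired_msas` values are all made-up assumptions for illustration:

```python
# Hypothetical per-cluster lookup tables the Snakefile could build from the
# splitting step and use in the params field of rule structure_inference.
CLUSTER_FOLDS = {"cluster_0": ["foldA", "foldB"], "cluster_1": ["foldC"]}
CLUSTER_PARAMS = {"cluster_0": {"num_desired_msas": 512},
                  "cluster_1": {"num_desired_msas": 2048}}

def requested_fold(cluster):
    """Join a cluster's folds into one string for run_structure_prediction.py.
    The comma separator is an assumed multi-fold input format."""
    return ",".join(CLUSTER_FOLDS[cluster])

def cluster_param(cluster, key):
    """Lookup used from the rule's params field, e.g. num_desired_msas."""
    return CLUSTER_PARAMS[cluster][key]
```

In the rule these would appear as `params` lambdas, e.g. `requested_fold=lambda wc: requested_fold(wc.cluster)`, so the shell/script directive can pass them straight to `run_structure_prediction.py`.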
These design changes would allow us to switch splitting/padding on and off without major changes to the current pipeline. Please let me know your opinion on these design ideas and I am happy to help out with the implementation :)
Thanks a lot Valentin for looking into this. I really like your idea of calculating the splits before building the DAG and merging these steps into the main Snakemake file. I agree with this option as you mentioned:

> The `output` field of the rule could be adapted to return a list of folds, containing all the folds that make up a given determined cluster.
Grouping and structural modelling are currently split into two workflow `.smk` files.