Arcadia-Science / ProteinCartography

a pipeline to build similarity maps of protein space
MIT License
30 stars 9 forks source link

`make_pdb` ESMFold jobs running even when input structure is provided #82

Closed naailkhan28 closed 9 months ago

naailkhan28 commented 9 months ago

Description of the bug

When running the Search mode Snakemake pipeline, I see that the make_pdb rule is run and a request is made to the ESMFold API, even when I've provided a .pdb file containing my input structure. My experimentally solved crystal structure is being replaced with a less accurate ESMFold prediction which is not ideal.

I've attached my full Snakemake log (it failed due to a python error but ignore this), and my input .yml config, .pdb and .fasta files.

Command used and terminal output

snakemake --snakefile Snakefile --configfile demo/1OHG/config_1OHG.yml --use-conda --cores 64 --conda-frontend conda

# Here's the top part of my Snakemake log:

Config file ./config.yml is extended by additional config specified via the command line.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 64
Rules claiming more threads will be scaled down.
Job stats:
job                                         count    min threads    max threads
----------------------------------------  -------  -------------  -------------
aggregate_features                              1              1              1
aggregate_foldseek_fraction_seq_identity        1              1              1
aggregate_lists                                 1              1              1
all                                             1              1              1
assess_pdbs                                     1              1              1
calculate_concordance                           1              1              1
dim_reduction                                   2              1              1
download_pdbs                                   1              1              1
extract_blast_hits                              1              1              1
extract_foldseek_hits                           1              1              1
fetch_uniprot_metadata                          1              1              1
filter_uniprot_hits                             1              1              1
foldseek_clustering                             1             16             16
get_source                                      1              1              1
input_distances                                 1              1              1
leiden_clustering                               1              1              1
make_pdb                                        1              1              1
map_refseqids                                   1              1              1
plot_cluster_distributions                      1              1              1
plot_interactive                                2              1              1
plot_semantic_analysis                          1              1              1
plot_similarity_leiden                          1              1              1
plot_similarity_strucluster                     1              1              1
run_blast                                       1              1              1
run_foldseek                                    1              1              1
total                                          27              1             16

Select jobs to execute...

[Thu Jan 25 12:10:17 2024]
rule run_blast:
    input: demo/1OHG/P49861.fasta
    output: demo/1OHG/output/blastresults/P49861.blastresults.tsv
    jobid: 13
    benchmark: demo/1OHG/output/benchmarks/P49861.run_blast.txt
    wildcards: protid=P49861
    resources: tmpdir=/tmp

Activating conda environment: /blockstorage/ProteinCartography/.snakemake/conda/47f42b1b7af144c2415a38423667989d
[Thu Jan 25 12:10:17 2024]
rule make_pdb:
    input: demo/1OHG/P49861.fasta
    output: demo/1OHG/P49861.pdb
    jobid: 10
    benchmark: demo/1OHG/output/benchmarks/P49861.make_pdb.txt
    wildcards: protid=P49861
    resources: tmpdir=/tmp

Relevant files

P49861_fasta_pdb_yml_config_snakemake_log.zip

System information

mezarque commented 9 months ago

Thanks for letting us know about this! We've seen this occur occasionally with other analyses as well and haven't been able to ID the cause. We'll look into it.

mezarque commented 9 months ago

One hypothesis that @mertcelebi shared is that this might occur when the timestamp of the FASTA file is later than that of the PDB file. If so, it's possible that this is a more general issue with how Snakemake detects files. Could you try running this code from within the demo/1OHG/ directory to see if it resolves the issue?

touch P49861.fasta &&
sleep .5 &&
touch P49861.pdb
mezarque commented 9 months ago

Separately from this issue, the experimentally-determined PDB provided appears to contain multiple chains, which unfortunately isn't supported yet by the pipeline. We're working to resolve the current issue, but I expect you'll run into a new error once the replacement issue is resolved.

naailkhan28 commented 9 months ago

I just did the experiment you suggested and that seemed to be the issue - when the FASTA is older than the PDB file then it works as expected and no make_pdb job is run.

However, if you swap those two touch calls (ie, the PDB is older than the FASTA) or you run them at the same time (with no sleep .5 command) then the make_pdb job is run. Looks like a tweak to the snakemake pipeline is needed!

naailkhan28 commented 9 months ago

Separately from this issue, the experimentally-determined PDB provided appears to contain multiple chains, which unfortunately isn't supported yet by the pipeline. We're working to resolve the current issue, but I expect you'll run into a new error once the replacement issue is resolved.

Before you implement a way of handling multiple chains, would it be helpful to handle multi-chain PDB files in a nicer way? Like maybe automatically taking chain A or throwing a warning or something.

mezarque commented 9 months ago

That's a helpful suggestion; we can explore whether an intermediate version like this would help.

keithchev commented 9 months ago

@naailkhan28 I think we've fixed this in #83, which should ensure that make_pdb is never called for PDB files that are in the input directory. Please let us know if this resolves this issue on your end!

naailkhan28 commented 9 months ago

Just tested the latest version, looks fixed! Great work guys :)