antonisdim / haystac

Code repository for the HAYSTAC pipeline
MIT License

database command fails #5

Closed Kelzor closed 3 years ago

Kelzor commented 3 years ago

Hello!

haystac database --mode build fails for me with the following error. Snakemake doesn't actually remove entrez-selected-seqs.tsv, though, and the file seems to be intact. Also, the .snakemake directory and log are missing. I'd appreciate any suggestions for what could be mucking things up!

Thanks, Kelly

(haystac) [keblevin@cg31-1:/scratch/keblevin/haystac]$ haystac database --mode build \
>     --query '"Yersinia"[Organism] AND "complete genome"[All Fields]' \
>     --output yersinia_example

HAYSTAC v 0.3.2

Date: 2021-05-22 18:20:33.407773

Config parameters:

 mode: build
 db_output: /scratch/keblevin/haystac/mycobacteriumDB
 query: "Mycobacterium"[Organism] AND "complete genome"[All Fields]
 bowtie2_scaling: 25.0
 rank: species
 cores: 48
 mem: 191693

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 48
Rules claiming more threads will be scaled down.
Provided resources: entrez_api=3, mem_mb=191693
Job counts:
        count   jobs
        1       calculate_db_chunks
        1       entrez_db_list
        1       entrez_nuccore_query
        1       entrez_pick_sequences
        1       entrez_taxa_query
        1       index_all_accessions
        1       index_all_db_chunks
        1       randomise_db_order
        8

[Sat May 22 18:20:35 2021]
Job 2: Fetching sequence metadata from the NCBI Nucleotide database for the query.

[Sat May 22 18:20:42 2021]
Finished job 2.
1 of 8 steps (12%) done

[Sat May 22 18:20:42 2021]
Job 3: Querying the NCBI Taxonomy database and fetching taxonomic metadata.

[Sat May 22 18:20:46 2021]
Finished job 3.
2 of 8 steps (25%) done

[Sat May 22 18:20:46 2021]
Job 1: Selecting the longest sequence per taxon in the entrez query.
Downstream jobs will be updated after completion.

Updating job entrez_db_list.
Updating job randomise_db_order.
Updating job index_all_accessions.
[Sat May 22 18:20:47 2021]
Error in rule entrez_pick_sequences:
    jobid: 1
    output: /scratch/keblevin/haystac/mycobacteriumDB/entrez/entrez-selected-seqs.tsv

Traceback (most recent call last):
  File "/home/keblevin/.conda/envs/haystac/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 570, in _callback
    callback(job)
  File "/home/keblevin/.conda/envs/haystac/lib/python3.6/site-packages/snakemake/scheduler.py", line 544, in _proceed
    job, update_dynamic=update_dynamic
  File "/home/keblevin/.conda/envs/haystac/lib/python3.6/site-packages/snakemake/dag.py", line 1348, in finish
    updated_dag = self.update_checkpoint_dependencies(jobs)
  File "/home/keblevin/.conda/envs/haystac/lib/python3.6/site-packages/snakemake/dag.py", line 1312, in update_checkpoint_dependencies
    self.postprocess()
  File "/home/keblevin/.conda/envs/haystac/lib/python3.6/site-packages/snakemake/dag.py", line 1193, in postprocess
    self.cleanup()
  File "/home/keblevin/.conda/envs/haystac/lib/python3.6/site-packages/snakemake/dag.py", line 258, in cleanup
    del self.depending[dep][job]
KeyError: entrez_taxa_query
Removing output files of failed job entrez_pick_sequences since they might be corrupted:
/scratch/keblevin/haystac/mycobacteriumDB/entrez/entrez-selected-seqs.tsv
Trying to restart job 1.

[Sat May 22 18:20:47 2021]
Job 1: Selecting the longest sequence per taxon in the entrez query.
Downstream jobs will be updated after completion.

[Sat May 22 18:20:48 2021]
Finished job 1.
3 of 5 steps (60%) done
Complete log: /scratch/keblevin/haystac/.snakemake/log/2021-05-22T182033.414375.snakemake.log

antonisdim commented 3 years ago

Hello Kelly,

Thank you for using haystac, and apologies for the very delayed response!

I noticed that in your command you are using a query for the genus Yersinia, but in the haystac output a query for the genus Mycobacterium pops up, which is odd.

Just in case, I ran two database builds, one with Yersinia and one with Mycobacterium, and both completed without any errors.

Could it be that you are using an output folder that contains the outputs from a previous database you tried to build? I did a test run for such a scenario, but in that case a validation error was raised.
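
For reference, here is a minimal sketch of one way to check whether an output folder already holds files from an earlier run. The function name has_previous_outputs is made up for illustration and is not part of haystac; yersinia_example is simply the --output value from the command above.

```shell
# Hypothetical helper (not part of haystac): succeeds if the given
# directory exists and contains at least one entry, hidden files included.
has_previous_outputs() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Assumption: this is the folder passed to --output.
db_dir="yersinia_example"

if has_previous_outputs "$db_dir"; then
    echo "$db_dir already contains files; consider a new --output path"
else
    echo "$db_dir is empty or missing"
fi
```

If the folder turns out to contain files, pointing --output at a new, empty path is a simple way to rule this scenario out.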

It would be really helpful if you could please provide a few more details on how you tried to run this command. In the meantime I will try to reproduce the error you are getting, and of course update you if I come across the cause of the problem.

Let me know what you think, and thank you for your help!

Best, Antony

antonisdim commented 3 years ago

Hello Kelly,

I hope you are doing great, and apologies for the delayed response!

I actually managed to reproduce your error. It was caused by an issue related to the latest version of snakemake.

If you would like, please follow the updated installation instructions and install the latest version of haystac from GitHub. Please do not hesitate to contact us if you run into any new issues.

Thank you for all your patience!

Best, Antony

Kelzor commented 3 years ago

Hi, Antony!

Apologies for my delayed response too. I hadn't circled back to this yet. Great news!

To clarify, should I re-install haystac via mamba or conda, as stated on the GitHub page?

Thanks, Kelly

antonisdim commented 3 years ago

Hello Kelly,

I hope you are doing great, and apologies for the late response!

I have uploaded an executable to my personal conda channel. Could you please install haystac in a fresh conda environment with the following command:

mamba install -c antonisdim haystac

Please let me know how it goes, and thank you for your patience!

Best, Antony