bhattlab / MGEfinder

A toolbox for identifying mobile genetic element (MGE) insertions from short-read sequencing data of bacterial isolates.
MIT License
105 stars 16 forks source link

Errors at combining the inferseq files. #11

Closed Bowmore12 closed 4 years ago

Bowmore12 commented 4 years ago

Hi!

I was getting the following errors when I run mgefinder workflow using my data sets. How can I resolve this issue?

Finished job 16.
7 of 16 steps (44%) done

rule make_database:
    input: ./01.mgefinder/NC_000962_3/NC_000962_3.all_inferseq.txt
    output: ./02.database/NC_000962_3/NC_000962_3.database.fna, ./02.database/NC_000962_3/NC_000962_3.database.fna.1.bt2
    jobid: 13
    benchmark: ./02.database/NC_000962_3/NC_000962_3.database.benchmark.txt
    wildcards: genome=NC_000962_3

#### CHECKING DEPENDENCIES ####
Current version of snakemake: 3.13.3
Expected version of snakemake: 3.13.3
Current version of einverted: EMBOSS:6.6.0.0
Expected version of einverted: EMBOSS:6.6.0.0
Current version of bowtie2: 2.3.5
Expected version of bowtie2: 2.3.5
Current version of samtools: 1.9
Expected version of samtools: 1.9
Current version of cd-hit: 4.8.1
Expected version of cd-hit: 4.8.1
###############################
#### PARAMETERS ####
command: makedatabase
inferseqfiles: ('./01.mgefinder/NC_000962_3/NC_000962_3.all_inferseq.txt',)
minimum_size: 30
maximum_size: 200000
threads: 1
memory: 16000
force: True
output_dir: ./02.database/NC_000962_3
prefix: NC_000962_3.database
####################
Parsing inferseq files
Combining the inferseq files...
Loading file 1/3: ./01.mgefinder/NC_000962_3/JPN-B2019-Rv-1224_S1_L001/03.inferseq_assembly.JPN-B2019-Rv-1224_S1_L001.NC_000962_3.tsv
Loading file 2/3: ./01.mgefinder/NC_000962_3/JPN-B2019-Rv-1224_S1_L001/03.inferseq_reference.JPN-B2019-Rv-1224_S1_L001.NC_000962_3.tsv
Loading file 3/3: ./01.mgefinder/NC_000962_3/JPN-B2019-Rv-1224_S1_L001/03.inferseq_overlap.JPN-B2019-Rv-1224_S1_L001.NC_000962_3.tsv
Deleting old database directory...
No termini found in the input file...
Waiting at most 5 seconds for missing files.
Error in job make_database while creating output files ./02.database/NC_000962_3/NC_000962_3.database.fna, ./02.database/NC_000962_3/NC_000962_3.database.fna.1.bt2.
MissingOutputException in line 192 of /home/user/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/Snakefile:
Missing files after 5 seconds:
./02.database/NC_000962_3/NC_000962_3.database.fna
./02.database/NC_000962_3/NC_000962_3.database.fna.1.bt2
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/mgefinder/bin/mgefinder", line 8, in <module>
    sys.exit(cli())
  File "/home/user/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/user/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/user/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/main.py", line 51, in workflow
    _workflow(workdir, snakefile, configfile, cores, memory, unlock, rerun_incomplete, keep_going)
  File "/home/user/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow.py", line 26, in _workflow
    shell(cmd)
  File "/home/user/anaconda3/envs/mgefinder/lib/python3.6/site-packages/snakemake/shell.py", line 88, in __new__
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'snakemake -s /home/user/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/Snakefile --config wd=. memory=16000 --cores 1 --configfile /home/user/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/config.yml ' returned non-zero exit status 1.
.
├── 00.assembly
│   ├── JPN-B2019-Rv-1224_S1_L001.fna
│   ├── JPN-B2019-Rv-1224_S1_L001.fna.1.bt2
│   ├── JPN-B2019-Rv-1224_S1_L001.fna.2.bt2
│   ├── JPN-B2019-Rv-1224_S1_L001.fna.3.bt2
│   ├── JPN-B2019-Rv-1224_S1_L001.fna.4.bt2
│   ├── JPN-B2019-Rv-1224_S1_L001.fna.rev.1.bt2
│   └── JPN-B2019-Rv-1224_S1_L001.fna.rev.2.bt2
├── 00.bam
│   ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.bam
│   └── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.bam.bai
├── 00.genome
│   ├── NC_000962_3.fna
│   ├── NC_000962_3.fna.1.bt2
│   ├── NC_000962_3.fna.2.bt2
│   ├── NC_000962_3.fna.3.bt2
│   ├── NC_000962_3.fna.4.bt2
│   ├── NC_000962_3.fna.amb
│   ├── NC_000962_3.fna.ann
│   ├── NC_000962_3.fna.bwt
│   ├── NC_000962_3.fna.pac
│   ├── NC_000962_3.fna.rev.1.bt2
│   ├── NC_000962_3.fna.rev.2.bt2
│   ├── NC_000962_3.fna.sa
│   └── log
│       ├── NC_000962_3.index_bowtie2.benchmark.txt
│       ├── NC_000962_3.index_bowtie2.log
│       └── NC_000962_3.index_bowtie2.log.err
├── 01.mgefinder
│   └── NC_000962_3
│       ├── JPN-B2019-Rv-1224_S1_L001
│       │   ├── 01.find.JPN-B2019-Rv-1224_S1_L001.NC_000962_3.tsv
│       │   ├── 02.pair.JPN-B2019-Rv-1224_S1_L001.NC_000962_3.tsv
│       │   ├── 03.inferseq_assembly.JPN-B2019-Rv-1224_S1_L001.NC_000962_3.tsv
│       │   ├── 03.inferseq_overlap.JPN-B2019-Rv-1224_S1_L001.NC_000962_3.tsv
│       │   ├── 03.inferseq_reference.JPN-B2019-Rv-1224_S1_L001.NC_000962_3.tsv
│       │   └── log
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.benchmark.txt
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.find.benchmark.txt
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.find.log
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.find.log.err
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.inferseq_assembly.benchmark.txt
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.inferseq_assembly.log
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.inferseq_assembly.log.err
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.inferseq_overlap.log
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.inferseq_overlap.log.err
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.inferseq_reference.benchmark.txt
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.inferseq_reference.log
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.inferseq_reference.log.err
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.pair.benchmark.txt
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.pair.log
│       │       └── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.pair.log.err
│       └── NC_000962_3.all_inferseq.txt
└── 02.database
    └── NC_000962_3
        └── NC_000962_3.database.benchmark.txt

I confirmed both generating the test dataset and analyzing them using mgefinder workflow work fine.

Thank you in advance.

Yosm

durrantmm commented 4 years ago

Thanks for this, it looks like a very similar issue to #9. My guess is that MGEfinder couldn't identify any potential MGE insertions. To double check this, could you run the command:

    wc -l yourworkdir/01.mgefinder/*inferseq*.tsv

Thanks!

durrantmm commented 4 years ago

This is entirely possible given that you only have 1 isolate. Sensitivity increases as more isolates are included.

As another check, I would like to know what you see when running:

    wc -l yourworkdir/01.mgefinder/*
Bowmore12 commented 4 years ago

Hi, durrantmm. Thank you for your prompt support!

Yes, your guess was correct. I successfully completed the workflow when I increase the number of isolates from 1 to 9.

I am very happy to use MGEfinder for my research, and I have additional questions. (please feel free to move these topics to new threads if needed).

  1. Regarding the sensitivity (and the specificity) you mentioned, could you tell me how many isolates should be included for the workflow analysis ?

  2. How can we know the genotyping results is correct or not? For example, if I can use reference strain fastq for the analysis, I can see whether MGEs are successfully detected or not by looking annotated information. But I do not know using the reference stain data means same or not for the case of clinical isolates because the pipeline do the analysis based on the reference genome.

  3. For the spades.py analysis indicated, do you recommend to use --isolates and/or --careful options ? In my case, spades always suggest to use --isolates option.

  4. To speed up the workflow analysis, I am thinking to use job scheduler (UGE). I just wonder we are very happy if the template commands are supplied. I have 1000 clinical isolates. Batch mode may be another alternative.

Many thanks.

durrantmm commented 4 years ago

Glad to hear that you find it useful.

  1. Hard to say exactly, and it depends heavily on your collection. In simulations, we found that having 10 isolates dramatically increased sensitivity, so I would say 9 isolates is a good amount.

  2. Interesting question. You should think of MGEfinder like a SNP caller for large insertions. If you aligned a reference strain fastq to the reference genome, you would (hopefully) not detect any SNPs, because they are the same isolate. This is also true for MGEfinder, you wouldn't detect any MGE insertions. If you want to detect insertions in the reference strain, you will need to use the reference strain as a query, with an alignment to some other genome in the the 00.bam directory.

There are several ways to increase confidence that the genotypes you are seeing are real. You can limit the analysis to only the MGEs that you trust to be real. For example, MGEs that contain a transposase, or MGEs that are annotated as phage. Just using those MGEs could help increase the quality of the genotyping.

  1. Yes, using the --isolate in spades would be appropriate here. Looking at their documentation "This flag is highly recommended for high-coverage isolate and multi-cell data; improves the assembly quality and running time. Not compatible with --only-error-correction or --careful options." So if you have a well-covered isolate, that flag would be appropriate.

  2. Using it with a job scheduler definitely makes sense, I have done this myself. It would be quite a bit of work for me to get that up and running for you. If you want my help doing that, maybe we could discuss a more formal collaboration.

I would recommend reading my paper if you haven't already, and/or watching my presentation. Then feel free to contact the corresponding author on the manuscript if you want to pursue a formal collaboration.