bhattlab / MGEfinder

A toolbox for identifying mobile genetic element (MGE) insertions from short-read sequencing data of bacterial isolates.
MIT License
105 stars 16 forks source link

Workflow without raw short reads #3

Closed jackarnold13 closed 4 years ago

jackarnold13 commented 4 years ago

Hi,

Upon recommendation from another group (Dr. Pamer's group at UChicago) I was interested in using MGEfinder to detect MGEs for a family of microbes. However, I ran into a couple of issues that I don't really know how to fix/where to start and I was hoping that you might be able to advise.

I started by running the tutorial to get a feel for the software and make sure it was running smoothly, and I was able to run the tutorial without any major problems. However, when I tried to run the program with my own files, I ran into some trouble.

I think that the main problem might be that I am trying to analyze pre-assembled reads that are already in .fna format.

Below are the steps I followed, and where I ran into problems:

  1. Since I was working with pre-assembled contigs, I began with the alignment step and ran the following commands. (My pre-assembled contig file was named MSK_4_13_contigs.fna and the reference genome was GCF_00373885.fna (Blautia producta downloaded from NCBI)
    % bwa index GCF_00373885.fna
    % bwa mem GCF_000373885.fna MSK_4_13_contigs.fna > MSK_4_13_contigs.GCF_000373885.sam
  2. Following the tutorial process, I then entered the mgefinder environment and ran
    % mgefinder formatbam MSK_4_13_contigs.GCF_000373885.sam MSK_4_13_contigs.GCF_000373885.bam --single-end

    Using the --single-end command since I wasn't using paired reads. And I received the output:

    
    Removing secondary alignments...
    Successfully removed secondary alignments...

Sorting the BAM file by chromosomal location... BAM file successfully sorted...

Index the sorted BAM file... BAM file successfully indexed...

3. I then assembled my working directory as follows:

test_workdir0/ ├── 00.assembly/ │ ├── MSK_4_13_contigs.fna ├── 00.bam/ │ ├── MSK_4_13_contigs.GCF_000373885.bam │ ├── MSK_4_13_contigs.GCF_000373885.bam.bai └── 00.genome/ └── GCF_000373885.fna

4. I then tried to run the `workflow` command in the mgefinder environment as follows

% mgefinder workflow --cores 4 --memory 50000 test_workdir0/

And at the `makedatabase` step, the following message appears:

CHECKING DEPENDENCIES

Current version of snakemake: 3.13.3 Expected version of snakemake: 3.13.3 Current version of einverted: EMBOSS:6.6.0.0 Expected version of einverted: EMBOSS:6.6.0.0 Current version of bowtie2: 2.3.5 Expected version of bowtie2: 2.3.5 Current version of samtools: 1.9 Expected version of samtools: 1.9 Current version of cd-hit: 4.8.1 Expected version of cd-hit: 4.8.1 ###############################

PARAMETERS

command: makedatabase inferseqfiles: ('test_workdir2/01.mgefinder/GCF_000373885/GCF_000373885.all_inferseq.txt',) minimum_size: 30 maximum_size: 200000 threads: 1 memory: 16000 force: True output_dir: test_workdir2/02.database/GCF_000373885 prefix: GCF_000373885.database #################### Parsing inferseq files Combining the inferseq files... Loading file 1/3: test_workdir2/01.mgefinder/GCF_000373885/MSK_4_13_contigs/03.inferseq_assembly.MSK_4_13_contigs.GCF_000373885.tsv Loading file 2/3: test_workdir2/01.mgefinder/GCF_000373885/MSK_4_13_contigs/03.inferseq_reference.MSK_4_13_contigs.GCF_000373885.tsv Loading file 3/3: test_workdir2/01.mgefinder/GCF_000373885/MSK_4_13_contigs/03.inferseq_overlap.MSK_4_13_contigs.GCF_000373885.tsv Deleting old database directory... No termini found in the input file... Waiting at most 5 seconds for missing files. Error in job make_database while creating output files test_workdir2/02.database/GCF_000373885/GCF_000373885.database.fna, test_workdir2/02.database/GCF_000373885/GCF_000373885.database.fna.1.bt2. MissingOutputException in line 186 of /Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/Snakefile: Missing files after 5 seconds: test_workdir2/02.database/GCF_000373885/GCF_000373885.database.fna test_workdir2/02.database/GCF_000373885/GCF_000373885.database.fna.1.bt2 This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait. Will exit after finishing currently running jobs. Exiting because a job execution failed. Look above for error message Traceback (most recent call last): File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/bin/mgefinder", line 8, in sys.exit(cli()) File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 764, in call return self.main(args, kwargs) File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 717, in main rv = self.invoke(ctx) File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 1137, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 956, in invoke return ctx.invoke(self.callback, ctx.params) File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 555, in invoke return callback(args, **kwargs) File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/main.py", line 50, in workflow _workflow(workdir, snakefile, configfile, cores, memory, unlock, rerun_incomplete, keep_going) File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow.py", line 26, in _workflow shell(cmd) File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/snakemake/shell.py", line 88, in new raise sp.CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command 'snakemake -s /Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/Snakefile --config wd=test_workdir2 memory=16000 --cores 1 --configfile /Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/config.yml' returned non-zero exit status 1.


When I try to open the files in `01.mgefinder/<genome>/` all the `.tsv` files are empty

Is there something that I am doing incorrectly, or is there another process that I should follow since my contigs are already pre-assembled?

I would love to be able to use MGEfinder, so any help would be very much appreciated! Thank you in advance!

Best regards,
Jack
durrantmm commented 4 years ago

Hi Jack, thanks for trying out MGEfinder. MGEfinder is intended to combine information from both raw (unassembled) sequencing reads and draft assemblies to increase the sensitivity of detection for large and repetitive insertions. Excluding raw reads from the analysis will severely limit how useful MGEfinder will be relative to other tools intended to identify structural variants, such as Mauve.

It is possible to create a workflow that circumvents the raw reads using MGEfinder, but it won't be as sensitive as if you use raw reads. Do you have the raw reads available?

jackarnold13 commented 4 years ago

Hi! Thanks for your quick response! I was originally using Mauve but compared to MGEfinder it seemed much more limited in terms of finding specifically MGEs for a larger set of organisms.

Unfortunately I do not have the raw reads available, just a database of pre-assembled contigs from a collaborator. How would I go about creating a workflow to circumvent the raw reads?

Thanks!

durrantmm commented 4 years ago

I can get this ready for you and I'll have it back to you soon.

jackarnold13 commented 4 years ago

Thank you so much! I appreciate all of you help! Would you like me to send you any example files?

durrantmm commented 4 years ago

I don't think that will be necessary, but if it is I'll let you know.

jackarnold13 commented 4 years ago

Sounds great, thanks!

durrantmm commented 4 years ago

Ok, here are the Snakefile and the config file you can use with the mgefinder workflow command.

workflow_noreads_config.yml.txt workflow_noreads.Snakefile.txt

You can delete your 00.bam directory, it doesn't need that. It only needs 00.assembly and 00.genome.

You can then analyze your working directory with the command mgefinder workflow --snakefile workflow_noreads.Snakefile.txt --configfile workflow_noreads_config.yml.txt workdir.

As I mentioned before, this approach will dramatically reduce sensitivity. When I run it on the tutorial dataset, I only identify 65% as many insertions as when I include the raw reads. I also am not sure what the precision would be, it could be identifying more false positives. I would be careful when interpreting your results.

Let me know if you have any issues or questions going forward.

jackarnold13 commented 4 years ago

Thank you so much! This workflow worked great and did exactly what I needed it to do, which is basically identifying potential MGEs so that I have targets to look at more closely from a larger dataset (as a starting point for a more in depth analysis - so 65% of the insertions should be plenty, even if some of them are also false positives).

Thank you again for all of your help, MGEfinder is a great tool and I'll let you know if I run into any more questions while using it!

Best, Jack

durrantmm commented 4 years ago

Happy to help! Can I ask what institution you are working at?

jackarnold13 commented 4 years ago

Of course! I'm at the University of Chicago, working jointly in the Mimee and the Nagler groups.