Closed jackarnold13 closed 4 years ago
Hi Jack, thanks for trying out MGEfinder. MGEfinder is intended to combine information from both raw (unassembled) sequencing reads and draft assemblies to increase the sensitivity of detection for large and repetitive insertions. Excluding raw reads from the analysis will severely limit how useful MGEfinder will be relative to other tools intended to identify structural variants, such as Mauve.
It is possible to create a workflow that circumvents the raw reads using MGEfinder, but it won't be as sensitive as if you use raw reads. Do you have the raw reads available?
Hi! Thanks for your quick response! I was originally using Mauve but compared to MGEfinder it seemed much more limited in terms of finding specifically MGEs for a larger set of organisms.
Unfortunately I do not have the raw reads available, just a database of pre-assembled contigs from a collaborator. How would I go about creating a workflow to circumvent the raw reads?
Thanks!
I can get this ready for you and I'll have it back to you soon.
Thank you so much! I appreciate all of you help! Would you like me to send you any example files?
I don't think that will be necessary, but if it is I'll let you know.
Sounds great, thanks!
Ok, here are the Snakefile and the config file you can use with the mgefinder workflow
command.
workflow_noreads_config.yml.txt workflow_noreads.Snakefile.txt
You can delete your 00.bam
directory, it doesn't need that. It only needs 00.assembly
and 00.genome
.
You can then analyze your working directory with the command mgefinder workflow --snakefile workflow_noreads.Snakefile.txt --configfile workflow_noreads_config.yml.txt workdir
.
As I mentioned before, this approach will dramatically reduce sensitivity. When I run it on the tutorial dataset, I only identify 65% as many insertions as when I include the raw reads. I also am not sure what the precision would be, it could be identifying more false positives. I would be careful when interpreting your results.
Let me know if you have any issues or questions going forward.
Thank you so much! This workflow worked great and did exactly what I needed it to do, which is basically identifying potential MGEs so that I have targets to look at more closely from a larger dataset (as a starting point for a more in depth analysis - so 65% of the insertions should be plenty, even if some of them are also false positives).
Thank you again for all of your help, MGEfinder is a great tool and I'll let you know if I run into any more questions while using it!
Best, Jack
Happy to help! Can I ask what institution you are working at?
Of course! I'm at the University of Chicago, working jointly in the Mimee and the Nagler groups.
Hi,
Upon recommendation from another group (Dr. Pamer's group at UChicago) I was interested in using MGEfinder to detect MGEs for a family of microbes. However, I ran into a couple of issues that I don't really know how to fix/where to start and I was hoping that you might be able to advise.
I started by running the tutorial to get a feel for the software and make sure it was running smoothly, and I was able to run the tutorial without any major problems. However, when I tried to run the program with my own files, I ran into some trouble.
I think that the main problem might be that I am trying to analyze pre-assembled reads that are already in
.fna
format.Below are the steps I followed, and where I ran into problems:
MSK_4_13_contigs.fna
and the reference genome wasGCF_00373885.fna
(Blautia producta downloaded from NCBI)mgefinder
environment and ranUsing the
--single-end
command since I wasn't using paired reads. And I received the output:Sorting the BAM file by chromosomal location... BAM file successfully sorted...
Index the sorted BAM file... BAM file successfully indexed...
test_workdir0/ ├── 00.assembly/ │ ├── MSK_4_13_contigs.fna ├── 00.bam/ │ ├── MSK_4_13_contigs.GCF_000373885.bam │ ├── MSK_4_13_contigs.GCF_000373885.bam.bai └── 00.genome/ └── GCF_000373885.fna
% mgefinder workflow --cores 4 --memory 50000 test_workdir0/
CHECKING DEPENDENCIES
Current version of snakemake: 3.13.3 Expected version of snakemake: 3.13.3 Current version of einverted: EMBOSS:6.6.0.0 Expected version of einverted: EMBOSS:6.6.0.0 Current version of bowtie2: 2.3.5 Expected version of bowtie2: 2.3.5 Current version of samtools: 1.9 Expected version of samtools: 1.9 Current version of cd-hit: 4.8.1 Expected version of cd-hit: 4.8.1 ###############################
PARAMETERS
command: makedatabase inferseqfiles: ('test_workdir2/01.mgefinder/GCF_000373885/GCF_000373885.all_inferseq.txt',) minimum_size: 30 maximum_size: 200000 threads: 1 memory: 16000 force: True output_dir: test_workdir2/02.database/GCF_000373885 prefix: GCF_000373885.database #################### Parsing inferseq files Combining the inferseq files... Loading file 1/3: test_workdir2/01.mgefinder/GCF_000373885/MSK_4_13_contigs/03.inferseq_assembly.MSK_4_13_contigs.GCF_000373885.tsv Loading file 2/3: test_workdir2/01.mgefinder/GCF_000373885/MSK_4_13_contigs/03.inferseq_reference.MSK_4_13_contigs.GCF_000373885.tsv Loading file 3/3: test_workdir2/01.mgefinder/GCF_000373885/MSK_4_13_contigs/03.inferseq_overlap.MSK_4_13_contigs.GCF_000373885.tsv Deleting old database directory... No termini found in the input file... Waiting at most 5 seconds for missing files. Error in job make_database while creating output files test_workdir2/02.database/GCF_000373885/GCF_000373885.database.fna, test_workdir2/02.database/GCF_000373885/GCF_000373885.database.fna.1.bt2. MissingOutputException in line 186 of /Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/Snakefile: Missing files after 5 seconds: test_workdir2/02.database/GCF_000373885/GCF_000373885.database.fna test_workdir2/02.database/GCF_000373885/GCF_000373885.database.fna.1.bt2 This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait. Will exit after finishing currently running jobs. Exiting because a job execution failed. Look above for error message Traceback (most recent call last): File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/bin/mgefinder", line 8, in
sys.exit(cli())
File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 764, in call
return self.main(args, kwargs)
File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, ctx.params)
File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 555, in invoke
return callback(args, **kwargs)
File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/main.py", line 50, in workflow
_workflow(workdir, snakefile, configfile, cores, memory, unlock, rerun_incomplete, keep_going)
File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow.py", line 26, in _workflow
shell(cmd)
File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/snakemake/shell.py", line 88, in new
raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'snakemake -s /Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/Snakefile --config wd=test_workdir2 memory=16000 --cores 1 --configfile /Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/config.yml' returned non-zero exit status 1.