Arkadiy-Garber / SprayNPray

Rapid and simple taxonomic profiling of genome and metagenome contigs
GNU General Public License v3.0

universal.tblout not found #5

Open CongLiu37 opened 2 years ago

CongLiu37 commented 2 years ago

Hi!

I am trying to bin contigs with SprayNPray, and my command is:

spray-and-pray.py -g /flash/HusnikU/Cong/assembly/ACAB_assembly.fna -bam /flash/HusnikU/Cong/bam/ACAB.bam -out ACAB -t 32 -ref /apps/unit/BioinfoUgrp/DB/diamondDB/ncbi/238/nr.dmnd --bin

But I got the following error:

Traceback (most recent call last):
  File "/home/c/c-liu/Softwares/SprayNPray/spray-and-pray.py", line 1166, in <module>
    tblout = open("universal.tblout")
FileNotFoundError: [Errno 2] No such file or directory: 'universal.tblout'

Wondering how to solve it. Thank you!

CongLiu37 commented 2 years ago

Also, the README says "Additionally, the user can specify a set of criteria (e.g. GC-content, read coverage, coding density, closest taxonomic hits) to re-write the provided contigs into a new FASTA file." I am wondering how to specify binning criteria. Could you please provide some example usage for this?

Arkadiy-Garber commented 2 years ago

Hi Cong,

Thanks for your interest in the SprayNPray software. I found the bug causing the error that you pasted in your first comment and just updated the script. Please do a fresh download from the GitHub repo and try again. No need to re-install the conda environment. If you continue to have issues, please let me know!

Regarding the binning criteria, I provided a few examples in the GitHub repo: https://github.com/Arkadiy-Garber/SprayNPray#decontaminating-a-pseudomonas-assembly and https://github.com/Arkadiy-Garber/SprayNPray#pulling-out-endosymbiont-genomes-from-an-assembly-of-the-mealybug-maconellicoccus-hirsutus. If it is still unclear, could you tell me more specifically what type of binning you are doing? The hierarchical clustering that SprayNPray does (based on tetranucleotide frequency, GC content, and codon usage bias) is still being optimized and isn't the strength of this tool. So if you have certain criteria in mind (e.g. GC-content, coverage, coding density), I would advise using those.

Thanks, and don't hesitate to reach out if you have additional questions. Arkadiy

CongLiu37 commented 2 years ago

Hi Arkadiy-Garber,

Thank you for your reply. Regarding binning, I want to look into both the host and its endosymbionts. So I am hoping SprayNPray could behave as it does with "--bin", but allow me to specify the binning criteria (estimating the number of genomes, clustering based on GC content, gene density, and coverage, and writing out the bins: one FASTA and one summary for each).

Sincerely,

Cong

Arkadiy-Garber commented 2 years ago

Hi Cong,

Okay, thanks for the clarification. If you are dealing with a eukaryotic host and prokaryotic symbionts, manually specifying the minimum (-gc) and maximum (-GC) GC-content, as well as coding density (-cd for minimum coding density and -CD for maximum coding density), should provide you with enough resolution to separate host and endosymbiont contigs. But if you want to use the --bin flag, please let me know how that goes. This sort of unsupervised binning (hierarchical clustering) is not something that SprayNPray does better than other well-established algorithms, like BinSanity and MetaBAT.

In other words, to subset contigs from the original assembly, you should specify the minimum and maximum coding densities and GC contents to be included in the new FASTA file (and you can also specify coverage and the expected taxonomy).
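To make the idea of subsetting by GC-content concrete, here is a minimal sketch of the general technique, not SprayNPray's own code. It computes %GC per contig from a FASTA handle and keeps contigs inside a window. The function names (`read_fasta`, `gc_percent`, `filter_by_gc`) and the example thresholds are assumptions made for illustration only.

```python
import io
from typing import Iterable, Iterator, List, Tuple


def read_fasta(handle: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_percent(seq: str) -> float:
    """Percent G+C among unambiguous bases (A, C, G, T); 0.0 if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def filter_by_gc(records: Iterable[Tuple[str, str]],
                 min_gc: float, max_gc: float) -> List[Tuple[str, str]]:
    """Keep contigs whose %GC lies within [min_gc, max_gc]."""
    return [(h, s) for h, s in records if min_gc <= gc_percent(s) <= max_gc]


# Toy input; the 15-35 %GC window is a hypothetical example, not a recommendation.
toy = io.StringIO(">contig_1\nATATGCATAT\n>contig_2\nGCGCGCATGC\n")
low_gc = filter_by_gc(read_fasta(toy), min_gc=15, max_gc=35)
print([header for header, _ in low_gc])
```

In practice the thresholds would come from inspecting the summary file of a first run, since suitable GC windows for host and symbiont differ from dataset to dataset.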

However, I recommend running SprayNPray with default settings, seeing what the output summary file looks like (basename + .csv), and then setting the criteria in subsequent runs (you can have SprayNPray skip the time-intensive BLAST step by providing the previous BLAST output via the -blast flag).
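The two-pass workflow can be sketched as a small helper that assembles the second-run command line. This is only an illustration: it uses the flags named in this thread (-blast, -gc, -GC, -cd, -CD), while the helper name `build_rerun_command`, all file names, and all threshold values are hypothetical. The name of the BLAST output file and the units expected for coding density are not stated here, so check the README or the script's help output before running anything.

```python
from typing import List, Optional


def build_rerun_command(base_cmd: List[str],
                        blast_output: str,
                        min_gc: Optional[float] = None,
                        max_gc: Optional[float] = None,
                        min_cd: Optional[float] = None,
                        max_cd: Optional[float] = None) -> List[str]:
    """Append the reuse-BLAST flag and any filter flags to a first-run command."""
    cmd = list(base_cmd) + ["-blast", blast_output]
    filters = (("-gc", min_gc), ("-GC", max_gc), ("-cd", min_cd), ("-CD", max_cd))
    for flag, value in filters:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# Hypothetical file names and thresholds, for illustration only.
first_run = ["spray-and-pray.py", "-g", "assembly.fna", "-out", "run2",
             "-ref", "nr.dmnd"]
second_run = build_rerun_command(first_run, "previous_run.blast",
                                 min_gc=20, max_gc=35)
print(" ".join(second_run))
# The list could then be passed to subprocess.run(second_run, check=True).
```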

Let me know if you have other questions. Curious to see how your results turn out!

Cheers, Arkadiy

CongLiu37 commented 2 years ago

Hi Arkadiy-Garber,

Thank you for your reply. I picked 5 assemblies to test, and SprayNPray found 2-5 bins per assembly. CheckM showed that often only 1 bin can be annotated as bacterial, while the others are placed at the root. I suspect that this is because I did not apply a length filter. Is it possible that short contigs (e.g. <1 kbp) disturb binning? I am also wondering if you have any suggestion for the minimum contig size to use for binning.

Sincerely

Cong

Arkadiy-Garber commented 2 years ago

Hi Cong,

For SprayNPray, I would suggest a minimum contig length of 1000 bp, since that is about the size of a single bacterial gene. However, for things like tetranucleotide frequency, %GC-content, and coverage to start being reliable metrics, I would say that you will need contigs longer than 5000 bp. Programs like Anvio default to 2000 bp as the minimum size, but I think this will still lead to a lot of noise.
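One way to choose between these cutoffs is to check how much of the assembly survives each one. The sketch below is a generic illustration, not part of SprayNPray: it tallies contig lengths from FASTA lines and reports how many contigs and bases remain at 1000, 2000, and 5000 bp. The function names (`contig_lengths`, `retained_at`) and the toy data are assumptions.

```python
from typing import Dict, Iterable, Tuple


def contig_lengths(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Map each contig ID (first word of the header) to its sequence length."""
    lengths: Dict[str, int] = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def retained_at(lengths: Dict[str, int], min_len: int) -> Tuple[int, int]:
    """Return (number of contigs, total bases) for contigs of at least min_len bp."""
    kept = [n for n in lengths.values() if n >= min_len]
    return len(kept), sum(kept)


# Toy assembly; for real data, pass an open file handle instead of a list.
toy_lines = [">c1", "A" * 600, ">c2 extra description", "A" * 1500, "C" * 1000,
             ">c3", "G" * 6000]
toy_lengths = contig_lengths(toy_lines)
for cutoff in (1000, 2000, 5000):
    n_contigs, n_bases = retained_at(toy_lengths, cutoff)
    print(f"min length {cutoff}: {n_contigs} contigs, {n_bases} bp")
```

If most of the assembly disappears at 5000 bp, that is a sign that composition-based metrics alone will be noisy for this dataset.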

If your assembly does not have very many contigs > 5 kb, then using taxonomic information on contigs as short as 1000 bp (that have at least one gene) will help. Feel free to send along a sample SprayNPray output file from one of your assembly runs. I can take a look and perhaps provide better insights into the best way to move forward with your data.

Cheers, Arkadiy