Open amcomeau opened 1 week ago
I have created the following overview (see below), which I will also include on the front page.
First off, the Krona files should have been deleted at the end of the analysis (except for the krona.html file). The others are intermediate files used to create the Krona plot. I'm not sure what happened there, unless the --debug flag was on or they are leftovers from the run that crashed.
The 10,000 genes that you're referring to would be the total number of key genes found by Metascan, not the total number of all genes present. So the numbers in the total.ovw file give an estimate of the potential of the genes or processes within the larger metabolic pathways in a sample.
I tried to explain the numbers in the total.ovw file in the explanation below. I hope it makes sense. If not, let me know!
total.ovw : overview file of the analysis of the key genes
Each metabolic cycle is represented by a set of key-genes.
If a depth (coverage) file is supplied:
So if we have two genes (a and b) and three bins (I, II, III), with the following depth:
I (abb) | II (aa) | III (b) |
---|---|---|
x | x | x |
x | x | x |
x | x | |
x | | |
x | | |
x | | |
Total gene count = 6 (3(abb) + 2(aa) +1(b))
Total organism count = 3 (1(I)+1(II)+1(III))
Total depth = 11 (6+3+2)
Total gene depth = 26 ( (3x6) + (2x3) + (1x2))
this would yield the following outcome:
gene | N#gene | %gene | N#org | %Org | O-Depth | %O-Depth | G-Depth | %G-Depth |
---|---|---|---|---|---|---|---|---|
a | 3 (1+2+0) | 50% (3/6) | 2 (1+1+0) | 66.7% (2/3) | 9 (6+3+0) | 81.8% (9/11) | 12 (6+(3+3)+0) | 46.2% (12/26) |
b | 3 (2+0+1) | 50% (3/6) | 2 (1+0+1) | 66.7% (2/3) | 8 (6+0+2) | 72.7% (8/11) | 14 ((6+6)+0+2) | 53.8% (14/26) |
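To double-check the arithmetic, the worked example can be reproduced with a minimal Python sketch. This is not Metascan's own code: the `bins` dictionary and the `overview` function are hypothetical names used for illustration, and the percentages are rounded to one decimal place.

```python
# Toy data from the example: gene content and depth (coverage) per bin.
bins = {
    "I": {"genes": "abb", "depth": 6},
    "II": {"genes": "aa", "depth": 3},
    "III": {"genes": "b", "depth": 2},
}


def pct(part, total):
    """Percentage rounded to one decimal place."""
    return round(100 * part / total, 1)


def overview(bins):
    """Compute the per-gene columns of the example table."""
    values = list(bins.values())
    total_genes = sum(len(b["genes"]) for b in values)                    # 6
    total_orgs = len(values)                                              # 3
    total_depth = sum(b["depth"] for b in values)                         # 11
    total_gene_depth = sum(len(b["genes"]) * b["depth"] for b in values)  # 26

    rows = {}
    for gene in sorted({g for b in values for g in b["genes"]}):
        # Copies of the gene summed over all bins.
        n_gene = sum(b["genes"].count(gene) for b in values)
        # Bins (organisms) that carry the gene at least once.
        n_org = sum(1 for b in values if gene in b["genes"])
        # Depth of those bins, counted once per bin.
        o_depth = sum(b["depth"] for b in values if gene in b["genes"])
        # Depth counted once per gene copy.
        g_depth = sum(b["genes"].count(gene) * b["depth"] for b in values)
        rows[gene] = {
            "N#gene": n_gene, "%gene": pct(n_gene, total_genes),
            "N#org": n_org, "%Org": pct(n_org, total_orgs),
            "O-Depth": o_depth, "%O-Depth": pct(o_depth, total_depth),
            "G-Depth": g_depth, "%G-Depth": pct(g_depth, total_gene_depth),
        }
    return rows


for gene, row in overview(bins).items():
    print(gene, row)
```

Note that the %Org and %O-Depth columns do not add up to 100% across genes, because a bin that carries several key genes is counted once for each of them.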
Besides the generic overview files, Metascan creates a number of files for each bin/(meta)genome/fasta file.
XXXXXXXX.ovw : overview file of the keygenes in the bin.
XXXXXXXX.tsv : overview file in tab format.
XXXXXXXX.gbk : NCBI genbank file.
XXXXXXXX.gff : gff file
XXXXXXXX.fna : fna file
XXXXXXXX.fsa : fsa file
XXXXXXXX.sqn : sequin file
XXXXXXXX.embl : ENA embl file
XXXXXXXX.log : log file
XXXXXXXX.f16 : fasta file containing rRNA sequences
XXXXXXXX.tabel : feature table
XXXXXXXX.txt : general info on the annotation
XXXXXXXX.kegg : File that can be used to reconstruct pathways in KEGG (https://www.genome.jp/kegg/mapper/reconstruct.html)
XXXXXXXX.fall : all genes in bases (CDS and rRNA)
XXXXXXXX.hmm.faa : Genes found through the HMM algorithm. I.e. the metabolic genes (Amino Acids)
XXXXXXXX.hmm.ffn : Genes found through the HMM algorithm. I.e. the metabolic genes (Nucleic Acids)
XXXXXXXX.all.faa : All annotated genes (both through HMM (metabolic) and the legacy Prokka annotation (non-metabolic)) (Amino Acids)
XXXXXXXX.all.ffn : All annotated genes (both through HMM (metabolic) and the legacy Prokka annotation (non-metabolic)) (Nucleic Acids)
XXXXXXXX.total.sort.tbl : (sorted, by score) intermediate file of the hits found by Metascan for each CDS
XXXXXXXX.total.uniq.tbl : intermediate file containing the top hit for each CDS
XXXXXXXX.aaonly.tsv : Overview file when using pre gene-called ORFs instead of a nucleic fasta file
hydrogenases/ : contains the fasta (nucleic and amino-acids) of the hydrogenases
phages/ : contains the fasta (nucleic and amino-acids) of the viral genes found (if applicable)
OK thanks for these explanations - I'm going to be going through them this week (one minor thing to fix is that you should have X.table for the extension above). In the meantime, I assume the fact I still have the extra Krona files is due to an error at the end of running that wasn't able to produce the final Krona chart (I don't have the final krona.html file):
[13:10:35] Annotation finished successfully.
[13:10:35] Walltime used: 2185.72 minutes
[13:10:35] If you use this result please cite the Metascan paper:
doi:https://doi.org/10.3389/fbinf.2022.861505
[13:10:35] This script is based on: Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics. 30(14):2068-9.
[13:10:35] Type 'prokka --citation' for more details.
[13:10:35] ************************************
[13:12:33] Deleting unwanted file: input_fasta_assembled//analyzedfastas.txt
[13:12:33] Deleting unwanted file: input_fasta_assembled//gensum.txt
[13:12:33] Deleting unwanted file: input_fasta_assembled//gendepthsum.txt
[13:12:33] Deleting unwanted file: input_fasta_assembled//orgdepthsum.txt
[13:12:33] Deleting unwanted file: input_fasta_assembled//keggsum.txt
[13:12:33] Deleting unwanted file: input_fasta_assembled//file_hash.txt
[13:12:33] Deleting unwanted file: input_fasta_assembled//file_locus01.txt
[13:12:33] Deleting unwanted file: input_fasta_assembled//file_locusVQ.txt
[13:12:33] Deleting unwanted file: input_fasta_assembled//file_contid.txt
[13:12:33] Deleting unwanted file: input_fasta_assembled//file_idcont.txt
[13:12:33] Running: ktImportText input_fasta_assembled\/\/krona\.g\.tsv\,Genes input_fasta_assembled\/\/krona\.gd\.tsv\,Gene\ Depth input_fasta_assembled\/\/krona\.o\.tsv\,Organisms input_fasta_assembled\/\/krona\.od\.tsv\,Organism\ Depth input_fasta_assembled\/\/krona\.mod\.g\.tsv\,Modules\ Genes input_fasta_assembled\/\/krona\.mod\.gd\.tsv\,Modules\ Gene\ Depth input_fasta_assembled\/\/krona\.mod\.o\.tsv\,Modules\ Organisms input_fasta_assembled\/\/krona\.mod\.od\.tsv\,Modules\ Organism\ Depth input_fasta_assembled\/\/krona\.proc\.g\.tsv\,Process\ Genes input_fasta_assembled\/\/krona\.proc\.gd\.tsv\,Process\ Gene\ Depth input_fasta_assembled\/\/krona\.proc\.o\.tsv\,Process\ Organisms input_fasta_assembled\/\/krona\.proc\.od\.tsv\,Process\ Organism\ Depth -o input_fasta_assembled//krona.html
sh: 1: ktImportText: not found
[13:12:34] Could not run command: ktImportText input_fasta_assembled\/\/krona\.g\.tsv\,Genes input_fasta_assembled\/\/krona\.gd\.tsv\,Gene\ Depth input_fasta_assembled\/\/krona\.o\.tsv\,Organisms input_fasta_assembled\/\/krona\.od\.tsv\,Organism\ Depth input_fasta_assembled\/\/krona\.mod\.g\.tsv\,Modules\ Genes input_fasta_assembled\/\/krona\.mod\.gd\.tsv\,Modules\ Gene\ Depth input_fasta_assembled\/\/krona\.mod\.o\.tsv\,Modules\ Organisms input_fasta_assembled\/\/krona\.mod\.od\.tsv\,Modules\ Organism\ Depth input_fasta_assembled\/\/krona\.proc\.g\.tsv\,Process\ Genes input_fasta_assembled\/\/krona\.proc\.gd\.tsv\,Process\ Gene\ Depth input_fasta_assembled\/\/krona\.proc\.o\.tsv\,Process\ Organisms input_fasta_assembled\/\/krona\.proc\.od\.tsv\,Process\ Organism\ Depth -o input_fasta_assembled//krona.html
PS: As a follow-up to the above, you do not mention that Krona is a dependency...but do you not have it installed in the conda env which sets up Metascan? I am running inside your latest conda.
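Since the intermediate krona.*.tsv files were left in place, the chart can probably be rebuilt without re-running the annotation once KronaTools is on the PATH. Below is a minimal Python sketch under that assumption. The file names and labels are copied from the log above, while `krona_command`, `krona_available` and `rerun_krona` are hypothetical helper names and not part of Metascan.

```python
import shutil
import subprocess

# Intermediate files and chart labels, as listed in the failed log command.
KRONA_INPUTS = [
    ("krona.g.tsv", "Genes"),
    ("krona.gd.tsv", "Gene Depth"),
    ("krona.o.tsv", "Organisms"),
    ("krona.od.tsv", "Organism Depth"),
    ("krona.mod.g.tsv", "Modules Genes"),
    ("krona.mod.gd.tsv", "Modules Gene Depth"),
    ("krona.mod.o.tsv", "Modules Organisms"),
    ("krona.mod.od.tsv", "Modules Organism Depth"),
    ("krona.proc.g.tsv", "Process Genes"),
    ("krona.proc.gd.tsv", "Process Gene Depth"),
    ("krona.proc.o.tsv", "Process Organisms"),
    ("krona.proc.od.tsv", "Process Organism Depth"),
]


def krona_command(outdir):
    """Build the ktImportText argument list (file,label pairs plus -o)."""
    args = ["ktImportText"]
    args += [f"{outdir}/{name},{label}" for name, label in KRONA_INPUTS]
    args += ["-o", f"{outdir}/krona.html"]
    return args


def krona_available():
    """True if the ktImportText executable can be found on the PATH."""
    return shutil.which("ktImportText") is not None


def rerun_krona(outdir):
    """Rebuild krona.html from the leftover files; returns False if Krona is missing."""
    if not krona_available():
        print("ktImportText not found on PATH; install KronaTools first.")
        return False
    # Passing a list avoids the shell escaping seen in the log output.
    subprocess.run(krona_command(outdir), check=True)
    return True
```

Calling `rerun_krona("input_fasta_assembled")` from the directory where Metascan was started would then either write krona.html or report that KronaTools is still missing. KronaTools is distributed on bioconda as the `krona` package, if that matches your setup.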
Hello again, I closed the previous issue I was having as I feel we resolved this - the program ran now after I did a first assembly step to reduce the complexity of a full MGS file of raw data.
Now I'd like to discuss the output (now that I have some!) - could you give a brief overview of what all the files are? The upper set of them appear to be for Krona graphs, but what are all the sub-extension/versions? The metagenome.tsv appears to be all the hits summarized:
And then when I look into the total.tsv, could you describe exactly what the numbers are? I'm assuming total # of hits, which could be different from # of contigs (if multiple copies), then number of organisms found in, which here is 1 since I gave it only one FASTA file (even if multiple contigs in file from one sample)? The %gene seems a bit high if that is simply the 232 divided by the total # of genes found in all the contigs - would imply only about 10,000 genes to get about 2%:
Thanks!