Explanation of Outputs - GitHub Issues

amcomeau commented 1 week ago

Hello again, I closed the previous issue I was having as I feel we resolved it - the program now runs after I did a first assembly step to reduce the complexity of a full MGS file of raw data.

Now I'd like to discuss the output (now that I have some!) - could you give a brief overview of what all the files are? The upper set of them appear to be for Krona graphs, but what are all the sub-extension/versions? The metagenome.tsv appears to be all the hits summarized:

[image: output]

And then when I look into the total.tsv, could you describe exactly what the numbers are? I'm assuming total # of hits, which could be different from # of contigs (if multiple copies), then number of organisms found in, which here is 1 since I gave it only one FASTA file (even if multiple contigs in file from one sample)? The %gene seems a bit high if that is simply the 232 divided by the total # of genes found in all the contigs - would imply only about 10,000 genes to get about 2%:

[image: output2]

Thanks!

gcremers commented 4 days ago

I have created the following overview (see below), which I will also include in the frontpage.

First off, the krona files should have been deleted after the end of the analysis (except for the krona.html file). The others are intermediate files to create the krona plot. I'm not sure what happened there, unless the --debug flag was on or if they are left-overs from the run that crashed.

The number of 10,000 genes that you're referring to would be the total number of key-genes that were found by Metascan, not the total number of every gene present. So the numbers in the total.ovw file give an estimate of the potential of the genes or processes within the larger metabolic pathways in a sample.

I tried to explain the numbers in the total.ovw file in the explanation below. I hope it makes sense. If not, let me know!

bin.id : list of the bins and their directory name, as created by metascan
depths.bins : file with the depth of each bin (if applicable)
krona.html : krona file containing the information of the total analysis (see total.ovw)
metagenome.tsv : all metabolic annotated proteins in tab format
mod.tsv : overview of the modules (similar to the Kegg modules https://www.genome.jp/kegg/module.html)
proc.tsv : overview of the processes (similar to the Kegg processes)
phage.tsv : overview file for phage proteins (if applicable)
prodigal.txt : data file containing raw prodigal information.
ribosomal.ovw : overview file of the ribosomal RNAs per bin
total.tsv : total overview of the analysis in tab format (see total.ovw; contains all metabolic genes, if applicable)
total.ovw : overview file of the analysis of the key genes

Each metabolic cycle is represented by a set of key-genes.
- N#gene : number of times the key-gene was found in the total analysis
- %gene : % of the gene compared to the total key-genes found
- N#org : total number of organisms (bins) the key-gene is found in
- %Org : % of the organisms that have this key-gene, compared to the total number of organisms found
If a depth (coverage) file is supplied:
- O-Depth : total depth of all the organisms that have this key-gene
- %O-Depth : % of the depth of all organisms with the key-gene, compared to the total depth of all bins
- G-Depth : total depth of all the genes in the set (thus adjusted for multi-copies of key-genes in a genome)
- %G-Depth : % of the depth of all genes in the set compared to the total depth of all key-genes
So if we have two genes (a and b) and three bins (I, II, III), with the following depth:

I abb	II aa	III b
x	x	x
x	x	x
x	x	
x		
x		
x		

(Each x is one unit of depth, so bin I has depth 6, bin II has depth 3 and bin III has depth 2.)
x

Total gene count = 6 (3(abb) + 2(aa) +1(b))

Total organism count = 3 (1(I)+1(II)+1(III))

Total depth = 11 (6+3+2)

Total gene depth = 26 ( (3x6) + (2x3) + (1x2))

this would yield the following outcome:

gene	N#gene	%gene	N#org	%Org	O-Depth	%O-Depth	G-Depth	%G-Depth
a	3 (1+2+0)	50% (3/6)	2 (1+1+0)	66.7% (2/3)	9 (6+3+0)	81.8% (9/11)	12 (6+(3+3)+0)	46.2% (12/26)
b	3 (2+0+1)	50% (3/6)	2 (1+0+1)	66.7% (2/3)	8 (6+0+2)	72.7% (8/11)	14 ((6+6)+0+2)	53.8% (14/26)
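
The arithmetic in this example can be reproduced with a few lines of Python. This is only a sketch of the calculation as described in this thread, not Metascan's actual code; the `bins` dictionary and the `summarize` function are hypothetical names made up for the illustration.

```python
# Toy data from the example: copies of each key-gene per bin, and depth per bin.
# (Hypothetical structure for illustration; not a Metascan file format.)
bins = {
    "I":   {"genes": {"a": 1, "b": 2}, "depth": 6},
    "II":  {"genes": {"a": 2},         "depth": 3},
    "III": {"genes": {"b": 1},         "depth": 2},
}


def summarize(bins):
    """Compute the total.ovw-style columns for every key-gene in `bins`."""
    total_genes = sum(sum(b["genes"].values()) for b in bins.values())
    total_orgs = len(bins)
    total_depth = sum(b["depth"] for b in bins.values())
    # Every gene copy contributes the depth of the bin it sits in.
    total_gene_depth = sum(
        sum(b["genes"].values()) * b["depth"] for b in bins.values()
    )

    rows = {}
    for gene in sorted({g for b in bins.values() for g in b["genes"]}):
        holders = [b for b in bins.values() if b["genes"].get(gene, 0) > 0]
        n_gene = sum(b["genes"][gene] for b in holders)
        o_depth = sum(b["depth"] for b in holders)
        g_depth = sum(b["genes"][gene] * b["depth"] for b in holders)
        rows[gene] = {
            "N#gene": n_gene,
            "%gene": 100 * n_gene / total_genes,
            "N#org": len(holders),
            "%Org": 100 * len(holders) / total_orgs,
            "O-Depth": o_depth,
            "%O-Depth": 100 * o_depth / total_depth,
            "G-Depth": g_depth,
            "%G-Depth": 100 * g_depth / total_gene_depth,
        }
    return rows


rows = summarize(bins)
for gene, cols in rows.items():
    print(gene, {name: round(value, 1) for name, value in cols.items()})
```

Note that the %Org and %O-Depth columns do not add up to 100% across genes, because one bin can carry several key-genes, while %gene and %G-Depth do.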

Besides the generic overview files, Metascan creates a number of files for each bin/(meta)genome/fasta file.

XXXXXXXX.ovw : overview file of the keygenes in the bin.
XXXXXXXX.tsv : overview file in tab format.
XXXXXXXX.gbk : NCBI genbank file.
XXXXXXXX.gff : gff file
XXXXXXXX.fna : fna file
XXXXXXXX.fsa : fsa file
XXXXXXXX.sqn : sequin file
XXXXXXXX.embl : ENA embl file
XXXXXXXX.log : log file
XXXXXXXX.f16 : fasta file containing rRNA sequences
XXXXXXXX.tabel : feature table
XXXXXXXX.txt : general info on the annotation
XXXXXXXX.kegg : File that can be used to reconstruct pathways in KEGG (https://www.genome.jp/kegg/mapper/reconstruct.html)
XXXXXXXX.fall : all genes in bases (CDS and rRNA)
XXXXXXXX.hmm.faa : Genes found through the HMM algorithm, i.e. the metabolic genes (amino acids)
XXXXXXXX.hmm.ffn : Genes found through the HMM algorithm, i.e. the metabolic genes (nucleic acids)
XXXXXXXX.all.faa : All annotated genes, both through HMM (metabolic) and the legacy Prokka annotation (non-metabolic) (amino acids)
XXXXXXXX.all.ffn : All annotated genes, both through HMM (metabolic) and the legacy Prokka annotation (non-metabolic) (nucleic acids)
XXXXXXXX.total.sort.tbl : (sorted, by score) intermediate file of the hits found by Metascan for each CDS
XXXXXXXX.total.uniq.tbl : intermediate file containing the top hit for each CDS
XXXXXXXX.aaonly.tsv : Overview file when using pre gene-called ORFs instead of a nucleic fasta file
hydrogenases/ : contains the fasta (nucleic and amino-acids) of the hydrogenases
phages/ : contains the fasta (nucleic and amino-acids) of the viral genes found (if applicable)

amcomeau commented 2 days ago

OK thanks for these explanations - I'm going to be going through them this week (one minor thing to fix is that you should have X.table for the extension above). In the meantime, I assume the fact I still have the extra Krona files is due to an error at the end of running that wasn't able to produce the final Krona chart (I don't have the final krona.html file):

[13:10:35] Annotation finished successfully.
[13:10:35] Walltime used: 2185.72 minutes
[13:10:35] If you use this result please cite the Metascan paper:
doi:https://doi.org/10.3389/fbinf.2022.861505
[13:10:35] This script is based on: Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics. 30(14):2068-9.
[13:10:35] Type 'prokka --citation' for more details.
[13:10:35] ************************************
[13:12:33] Deleting unwanted file: input_fasta_assembled//analyzedfastas.txt
[13:12:33] Deleting unwanted file: input_fasta_assembled//gensum.txt
[13:12:33] Deleting unwanted file: input_fasta_assembled//gendepthsum.txt
[13:12:33] Deleting unwanted file: input_fasta_assembled//orgdepthsum.txt
[13:12:33] Deleting unwanted file: input_fasta_assembled//keggsum.txt
[13:12:33] Deleting unwanted file: input_fasta_assembled//file_hash.txt
[13:12:33] Deleting unwanted file: input_fasta_assembled//file_locus01.txt
[13:12:33] Deleting unwanted file: input_fasta_assembled//file_locusVQ.txt
[13:12:33] Deleting unwanted file: input_fasta_assembled//file_contid.txt
[13:12:33] Deleting unwanted file: input_fasta_assembled//file_idcont.txt
[13:12:33] Running: ktImportText input_fasta_assembled\/\/krona\.g\.tsv\,Genes input_fasta_assembled\/\/krona\.gd\.tsv\,Gene\ Depth input_fasta_assembled\/\/krona\.o\.tsv\,Organisms input_fasta_assembled\/\/krona\.od\.tsv\,Organism\ Depth input_fasta_assembled\/\/krona\.mod\.g\.tsv\,Modules\ Genes input_fasta_assembled\/\/krona\.mod\.gd\.tsv\,Modules\ Gene\ Depth input_fasta_assembled\/\/krona\.mod\.o\.tsv\,Modules\ Organisms input_fasta_assembled\/\/krona\.mod\.od\.tsv\,Modules\ Organism\ Depth input_fasta_assembled\/\/krona\.proc\.g\.tsv\,Process\ Genes input_fasta_assembled\/\/krona\.proc\.gd\.tsv\,Process\ Gene\ Depth input_fasta_assembled\/\/krona\.proc\.o\.tsv\,Process\ Organisms input_fasta_assembled\/\/krona\.proc\.od\.tsv\,Process\ Organism\ Depth -o input_fasta_assembled//krona.html
sh: 1: ktImportText: not found
[13:12:34] Could not run command: ktImportText input_fasta_assembled\/\/krona\.g\.tsv\,Genes input_fasta_assembled\/\/krona\.gd\.tsv\,Gene\ Depth input_fasta_assembled\/\/krona\.o\.tsv\,Organisms input_fasta_assembled\/\/krona\.od\.tsv\,Organism\ Depth input_fasta_assembled\/\/krona\.mod\.g\.tsv\,Modules\ Genes input_fasta_assembled\/\/krona\.mod\.gd\.tsv\,Modules\ Gene\ Depth input_fasta_assembled\/\/krona\.mod\.o\.tsv\,Modules\ Organisms input_fasta_assembled\/\/krona\.mod\.od\.tsv\,Modules\ Organism\ Depth input_fasta_assembled\/\/krona\.proc\.g\.tsv\,Process\ Genes input_fasta_assembled\/\/krona\.proc\.gd\.tsv\,Process\ Gene\ Depth input_fasta_assembled\/\/krona\.proc\.o\.tsv\,Process\ Organisms input_fasta_assembled\/\/krona\.proc\.od\.tsv\,Process\ Organism\ Depth -o input_fasta_assembled//krona.html

amcomeau commented 2 days ago

PS: As a follow-up to the above, you do not mention that Krona is a dependency...but do you not have it installed in the conda env which sets up Metascan? I am running inside your latest conda.
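
For anyone hitting the same "ktImportText: not found" message, a quick way to confirm that the Krona tool is missing from the active environment is to look it up on the PATH. The snippet below is only a sketch using Python's standard library; the `find_tool` helper is a made-up name and is not part of Metascan.

```python
import shutil


def find_tool(name="ktImportText"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


path = find_tool()
if path is None:
    # Same situation as in the log above: the shell cannot find the command.
    print("ktImportText not found on PATH; the Krona plot step cannot run.")
else:
    print(f"ktImportText found at {path}")
```

If the tool turns out to be missing, installing KronaTools into the same environment and re-running the ktImportText command from the log should in principle produce krona.html, since the intermediate krona.*.tsv files are still in the output directory. That has not been verified in this thread.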

gcremers / metascan

Explanation of Outputs #5