Closed davidvilanova closed 7 years ago
Hi,
did you try kaijuReport?
It will summarize the percentage and number of reads for each taxon for a given rank.
For example, the output of kaijuReport looks like this when using -r genus
:
% reads genus
-------------------------------------------
26.355562 1324367 Acidithiobacillus
13.232080 664912 Thermoproteus
11.305055 568079 Thiomonas
1.762607 88571 Sulfolobus
-------------------------------------------
0.515284 25893 Viruses
5.868995 294917 classified above rank genus
9.373433 471015 belong to a genus with less than 1% of all reads
-------------------------------------------
31.586984 1587246 unclassified
Otherwise you can also use kaiju2krona, which prints the read counts for each taxon at all ranks.
Peter
Thanks. The OTU table is very useful for other applications such as alpha div and beta div ( i personnaly use the phyloSeq package). The following R snippet is using the output from addTaxoNames that looks like these (i prefer this format because it has all taxonomy (not just one rank). In my case for one sample i have several bins and so several summary files.
C k141_6665 216816 Actinobacteria; Bifidobacteriales; Actinobacteria; Bifidobacteriaceae; Bifidobacterium; Bifidobacterium longum;
C k141_3896 216816 Actinobacteria; Bifidobacteriales; Actinobacteria; Bifidobacteriaceae; Bifidobacterium; Bifidobacterium longum;
C k141_2569 216816 Actinobacteria; Bifidobacteriales; Actinobacteria; Bifidobacteriaceae; Bifidobacterium; Bifidobacterium longum;
C k141_5693 216816 Actinobacteria; Bifidobacteriales; Actinobacteria; Bifidobacteriaceae; Bifidobacterium; Bifidobacterium longum;
C k141_6214 216816 Actinobacteria; Bifidobacteriales; Actinobacteria; Bifidobacteriaceae; Bifidobacterium; Bifidobacterium longum;
C k141_707 216816 Actinobacteria; Bifidobacteriales; Actinobacteria; Bifidobacteriaceae; Bifidobacterium; Bifidobacterium longum;
C k141_3191 216816 Actinobacteria; Bifidobacteriales; Actinobacteria; Bifidobacteriaceae; Bifidobacterium; Bifidobacterium longum;
The you parse the summary files for one sample:
path="../summaryfiles/" # folder of your summary files for one sample
filelist = list.files(path,pattern="*.summary")
df=list()
for (i in 1:length(filelist))
{
a <- read.table(file=paste(path,filelist[i],sep=""),sep=";",header=F,stringsAsFactors = F)
temp= strsplit(as.character(a[,1]),split="\t")
mat <- matrix(unlist(temp), ncol=4, byrow=TRUE)
a[,1] <- mat[,4]
a <- data.frame(a[1:ncol(a)-1])
colnames(a) = c("Phylum", "Class" , "Order" , "Family" , "Genus" , "Species")
df[[i]] <- data.frame(a)
cat("Read file:",filelist[i],"\n")
}
#Dataframe with all summaries files into one taxonomic file
alldf = Reduce(function(...) merge(..., all=T), df)
otutable <- aggregate(counts ~., data=transform(alldf,counts=1), length)
The file looks like these. so far only 1 sample. Iterating through all summary files of all your samples it would be straight to get the OTU table.
"Phylum" "Class" "Order" "Family" "Genus" "Species" "counts"
"1" "Firmicutes" " Bacillales" " Bacilli" " Bacillaceae" " Bacillus" " Bacillus subtilis" 52
"2" "Bacteroidetes" " Bacteroidales" " Bacteroidia" " Bacteroidaceae" " Bacteroides" " Bacteroides vulgatus" 60
"3" "Actinobacteria" " Bifidobacteriales" " Actinobacteria" " Bifidobacteriaceae" " Bifidobacterium" " Bifidobacterium longum" 27
I am wondering if there is a more or less well defined definition of an OTU for metagenomic sequencing, which is not as straightforward compared with 16S amplicon sequencing. In your example, do you count the last entry of each line as a distinct taxon or only when a read is assigned down to the species level?
For example, it could be that 1000 reads are just assigned to the genus level and no read is assigned to a species in that genus:
1000 Actinobacteria; Bifidobacteriales; Actinobacteria; Bifidobacteriaceae; Bifidobacterium;
That means, there is at least one member of that genus present, so it should be counted as one OTU, which is a lower bound, because there could as well be more than one species from that genus present.
At the same time, if we get reads assigned to species and genus, which frequently happens, like
100 Actinobacteria; Bifidobacteriales; Actinobacteria; Bifidobacteriaceae; Bifidobacterium;
1000 Actinobacteria; Bifidobacteriales; Actinobacteria; Bifidobacteriaceae; Bifidobacterium; Bifidobacterium longum;
Then, I would only count that as one OTU and not two, since we identified one species in the genus without evidence for a second species. This serves as a lower bound again.
The way i build the OTU is the same used for 16s, working on identical taxonomical units. So in your second case it will be counted twice, since one is failing to assign a species it could correspond to another species. Focusing in your second example: Si i would have an OTU with 100 times unknown species and a second OTU with 1000 times the same species:
For metagenomics I´m not sure this is a valid approach as for 16s. I haven´t find seen much around. But how would you compare samples or do alpha diversity studies ?
Peter, I have checked out and found that common (still experimental) practice is to build an OTU table based on matching your proteins to a custom gene reference database (such as IMG). You can still kaiju but directly from your predicted genes or proteins. Thanks
When I run this script on my kaiju summary files, is showed this results: Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 8 elements
Hi, I was wondering if you have already a script to build an OTU table from a kaiju output ?? In my case i have used Kaiju to assign taxonomy to different bins.
Thanks, david