bioinformatics-centre / kaiju

Fast taxonomic classification of metagenomic sequencing reads using a protein reference database
http://kaiju.binf.ku.dk
GNU General Public License v3.0
266 stars 67 forks source link

OTU table #17

Closed davidvilanova closed 7 years ago

davidvilanova commented 7 years ago

Hi, I was wondering if you have already a script to build an OTU table from a kaiju output ?? In my case i have used Kaiju to assign taxonomy to different bins.

Thanks, david

pmenzel commented 7 years ago

Hi, did you try kaijuReport? It will summarize the percentage and number of reads for each taxon for a given rank. For example, the output of kaijuReport looks like this when using -r genus:

        %           reads       genus
-------------------------------------------
26.355562         1324367       Acidithiobacillus
13.232080          664912       Thermoproteus
11.305055          568079       Thiomonas
 1.762607           88571       Sulfolobus
-------------------------------------------
 0.515284           25893       Viruses
 5.868995          294917       classified above rank genus 
 9.373433          471015       belong to a genus with less than 1% of all reads
-------------------------------------------
31.586984         1587246       unclassified

Otherwise you can also use kaiju2krona, which prints the read counts for each taxon at all ranks.

Peter

davidvilanova commented 7 years ago

Thanks. The OTU table is very useful for other applications such as alpha div and beta div ( i personnaly use the phyloSeq package). The following R snippet is using the output from addTaxoNames that looks like these (i prefer this format because it has all taxonomy (not just one rank). In my case for one sample i have several bins and so several summary files.

C       k141_6665       216816  Actinobacteria; Bifidobacteriales; Actinobacteria; Bifidobacteriaceae; Bifidobacterium; Bifidobacterium longum; 
C       k141_3896       216816  Actinobacteria; Bifidobacteriales; Actinobacteria; Bifidobacteriaceae; Bifidobacterium; Bifidobacterium longum; 
C       k141_2569       216816  Actinobacteria; Bifidobacteriales; Actinobacteria; Bifidobacteriaceae; Bifidobacterium; Bifidobacterium longum; 
C       k141_5693       216816  Actinobacteria; Bifidobacteriales; Actinobacteria; Bifidobacteriaceae; Bifidobacterium; Bifidobacterium longum; 
C       k141_6214       216816  Actinobacteria; Bifidobacteriales; Actinobacteria; Bifidobacteriaceae; Bifidobacterium; Bifidobacterium longum; 
C       k141_707        216816  Actinobacteria; Bifidobacteriales; Actinobacteria; Bifidobacteriaceae; Bifidobacterium; Bifidobacterium longum; 
C       k141_3191       216816  Actinobacteria; Bifidobacteriales; Actinobacteria; Bifidobacteriaceae; Bifidobacterium; Bifidobacterium longum; 

The you parse the summary files for one sample:

path="../summaryfiles/"  # folder of your summary files for one sample
filelist = list.files(path,pattern="*.summary")

df=list()
for (i in 1:length(filelist))
{
  a <- read.table(file=paste(path,filelist[i],sep=""),sep=";",header=F,stringsAsFactors = F)
  temp= strsplit(as.character(a[,1]),split="\t")
  mat <- matrix(unlist(temp), ncol=4, byrow=TRUE)
  a[,1] <- mat[,4]
  a <- data.frame(a[1:ncol(a)-1])
  colnames(a) = c("Phylum",  "Class"  , "Order" ,  "Family" , "Genus" ,  "Species")
  df[[i]] <- data.frame(a)
  cat("Read file:",filelist[i],"\n")

}

#Dataframe with all summaries files into one taxonomic file
alldf = Reduce(function(...) merge(..., all=T), df)
otutable <- aggregate(counts ~., data=transform(alldf,counts=1), length)

The file looks like these. so far only 1 sample. Iterating through all summary files of all your samples it would be straight to get the OTU table.

"Phylum"    "Class" "Order" "Family"    "Genus" "Species"   "counts"
"1" "Firmicutes"    " Bacillales"   " Bacilli"  " Bacillaceae"  " Bacillus" " Bacillus subtilis"    52
"2" "Bacteroidetes" " Bacteroidales"    " Bacteroidia"  " Bacteroidaceae"   " Bacteroides"  " Bacteroides vulgatus" 60
"3" "Actinobacteria"    " Bifidobacteriales"    " Actinobacteria"   " Bifidobacteriaceae"   " Bifidobacterium"  " Bifidobacterium longum"   27
pmenzel commented 7 years ago

I am wondering if there is a more or less well defined definition of an OTU for metagenomic sequencing, which is not as straightforward compared with 16S amplicon sequencing. In your example, do you count the last entry of each line as a distinct taxon or only when a read is assigned down to the species level?

For example, it could be that 1000 reads are just assigned to the genus level and no read is assigned to a species in that genus:

1000  Actinobacteria; Bifidobacteriales; Actinobacteria; Bifidobacteriaceae; Bifidobacterium; 

That means, there is at least one member of that genus present, so it should be counted as one OTU, which is a lower bound, because there could as well be more than one species from that genus present.

At the same time, if we get reads assigned to species and genus, which frequently happens, like

100  Actinobacteria; Bifidobacteriales; Actinobacteria; Bifidobacteriaceae; Bifidobacterium;
1000  Actinobacteria; Bifidobacteriales; Actinobacteria; Bifidobacteriaceae; Bifidobacterium; Bifidobacterium longum;

Then, I would only count that as one OTU and not two, since we identified one species in the genus without evidence for a second species. This serves as a lower bound again.

davidvilanova commented 7 years ago

The way i build the OTU is the same used for 16s, working on identical taxonomical units. So in your second case it will be counted twice, since one is failing to assign a species it could correspond to another species. Focusing in your second example: Si i would have an OTU with 100 times unknown species and a second OTU with 1000 times the same species:

For metagenomics I´m not sure this is a valid approach as for 16s. I haven´t find seen much around. But how would you compare samples or do alpha diversity studies ?

davidvilanova commented 7 years ago

Peter, I have checked out and found that common (still experimental) practice is to build an OTU table based on matching your proteins to a custom gene reference database (such as IMG). You can still kaiju but directly from your predicted genes or proteins. Thanks

mattoslmp commented 1 year ago

When I run this script on my kaiju summary files, is showed this results: Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 8 elements