JensUweUlrich / Taxor

Fast and space-efficient taxonomic classification of long reads
BSD 3-Clause "New" or "Revised" License

Interpretation of cami-report vs binning-file #7

Closed dalofa closed 1 month ago

dalofa commented 1 month ago

Hi,

Once again thanks for writing taxor!

I am wondering how I should interpret the CAMI report versus the binning file. Looking at the CAMI report, it seems the reads stem from two species:

cat barcode43.cami
@SampleID:barcode43
@Version:0.10.0
@Ranks:superkingdom|phylum|class|order|family|genus|species
@@TAXID RANK    TAXPATH TAXPATHSN       PERCENTAGE
2       superkingdom    2       Bacteria        100
1224    phylum  2|1224  Bacteria|Pseudomonadota 12.9099
1239    phylum  2|1239  Bacteria|Bacillota      87.0901
28216   class   2|1224|28216    Bacteria|Pseudomonadota|Betaproteobacteria      12.9099
91061   class   2|1239|91061    Bacteria|Bacillota|Bacilli      87.0901
1385    order   2|1239|91061|1385       Bacteria|Bacillota|Bacilli|Bacillales   87.0901
80840   order   2|1224|28216|80840      Bacteria|Pseudomonadota|Betaproteobacteria|Burkholderiales      12.9099
119060  family  2|1224|28216|80840|119060       Bacteria|Pseudomonadota|Betaproteobacteria|Burkholderiales|Burkholderiaceae     12.9099
90964   family  2|1239|91061|1385|90964 Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae 87.0901
1279    genus   2|1239|91061|1385|90964|1279    Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus  87.0901
1822464 genus   2|1224|28216|80840|119060|1822464       Bacteria|Pseudomonadota|Betaproteobacteria|Burkholderiales|Burkholderiaceae|Paraburkholderia    12.9099
1290    species 2|1239|91061|1385|90964|1279|1290       Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus|Staphylococcus hominis   87.0901
134536  species 2|1224|28216|80840|119060|1822464|134536        Bacteria|Pseudomonadota|Betaproteobacteria|Burkholderiales|Burkholderiaceae|Paraburkholderia|Paraburkholderia caledonica        12.9099

However, looking at the binning file, only 2 out of 8293 reads are assigned a TAXID. Does this mean that the abundance estimation is based on only these two reads?

JensUweUlrich commented 1 month ago

Yes, for the CAMI report file, your assumption is correct. In such a case, I recommend using the sequence abundance file instead, which also takes the unclassified reads into account when estimating abundance across the whole sample.
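
To make the difference concrete, here is a minimal Python sketch (not part of Taxor) that sums the PERCENTAGE column per rank in a CAMI profile and compares it with the fraction of reads that were classified at all. The helper name `rank_totals` is hypothetical, the input is assumed to be tab-separated as in the CAMI profiling format, and the read counts (2 classified out of 8293) are taken from the question above.

```python
from collections import defaultdict


def rank_totals(profile_text):
    """Sum the PERCENTAGE column per rank in a CAMI profile (hypothetical helper)."""
    totals = defaultdict(float)
    for line in profile_text.splitlines():
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        # Assumed column order: TAXID, RANK, TAXPATH, TAXPATHSN, PERCENTAGE.
        _taxid, rank, _taxpath, _taxpathsn, percentage = line.split("\t")
        totals[rank] += float(percentage)
    return dict(totals)


# A shortened excerpt of the profile from the question, with explicit tabs.
profile = "\n".join([
    "@SampleID:barcode43",
    "@@TAXID\tRANK\tTAXPATH\tTAXPATHSN\tPERCENTAGE",
    "2\tsuperkingdom\t2\tBacteria\t100",
    "1224\tphylum\t2|1224\tBacteria|Pseudomonadota\t12.9099",
    "1239\tphylum\t2|1239\tBacteria|Bacillota\t87.0901",
])

totals = rank_totals(profile)

# Read counts as reported in the question: 2 of 8293 reads got a TAXID.
classified_reads, total_reads = 2, 8293
classified_fraction = classified_reads / total_reads

print(totals)
print(f"{classified_fraction:.4%} of reads were classified")
```

Each rank in the profile adds up to roughly 100 percent even though only a tiny share of the reads was classified, which is why the CAMI report alone cannot show how much of the sample remained unassigned.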

dalofa commented 1 month ago

Thanks a ton!