DessimozLab / OMArk

GNU Lesser General Public License v3.0

Granulosicoccus genera not being classified in a good way by omark #26

Open Bio-finder opened 7 months ago

Bio-finder commented 7 months ago

Hello,

Thank you first for this software, which is very easy to use and provides nice insights. I wanted to report an issue with the classification of the genus Granulosicoccus, which is classified as Burkholderiaceae although it actually belongs to Gammaproteobacteria; Chromatiales. Here is one example:

Detected species

Main species

Clade: f__Burkholderiaceae Number of associated query proteins: 524 (9.08%)

Potential Contaminants

Potential contaminant Nº1

Clade: c__Alphaproteobacteria Number of associated query proteins: 375 (6.50%)

BUSCO classifies the bacterium correctly, which is why I wanted to report this as a possible bug. Have you already observed such an issue in bacterial classification? Best regards,

YanNevers commented 7 months ago

Dear @Bio-finder ,

Thank you for using our tool and for your kind words.

OMArk has only been extensively tested on eukaryotes so far, mainly because it assumes vertical descent, which is more common there. As a result, I do not have much experience with this kind of error, but these results do look odd. I would be happy to investigate further if you wish. Could you show me the complete .sum file for this particular example? And would you be willing to share the proteome so I can replicate the results and investigate in more depth?

Thanks again, Yannis

Bio-finder commented 7 months ago

Dear Yannis, do you have an email address to which I could send a download link? The proteome should not be shared publicly because it is unpublished data. Best regards, Benoît

Bio-finder commented 7 months ago

Here is the complete .sum file:

The selected clade was f__Burkholderiaceae

Number of conserved HOGs is: 1411

Results on conserved HOGs is:

S:Single:S, D:Duplicated[U:Unexpected,E:Expected],M:Missing

S:1036,D:17[U:17,E:0],M:358

S:73.42%,D:1.20%[U:1.20%,E:0.00%],M:25.37%

On the whole proteome, there are 5773 proteins

Of which:

A:Consistent (taxonomically)[P:Partial hits,F:Fragmented], I: Inconsistent (taxonomically)[P:Partial hits,F:Fragmented], C: Likely Contamination[P:Partial hits,F:Fragmented], U: Unknown

A:3352[P:491,F:104],I:776[P:272,F:29],C:360[P:54,F:6],U:1285

A:58.06%[P:8.51%,F:1.80%],I:13.44%[P:4.71%,F:0.50%],C:6.24%[P:0.94%,F:0.10%],U:22.26%

From HOG placement, the detected species are:

Clade NCBI taxid Number of associated proteins Percentage of proteome's total

f__Burkholderiaceae -1117906034 524 9.08%

Potential contaminants:

c__Alphaproteobacteria -1200549282 375 6.50%
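
For anyone who wants to sanity-check the figures in a .sum file like the one above, the percentages can be recomputed from the raw counts. The sketch below is not part of OMArk: the helper names `parse_counts` and `percent` and the regular expression are assumptions made for illustration, and it only handles count lines shaped like the ones quoted in this thread.

```python
import re


def parse_counts(line):
    """Parse a count line such as 'S:1036,D:17[U:17,E:0],M:358' into a dict.

    Sub-categories inside brackets are stored as 'Parent_Child', e.g. 'D_U'.
    (Hypothetical helper, not an OMArk function.)
    """
    counts = {}
    parent = None
    # Each token: a category letter, a colon, an integer, and optionally a
    # bracket that opens or closes a group of sub-categories.
    for match in re.finditer(r"([A-Z]):(\d+)(\[|\])?", line):
        key, value, bracket = match.group(1), int(match.group(2)), match.group(3)
        name = f"{parent}_{key}" if parent else key
        counts[name] = value
        if bracket == "[":
            parent = key
        elif bracket == "]":
            parent = None
    return counts


def percent(part, total):
    """Return part/total as a percentage rounded to two decimals."""
    return round(100 * part / total, 2)


hogs = parse_counts("S:1036,D:17[U:17,E:0],M:358")
proteins = parse_counts(
    "A:3352[P:491,F:104],I:776[P:272,F:29],C:360[P:54,F:6],U:1285"
)

# Top-level categories should add up to the totals reported in the file.
assert hogs["S"] + hogs["D"] + hogs["M"] == 1411
assert sum(proteins[k] for k in "AICU") == 5773

print("Missing conserved HOGs (%):", percent(hogs["M"], 1411))
print("Unknown proteins (%):", percent(proteins["U"], 5773))
```

The counts in this report are internally consistent: the single, duplicated and missing HOGs sum to the 1411 conserved HOGs, and the four protein categories sum to the 5773 proteins in the proteome.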

YanNevers commented 7 months ago

Dear Benoît,

Thank you! Judging from the .sum file, OMArk seems to be affected by the lack of a close relative of this species in our database, which hampers finding the right clade. Still, I cannot explain why Burkholderiaceae was picked here, barring horizontal gene transfer (HGT). You can reach me at yannis (dot) nevers (at) unil (dot) ch if you'd like to send me more data so I can look into what went wrong. I will of course keep it locally and delete it afterwards to guarantee privacy.