DessimozLab / OMArk

GNU Lesser General Public License v3.0

Question about Inconsistent HOG #23

Closed xiaoyezao closed 4 months ago

xiaoyezao commented 11 months ago

Dear OMArk designer,

Thanks for this great tool. I have run OMAmer and then OMArk on my own proteomes. In the OMArk results, there is one category called "Inconsistent". The genes in this category are supposed to be dubious gene models or spurious annotations of non-coding regions. But if they are not real genes, why do they share similarities with HOGs from other lineages? Could it be that some HOGs are simply missing from the reference database?

In the OMA reference database, sampling at shallow phylogenetic levels (e.g., below family level) is limited. For example, I am working on the plant family Asteraceae and trying to map my proteomes to OMA HOGs. In the OMA reference database, there are only two species in Asteraceae and 11 species in the higher node asterids. If some genes are missing from the reference genomes because of incomplete assembly or annotation, the related HOGs would also be missing from the OMA database. As a result, when we do the OMAmer mapping, the genes belonging to the missing HOGs will be misplaced into HOGs outside the target lineage.

To move forward with the OMArk results, should I just discard these inconsistent genes or keep them?

Looking forward to hearing about your thoughts.

Thank you,

Tao

YanNevers commented 11 months ago

Dear Tao,

Thanks for your kind words. In short, you are right: genes that are "true" but absent from the OMA database would be labelled as Inconsistent by OMArk if their gene family is documented only outside of the target taxon. It is thus best not to discard them, especially if they are not flagged as fragments or partial mappings.

In the current version of OMArk, genes with no detectable homologs (which may be dubious gene models) tend to be placed as Inconsistent as well, generally with some structural defects (Fragment/Partial mapping). After publishing the OMArk preprint, we realised that this behaviour was due to a tendency of OMArk to place even sequences with no clearly significant k-mer sharing signal.

We will very soon (we are currently aiming for next week) release a new version of OMAmer with an updated statistical model of the k-mer distribution in our database, which will avoid placing these sequences. They will then appear in the Unknown category in OMArk. With this new version, proteins placed as Inconsistent will be either undetected contamination or gene families not clearly known to exist in the taxon of interest, leaving most of the dubious gene models in the Unknown category.

With this new version, as with the current one, we do not recommend throwing out genes placed in the Unknown or Inconsistent categories based on OMArk results alone. You can use the size of these categories to choose between different annotations of your genome (all other things being equal, having more "Consistent" genes is better). To filter out the dubious genes, you could submit the genes in the Inconsistent and Unknown categories to other tests to decide whether it is best to keep them: do you have transcript evidence for these genes? Are the proteins they code for short and of low complexity? Do they contain repeats? We think it is best to combine these different factors when filtering.
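
One way to combine these lines of evidence is sketched below in Python. It is only an illustration and not part of OMArk: the `GeneEvidence` record, the thresholds (50 residues, 2.5 bits of entropy, 50% repeat overlap) and the rule of requiring at least two independent red flags are all assumptions to be tuned for your own data. Low complexity is approximated here by the Shannon entropy of the amino-acid composition; a dedicated low-complexity or repeat masker would be more appropriate in practice.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class GeneEvidence:
    """Hypothetical per-gene record, filled in from your own analyses."""
    gene_id: str
    protein_seq: str
    has_transcript_support: bool  # e.g. derived from RNA-seq alignments
    repeat_fraction: float        # fraction of the gene overlapping annotated repeats


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a sequence."""
    if not seq:
        return 0.0
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in Counter(seq).values())


def count_red_flags(gene: GeneEvidence,
                    min_length: int = 50,
                    min_entropy: float = 2.5,
                    max_repeat_fraction: float = 0.5) -> int:
    """Count independent signs that a gene model may be dubious (thresholds are assumed)."""
    flags = 0
    if not gene.has_transcript_support:
        flags += 1
    if len(gene.protein_seq) < min_length:
        flags += 1
    if shannon_entropy(gene.protein_seq) < min_entropy:
        flags += 1
    if gene.repeat_fraction > max_repeat_fraction:
        flags += 1
    return flags


def is_dubious(gene: GeneEvidence, min_flags: int = 2) -> bool:
    """Flag a gene only when several lines of evidence agree."""
    return count_red_flags(gene) >= min_flags
```

Applied only to genes that OMArk reports as Inconsistent or Unknown, a rule like this keeps the OMArk category as one factor among several rather than as the sole filter.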

We hope this answers your question. We will let you know here when the new OMAmer and OMArk versions are available.

xiaoyezao commented 11 months ago

Thank you so much! I am looking forward to the new versions.

YanNevers commented 10 months ago

Dear Tao,

It took longer than we expected, but the new versions of OMArk and OMAmer have been released. You can try them out by installing OMAmer>=2.0.0 and OMArk>=0.3.0 via pip, or by cloning both repositories and installing them locally. The new versions require a new, leaner OMAmer database. You can download it from https://omabrowser.org/oma/current/; the right files are the ones with the suffix 2.0.0.
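
To confirm that an environment actually has the versions mentioned above, a small check like the following may help. It is a sketch, not an official tool: it assumes the installed distributions are registered under the names `omamer` and `omark`, and it compares only the leading numeric parts of the version strings, ignoring suffixes such as `rc1`.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions quoted in this thread; the distribution names are an assumption.
MINIMUM_VERSIONS = {"omamer": "2.0.0", "omark": "0.3.0"}


def parse_version(text: str) -> tuple:
    """Turn '2.1.0rc1' into (2, 1, 0), keeping only leading digits of each part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings numerically, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_installation(minimums=MINIMUM_VERSIONS) -> dict:
    """Map each package name to True if it is installed and recent enough."""
    report = {}
    for name, minimum in minimums.items():
        try:
            report[name] = meets_minimum(version(name), minimum)
        except PackageNotFoundError:
            report[name] = False
    return report


if __name__ == "__main__":
    for name, ok in check_installation().items():
        print(f"{name}: {'OK' if ok else 'missing or too old'}")
```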

Changes to OMAmer and OMArk are described in their latest release notes. The main changes are a new measure of statistical significance in OMAmer and an optimisation of its code to make it faster and less memory-hungry. The code of OMArk has been modified to accommodate the new OMAmer version, but its behaviour is the same as before. The main difference you will notice is that many proteins that were previously Inconsistent will now be classified as Unknown because of improved placement. Some may stay Inconsistent, in particular in the presence of low-complexity regions, but their number is minimal.

As mentioned before, placement of proteins in the Unknown and Inconsistent categories by OMArk is not on its own proof that they are wrong, but you can consider it in combination with other information to decide whether they are false positives.