DessimozLab / OMArk

GNU Lesser General Public License v3.0

Fragmented proteins #22

Closed DiegoSafian closed 3 months ago

DiegoSafian commented 10 months ago

Hi,

Is it possible to use the fragmented and partial hits to edit the annotation file? I think my annotation contains several fragmented genes, so it would be incredibly useful if one could use this information to update the GTF.

My best, Diego

YanNevers commented 10 months ago

Hello @DiegoSafian ,

You can find the information on fragmented genes in the file with the .ump extension, under the categories whose names end in "fragmented". I believe this, together with the OMAmer file, could give you hints about which genes are actually fragmented in your annotation.
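To make this concrete, here is a minimal Python sketch for pulling the fragmented identifiers out of a .ump file. It assumes, and this is only an assumption to check against your own output, that the file is FASTA-like: each category is introduced by a `>CategoryName` line and followed by one protein identifier per line. The function names are hypothetical and not part of OMArk.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def read_ump_categories(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group protein identifiers by category.

    ASSUMPTION: '>Category' header lines, each followed by one
    protein identifier per line. Verify this against your .ump file.
    """
    categories: Dict[str, List[str]] = defaultdict(list)
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            current = line[1:].strip()
            categories[current]  # register the category even if it stays empty
        elif current is not None:
            categories[current].append(line)
    return dict(categories)


def fragmented_ids(categories: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep only the categories whose name ends in 'fragmented' (case-insensitive)."""
    return {
        name: ids
        for name, ids in categories.items()
        if name.lower().endswith("fragmented")
    }
```

With a real file you would call something like `read_ump_categories(open("my_proteome.ump"))`, where the file name is a placeholder, and then pass the result to `fragmented_ids`.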

For now, we have no easy way to use this information to edit the annotation file directly. It is a somewhat complex task: sometimes only one or a few parts of the sequence are annotated, fragments can result from fragmented assemblies, for example, and there may be edge cases where editing the annotation introduces errors (two fragments of tandem-duplicated genes come to mind). We would be happy to help, but an automatic script to do this from OMArk results would need to be rigorously tested beforehand.

In the meantime, I am working on a companion notebook to facilitate access to the fragment and partial mapping information in OMArk and to identify fragments that likely correspond to the same gene. I hope this will already be of some help. I'll see what I can add to it to provide "correction" suggestions.
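As a rough illustration of what "fragments that likely correspond to the same gene" could mean in practice, the sketch below groups fragmented proteins that were placed in the same HOG. It is not the notebook's method. It assumes the OMAmer file is tab-separated with a header row, that comment lines start with `!`, and that the relevant columns are named `qseqid` and `hogid`; all of these are assumptions to check against your own file, and the column names can be overridden. Sharing a HOG is only a hint: tandem duplicates would also end up in the same group.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List


def group_fragments_by_hog(
    omamer_lines: Iterable[str],
    fragment_ids: Iterable[str],
    id_col: str = "qseqid",   # ASSUMED column name for the protein identifier
    hog_col: str = "hogid",   # ASSUMED column name for the assigned HOG
) -> Dict[str, List[str]]:
    """Return HOGs hit by two or more fragmented proteins."""
    wanted = set(fragment_ids)
    # ASSUMPTION: lines starting with '!' are comments preceding the header row.
    data_lines = (line for line in omamer_lines if not line.startswith("!"))
    groups: Dict[str, List[str]] = defaultdict(list)
    for row in csv.DictReader(data_lines, delimiter="\t"):
        hog = row[hog_col]
        if row[id_col] in wanted and hog not in ("", "N/A"):
            groups[hog].append(row[id_col])
    # A HOG with a single fragment gives no evidence of a split gene model.
    return {hog: ids for hog, ids in groups.items() if len(ids) > 1}
```

Candidate groups found this way still need to be checked by eye, for example in a genome browser, to confirm that the fragments are adjacent and do not overlap.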

I will post an update here when this is ready (next week at the latest).

YanNevers commented 9 months ago

Hello Diego,

Sorry for the delay. Our implementation of fragment handling was held up by changes we made to the OMAmer and OMArk code as part of an improved release. This release is now done, and you can find the Contextualize_OMA.ipynb Jupyter Notebook in our utils folder. There is a section in this notebook to help correct fragments using information from our OMA database.

It will select all HOGs with detected fragments in your results and download protein sequences of close homologs to help you investigate this fragmentation. We then recommend using Miniprot to map these sequences onto the genome. Note that it works only with the latest release of the OMAmer databases, since it uses the OMA Browser API to obtain the sequences. We tested this code by investigating fragments in a proteome known to be fragmented and, with the help of a genome browser, we could identify the correct boundaries of the fragmented proteins.
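The Miniprot step could be scripted roughly as follows. This is a sketch rather than part of the notebook: the file names are placeholders, and the helper functions are hypothetical. It relies only on Miniprot's `-t` (threads) and `--gff` (GFF3 output) options, with the genome given before the protein file.

```python
import shutil
import subprocess
from typing import List


def miniprot_command(genome: str, proteins: str, threads: int = 4) -> List[str]:
    """Build the command that aligns protein sequences to a genome as GFF3."""
    return ["miniprot", "-t", str(threads), "--gff", genome, proteins]


def run_miniprot(genome: str, proteins: str, out_gff: str, threads: int = 4) -> None:
    """Run Miniprot and write its standard output (the GFF3) to out_gff."""
    if shutil.which("miniprot") is None:
        raise FileNotFoundError("miniprot was not found on PATH")
    with open(out_gff, "w") as handle:
        subprocess.run(
            miniprot_command(genome, proteins, threads),
            stdout=handle,
            check=True,  # raise if Miniprot exits with an error
        )
```

For example, `run_miniprot("genome.fa", "hog_homologs.faa", "homologs.gff")` would produce a GFF3 track that can be loaded into a genome browser next to the existing annotation to compare gene boundaries.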

As said before, implementing this automatically would require work, but I would be happy to receive feedback on this tool and advice on how to proceed further.

Thank you for your patience!

DiegoSafian commented 9 months ago

Hi Yan, Thanks for the update. I will check it immediately!