DessimozLab / OMArk

GNU Lesser General Public License v3.0

HOG taxonomy seems buggy #28

Open MmasterT opened 7 months ago

MmasterT commented 7 months ago

Hello, I'm trying to understand how OMArk identifies which taxonomic rank to use. I'm checking the OMArk result for an annotation of Bombus terrestris (common bee). When I specify the -r order flag, it says that the rank is too narrow or does not exist. However, the default lineage it uses is Aculeata, which is the infraorder for this species. Why is this happening?

Also, the HOGs used to check consistency are different from the ones used to check completeness:

16727 HOGs are associated to the query's lineage and will be used for consistency assesment
6496 conserved ancestral HOGs will be used from completeness assesment

I'm including the command used, along with its stderr and stdout, below:

/usr/bin/time -v omark -v --taxid 30195 --og_fasta annotation.faa  --database /databases/omark/15Nov2023/All/LUCA.h5 --isoform_file annotation.splice -f annotation.omamer  -r order -o ./omark_output
INFO: Starting OMArk
INFO: Input parameters passed validity check
INFO: Extracting data from input file: annotation.omamer
INFO: An isoform_file was provided.
INFO: Extracting data from isoform file  annotation.splice
INFO: Determinating species composition from HOG placements
INFO: A taxid was provided. The query taxon is Apinae
INFO: The provided taxonomic rank order was not an option (too narrow or absent from our lineage option). Default ancestral lineage will be used.
INFO: Ancestral lineage is Aculeata
INFO: Estimating ancestral and conserved HOG content
INFO: 16727 HOGs are associated to the query's lineage and will be used for consistency assesment
INFO: 6496 conserved ancestral HOGs will be used from completeness assesment
INFO: Comparing the query gene repertoire to lineage-associated HOGs
INFO: Comparing the query gene repertoire to conserved ancestral HOGs
INFO: Writing OMArk output files
INFO: Done

YanNevers commented 7 months ago

Hello!

Thank you for reaching out! There is indeed an issue with the way OMArk handles taxonomic rank in this case. Hymenoptera (the order) is not explicitly stored in the OMAmer database for species sampling reasons: all Hymenoptera in OMA are also Apocrita, so only the most specific grouping is stored. The current rank check does not handle this scenario well. I agree this is not ideal. I will work on a fix and release it as soon as possible.
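To make the scenario concrete, here is a minimal Python sketch of how an exact-rank lookup can fail when an intermediate clade is collapsed into a more specific one. Everything in it is an assumption for illustration: the abbreviated lineage, the set of stored clades, and both function names are hypothetical, and the fallback shown is only one possible behaviour, not OMArk's actual code or the planned fix.

```python
# Toy illustration only: the lineage, stored clades and lookup logic are
# assumptions, not OMArk's actual data or implementation.

# Abbreviated taxonomic lineage (root to leaf) with ranks.
FULL_LINEAGE = [
    ("Insecta", "class"),
    ("Hymenoptera", "order"),
    ("Apocrita", "suborder"),
    ("Aculeata", "infraorder"),
    ("Apinae", "subfamily"),
]

# Clades assumed to be stored in the database. Hymenoptera is absent because
# it would contain exactly the same sampled species as Apocrita.
STORED_CLADES = {"Insecta", "Apocrita", "Aculeata", "Apinae"}


def naive_rank_lookup(rank):
    """Return the stored clade whose rank matches exactly, or None."""
    for name, clade_rank in FULL_LINEAGE:
        if clade_rank == rank and name in STORED_CLADES:
            return name
    return None


def tolerant_rank_lookup(rank):
    """Fall back to the nearest stored clade at or below the requested rank."""
    for index, (_, clade_rank) in enumerate(FULL_LINEAGE):
        if clade_rank == rank:
            for candidate, _ in FULL_LINEAGE[index:]:
                if candidate in STORED_CLADES:
                    return candidate
    return None


print(naive_rank_lookup("order"))     # None: the order itself is not stored
print(tolerant_rank_lookup("order"))  # Apocrita: same species set in this toy setup
```

In this toy setup the exact match finds nothing for "order", which mirrors the "too narrow or absent" message, while the tolerant variant resolves to Apocrita.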

In the meantime, for this particular case, you can obtain the same results as if Hymenoptera were the clade of interest by passing the taxid of Apocrita instead of your species' taxid: "-t 7400".

Regarding the number of HOGs, it is expected that there are fewer conserved HOGs than lineage-associated HOGs, as the former are a subset of the latter: only those HOGs found in more than 80% of the species of the clade.
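As a rough illustration of that subset relationship, the following Python sketch filters a toy set of lineage-associated HOGs down to those present in more than 80% of a clade's species. The HOG identifiers, species names, and the `conserved_hogs` function are hypothetical, and the strict "more than" comparison is an assumption based on the wording above, not a copy of OMArk's code.

```python
# Toy data and logic for illustration only; not OMArk's implementation.

def conserved_hogs(hog_to_species, clade_species, threshold=0.8):
    """Keep HOGs present in more than `threshold` of the clade's species."""
    total = len(clade_species)
    return {
        hog
        for hog, species in hog_to_species.items()
        if len(species & clade_species) / total > threshold
    }


# Five hypothetical species making up the clade.
clade_species = {"sp1", "sp2", "sp3", "sp4", "sp5"}

# Hypothetical lineage-associated HOGs and the species they are found in.
lineage_hogs = {
    "HOG_A": {"sp1", "sp2", "sp3", "sp4", "sp5"},  # 5 of 5 species
    "HOG_B": {"sp1", "sp2", "sp3", "sp4"},         # 4 of 5: exactly 80%, not more
    "HOG_C": {"sp1", "sp2"},                       # 2 of 5 species
}

conserved = conserved_hogs(lineage_hogs, clade_species)
print(len(lineage_hogs), "lineage-associated HOGs")
print(len(conserved), "conserved HOGs:", sorted(conserved))
```

Because the conserved set is obtained by filtering the lineage-associated set, it can never be larger, which is why the completeness count is lower than the consistency count.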

Best wishes, Yannis