Question regarding the subsampling strategy

ens-sb commented 8 months ago

Hello,

I have a question regarding the subsampling strategy used at higher nodes to save on gene tree calculation time: is it purely random, or are other criteria used as well? Also, when subsampling from the child HOGs, is the species-overlap criterion used to classify nodes guaranteed to work?

Best, Botond

sinamajidian commented 8 months ago

Dear Botond

Sure, we would be happy to clarify the method.

The subsampling is completely random. You are right that, with subsampling, we cannot guarantee a perfect species-overlap inference. In some cases, species overlap on the subsampled sequences could miss a duplication event. It might be possible to come up with ways to mitigate this, but there could be several gene copies from the same species all over the gene tree. However, in our experience it doesn't significantly affect the overall performance, and the false-positive rate in terms of orthologous pairs stays low. For small datasets, an advanced user could deactivate subsampling by setting subsampling_hogclass=False, a hidden option in _config.py. Also, the number of proteins kept per subHOG (a group at a specific taxonomic level) can be controlled with hogclass_max_num_seq. Applying these changes requires either reinstalling the package or having an editable installation (pip install -e) beforehand.
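
To make the trade-off concrete, here is a minimal sketch of why random subsampling can hide a species overlap. It is not FastOMA's actual code: the function names, the toy proteins, and the cap of 2 sequences per side are all hypothetical, and a gene tree node is simplified to two flat lists of (species, gene_id) pairs.

```python
import random

def subsample_subhog(proteins, max_num_seq, rng):
    """Keep at most max_num_seq proteins, chosen uniformly at random.

    Hypothetical helper; max_num_seq plays the role that
    hogclass_max_num_seq is described as playing in the thread.
    """
    if len(proteins) <= max_num_seq:
        return list(proteins)
    return rng.sample(proteins, max_num_seq)

def species_overlap(left, right):
    """Return the species that occur on both sides of a gene tree node."""
    return {sp for sp, _ in left} & {sp for sp, _ in right}

# Toy data (made up): "human" has a gene copy on each side of the node,
# so the full data would flag this node as a duplication.
left_child = [("human", "h1"), ("mouse", "m1"), ("rat", "r1"), ("dog", "d1")]
right_child = [("human", "h2"), ("cat", "c1"), ("cow", "w1"), ("pig", "p1")]

full_overlap = species_overlap(left_child, right_child)

# Repeat the random subsampling and count how often the overlap disappears.
rng = random.Random(0)
trials = 1000
missed = sum(
    1
    for _ in range(trials)
    if not species_overlap(
        subsample_subhog(left_child, 2, rng),
        subsample_subhog(right_child, 2, rng),
    )
)

print("overlap on full data:", full_overlap)
print("trials where subsampling hid the overlap:", missed, "of", trials)
```

In this toy setup the overlap is only visible when both human copies survive the subsampling, so it is missed in a large share of trials. Real HOGs are subsampled far less aggressively, which is consistent with the small practical effect described above.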

Best regards, Sina

ens-sb commented 8 months ago

Dear Sina, thank you very much for the explanation!

Best, Botond

DessimozLab / FastOMA
