ens-sb closed 8 months ago
Dear Botond
Sure, we would be happy to clarify the method.
The subsampling is completely random. You are right that, because of the subsampling, we cannot guarantee that the species overlap inference is always correct. In some cases, a species overlap that would reveal a duplication event is lost in the subsample, so the duplication is missed.
It might be possible to come up with ways to mitigate this, but several gene copies from the same species can be scattered all over the gene tree. In our experience, however, this does not significantly affect overall performance, and the false-positive rate for orthologous pairs remains low.
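To illustrate the point about missed duplications, here is a minimal Python sketch of the species overlap rule combined with random subsampling. It is a toy model, not the tool's actual code: the helper names (`species_of`, `is_duplication`, `subsample`), the `SPECIES|gene` ID convention, and the example proteins are all assumptions made for illustration.

```python
import random

def species_of(protein_id):
    # Hypothetical ID convention for this sketch: "<species>|<gene>".
    return protein_id.split("|")[0]

def is_duplication(left_leaves, right_leaves):
    # Species overlap rule: a node is labelled a duplication if its two
    # child clades share at least one species, otherwise a speciation.
    left = {species_of(p) for p in left_leaves}
    right = {species_of(p) for p in right_leaves}
    return bool(left & right)

def subsample(proteins, max_num_seq, rng):
    # Purely random subsampling down to at most max_num_seq proteins.
    if len(proteins) <= max_num_seq:
        return list(proteins)
    return rng.sample(list(proteins), max_num_seq)

# Two child clades that share exactly one species (HUMAN).
left = ["HUMAN|a1", "MOUSE|a1", "RAT|a1", "CHICK|a1"]
right = ["HUMAN|a2", "DOG|a2", "COW|a2", "PIG|a2"]

# With all proteins present, the overlap is detected.
full_call = is_duplication(left, right)

# After random subsampling, the shared species may be dropped from one
# side, and the same node is then labelled a speciation instead.
rng = random.Random(0)
trials = 1000
missed = sum(
    not is_duplication(subsample(left, 2, rng), subsample(right, 2, rng))
    for _ in range(trials)
)
print(f"duplication detected on full data: {full_call}")
print(f"missed after subsampling: {missed} of {trials} trials")
```

In this toy setup the duplication is only detected when HUMAN survives the subsampling on both sides. Real HOGs usually share many species between paralogous clades, which is consistent with the observation that the effect on overall performance is small.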
For small datasets, an advanced user could deactivate this by setting subsampling_hogclass=False as a hidden option in _config.py. Also, the number of proteins kept per subHOG (a group at a specific taxonomic level) can be controlled with hogclass_max_num_seq. Applying these changes requires reinstalling the package, or having an editable installation (pip install -e) beforehand.
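As a rough sketch, the edited lines in _config.py might look like the following. Only the two option names come from the explanation above; the numeric value is a placeholder chosen for illustration, not the package default, and the rest of the file is omitted.

```python
# _config.py (excerpt; all other settings omitted)

# Turn off random subsampling of proteins at higher taxonomic levels.
# Only advisable for small datasets, since gene trees become larger.
subsampling_hogclass = False

# Maximum number of proteins kept per subHOG when subsampling is on.
# Placeholder value for illustration; check the file for the real default.
hogclass_max_num_seq = 20
```

The edit only takes effect if the package is reinstalled afterwards or was installed in editable mode.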
Best regards, Sina
Dear Sina, thank you very much for the explanation!
Best, Botond
Hello,
I have a question regarding the subsampling strategy used at higher nodes to save on gene tree calculation time: is it purely random, or are other criteria used as well? Also, when subsampling from the child HOGs, is the species overlap criterion used to classify nodes guaranteed to work?
Best, Botond