Closed Feat-FeAR closed 6 months ago
Sorry for taking so long to reply. This is not a trivial thing to fix, in my mind.
Using a seed in bonsai
defeats the purpose of using uuid
s, so I think It would be better to be fixed in make_genesets.py
.
The sorting of the nodes based on their depth is correct, but of course all the same-depth nodes will inevitably end up in random order. This second-order sorting is the issue.
As to how to fix it, I can only think of sorting the nodes based on the content of the node themselves, somehow. Each node content is a list of ENSG
s (with pattern ENSG[0-9]{11}(?:\.[0-9]+)?
), and there (should not be) any duplicates (but it's better to check).
This means that we could sort based on some digest of the content of each node, making new "static" and reproducible IDs (if we start with the same MTP-DB, but the plot should change if the MTP-DB changes), by e.g. hashing with some algorithm.
Other than that, the full path to the node (from the root) should also be unique (emphasis on should), and we could sort based on that. It could also be more stable than the hashing, maybe.
As if there is a best order, I think the question is badly put. The goal of the pruning is to remove redundancy. It already does this in a satisfactory manner: given a similarity score, the unpruned nodes are significantly unique and therefore survive. I think that there is no "best order" for pruning, as the pruning itself is a way to approximate the original tree that has all of the information for "human consumption". If you see pruning as a way to "compress" the tree in a "lossy" way, then I guess there could be a best sorting order. Even if we could compute a score on how well the tree was "compressed", however, we would need to compute all possible (or nearly all possible) pruning orderings and compute the score for each, which would be prohibitively expensive.
I'm attempting to fix this in #34, but as I've said, it might be difficult. I suspect that the process of generating the nodes is somewhat random, resulting in diffs we cannot really fix at the level of pruning.
Even by adding:
The tree is still slightly randomly pruned, buy +- 1 node, as far as I can see. I don't think we can fix this, as I have no idea why this is not working right.
The best thing is to run the analysis once, keep the genesets.json
data static, and re-use it for all plots.
I still think that the meaning of the analysis does not change if the tree is slightly different, but I can see that it might be a problem. In any case, closing this as wontfix
.
Node pruning, as currently performed by
make_genesets.py
, is not deterministic and thus not reproducible. (Try running it many times and see how many nodes you get left if you don't believe it...).This happens with both
bottomup
andtopdown
prune_directions.I am pretty sure that this issue comes from the random IDs assigned by bonsai to nodes, which make the order of comparisons done by
is_similar(node, other_nodes)
unpredictable, given that leaves are sorted by ID...https://github.com/TCP-Lab/transportome_profiler/blob/0ea5270c2cfa65feef72b9d0cc5a3b12666170c2/src/modules/make_genesets.py#L99-L110
I checked, and both
leaves
andother_nodes
lists are randomly sorted. We could set the random number seed in bonsai (I think), but the points are: did this affect prune_direction? If the pruning outcome depends on node sorting, is there a best order for leaves to be sorted before pruning?