Non-deterministic (nor reproducible) pruning.

Feat-FeAR commented 8 months ago

Node pruning, as currently performed by make_genesets.py, is not deterministic and thus not reproducible. (Try running it many times and see how many nodes you get left if you don't believe it...).

This happens with both bottomup and topdown prune_directions.

I am pretty sure that this issue comes from the random IDs assigned by bonsai to nodes, which make the order of comparisons done by is_similar(node, other_nodes) unpredictable, given that leaves are sorted by ID...

https://github.com/TCP-Lab/transportome_profiler/blob/0ea5270c2cfa65feef72b9d0cc5a3b12666170c2/src/modules/make_genesets.py#L99-L110

I checked, and both leaves and other_nodes lists are randomly sorted. We could set the random number seed in bonsai (I think), but the points are: did this affect prune_direction? If the pruning outcome depends on node sorting, is there a best order for leaves to be sorted before pruning?

MrHedmad commented 8 months ago

Sorry for taking so long to reply. This is not a trivial thing to fix, in my mind. Using a seed in bonsai defeats the purpose of using uuids, so I think It would be better to be fixed in make_genesets.py.

The sorting of the nodes based on their depth is correct, but of course all the same-depth nodes will inevitably end up in random order. This second-order sorting is the issue.

As to how to fix it, I can only think of sorting the nodes based on the content of the node themselves, somehow. Each node content is a list of ENSGs (with pattern ENSG[0-9]{11}(?:\.[0-9]+)?), and there (should not be) any duplicates (but it's better to check). This means that we could sort based on some digest of the content of each node, making new "static" and reproducible IDs (if we start with the same MTP-DB, but the plot should change if the MTP-DB changes), by e.g. hashing with some algorithm.

Other than that, the full path to the node (from the root) should also be unique (emphasis on should), and we could sort based on that. It could also be more stable than the hashing, maybe.

As if there is a best order, I think the question is badly put. The goal of the pruning is to remove redundancy. It already does this in a satisfactory manner: given a similarity score, the unpruned nodes are significantly unique and therefore survive. I think that there is no "best order" for pruning, as the pruning itself is a way to approximate the original tree that has all of the information for "human consumption". If you see pruning as a way to "compress" the tree in a "lossy" way, then I guess there could be a best sorting order. Even if we could compute a score on how well the tree was "compressed", however, we would need to compute all possible (or nearly all possible) pruning orderings and compute the score for each, which would be prohibitively expensive.

MrHedmad commented 8 months ago

I'm attempting to fix this in #34, but as I've said, it might be difficult. I suspect that the process of generating the nodes is somewhat random, resulting in diffs we cannot really fix at the level of pruning.

MrHedmad commented 6 months ago

Even by adding:

Static seeding;
Sorting of raw data from the DB (by ensg);
Sorting nodes by depth plus node content;
A series of stuff I can't really remember;

The tree is still slightly randomly pruned, buy +- 1 node, as far as I can see. I don't think we can fix this, as I have no idea why this is not working right.

The best thing is to run the analysis once, keep the genesets.json data static, and re-use it for all plots.

I still think that the meaning of the analysis does not change if the tree is slightly different, but I can see that it might be a problem. In any case, closing this as wontfix.

TCP-Lab / transportome_profiler

Non-deterministic (nor reproducible) pruning. #33