balabanmetin / uDance

BSD 3-Clause "New" or "Revised" License

Running the same data twice gives different results. #12

Open nimua opened 2 months ago

nimua commented 2 months ago

Hi, when running uDance multiple times on the same data, I found that the results are generally stable, but some local regions of the tree come out with different topologies. Is there any way to make the results stable in de-novo mode?

nimua commented 2 months ago

For example, if you run the datasmall dataset twice, the two resulting topologies differ.

balabanmetin commented 2 months ago

This is an interesting observation. uDance uses random number seeds wherever possible, but there might be some steps in the pipeline where I didn't follow this principle. I'd like some help finding out which step causes the non-determinism. If you still have the files from the two runs, can you diff the following files?

  1. backbone/0/species.txt
  2. backbone.nwk
  3. placement.jplace
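
The comparison requested above could be scripted roughly as follows. This is only a sketch: the function name `compare_runs` and the directory names `run1` and `run2` are hypothetical, and it assumes the three paths are relative to each run's output directory, which the thread does not confirm.

```python
import difflib
from pathlib import Path

# The three files named in the comment above, assumed to be relative
# to the output directory of each run (an assumption, not confirmed).
FILES = ["backbone/0/species.txt", "backbone.nwk", "placement.jplace"]


def compare_runs(run_a, run_b, files=FILES, max_lines=20):
    """Diff each file between two run directories.

    Returns a dict mapping each relative path to a list of unified-diff
    lines (truncated to max_lines). An empty list means the two copies
    have identical content.
    """
    report = {}
    for rel in files:
        a = (Path(run_a) / rel).read_text().splitlines()
        b = (Path(run_b) / rel).read_text().splitlines()
        diff = list(
            difflib.unified_diff(
                a, b, f"{run_a}/{rel}", f"{run_b}/{rel}", lineterm=""
            )
        )
        report[rel] = diff[:max_lines]
    return report


# Hypothetical usage, with run1/ and run2/ holding the two outputs:
#   for path, diff in compare_runs("run1", "run2").items():
#       print(path, "same" if not diff else "different")
```

The first file in the list that differs points to the earliest pipeline step where the two runs diverge.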

nimua commented 2 months ago

I think this isn't an isolated case. I installed uDance on one machine, ran it with default parameters, and got different results across multiple runs. I also ran uDance in Docker, with default parameters and datasmall as input, and again got different results across multiple runs. The results may look similar at first glance, with the same number of taxa, but they differ in topological details.

I checked these files:

  1. backbone/0/species.txt: same
  2. backbone.nwk: slightly different
  3. placement.jplace: different
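
A textual diff of two Newick files can report differences that are not topological, such as a different child order or different branch lengths. One way to check whether the backbone trees really differ in topology is to compare their bipartitions (the Robinson-Foulds distance). The sketch below is a minimal pure-Python version with hypothetical function names; it assumes each file holds a single Newick tree with unique, unquoted leaf names, and a dedicated phylogenetics library would be the more robust choice in practice.

```python
import re


def leaf_sets(newick):
    """Parse a Newick string; return (all leaves, clades).

    Each clade is the frozenset of leaf names below one internal node.
    Branch lengths and internal labels are ignored. Quoted labels are
    not supported (an assumption of this sketch).
    """
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    stack, clades, leaves, prev = [], set(), [], None
    for tok in tokens:
        if not tok.strip():
            continue  # skip whitespace between tokens
        if tok == "(":
            stack.append([])
        elif tok == ")":
            done = stack.pop()
            clades.add(frozenset(done))
            if stack:
                stack[-1].extend(done)
        elif tok not in ",;" and prev in ("(", ","):
            # A leaf label, possibly followed by ":branch_length"
            name = tok.split(":")[0].strip()
            leaves.append(name)
            stack[-1].append(name)
        # Any other token is an internal label or length: ignored.
        prev = tok
    return frozenset(leaves), clades


def splits(newick):
    """Return (leaves, non-trivial bipartitions) of the unrooted tree."""
    leaves, clades = leaf_sets(newick)
    ref = min(leaves)  # fixed reference leaf makes each split canonical
    out = set()
    for clade in clades:
        side = leaves - clade if ref in clade else clade
        if 2 <= len(side) <= len(leaves) - 2:
            out.add(side)
    return leaves, out


def rf_distance(nwk_a, nwk_b):
    """Count the bipartitions found in exactly one of the two trees."""
    leaves_a, splits_a = splits(nwk_a)
    leaves_b, splits_b = splits(nwk_b)
    if leaves_a != leaves_b:
        raise ValueError("trees have different leaf sets")
    return len(splits_a ^ splits_b)
```

A distance of 0 means the two trees have the same unrooted topology even if the files differ as text; any positive value confirms a real topological difference.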

nimua commented 2 months ago

diff_backbone.nwk.txt diff_placement.jplace.txt