brettc / partitionfinder

PartitionFinder discovers optimal partitioning schemes for DNA sequences.
Other
61 stars 44 forks source link

bug in fabricating subsets #61

Closed roblanf closed 9 years ago

roblanf commented 9 years ago

Karen M found this issue with a dataset (which she provided by email so we can debug):

Something to do with fabricating subsets for the kmeans algorithm.

Traceback (most recent call last):
  File "/home/meu00c/CSIRO/Programs/PF/partitionfinder-develop/PartitionFinder.py", line 23, in <module>
    sys.exit(main.main("PartitionFinder", "DNA"))
  File "/home/meu00c/CSIRO/Programs/PF/partitionfinder-develop/partfinder/main.py", line 370, in main
    run_analysis(cfg, options)
  File "/home/meu00c/CSIRO/Programs/PF/partitionfinder-develop/partfinder/main.py", line 306, in run_analysis
    results = anal.analyse()
  File "/home/meu00c/CSIRO/Programs/PF/partitionfinder-develop/partfinder/analysis.py", line 91, in analyse
    self.do_analysis()
  File "/home/meu00c/CSIRO/Programs/PF/partitionfinder-develop/partfinder/logtools.py", line 159, in indented_fn
    fn(*args, **kwargs)
  File "/home/meu00c/CSIRO/Programs/PF/partitionfinder-develop/partfinder/analysis_method.py", line 1003, in do_analysis
    final_scheme = self.finalise_fabrication(start_subsets, step)
  File "/home/meu00c/CSIRO/Programs/PF/partitionfinder-develop/partfinder/analysis_method.py", line 797, in finalise_fabrication
    euclid_dist = spatial.distance.pdist(centroid_array)
  File "/home/meu00c/anaconda/lib/python2.7/site-packages/scipy/spatial/distance.py", line 1178, in pdist
    [X] = _copy_arrays_if_base_present([_convert_to_double(X)])
  File "/home/meu00c/anaconda/lib/python2.7/site-packages/scipy/spatial/distance.py", line 113, in _convert_to_double
    X = X.astype(np.double)
ValueError: setting an array element with a sequence.
roblanf commented 9 years ago

Update. I couldn't reproduce the error on my machine. However, when I re-ran it with an extra model (GTR+I+G), I did get the bug.

My suspicion is that this is a rare bug, that crops up occasionally when we try and merge subsets. Because we don't have a random number seed, it's hard to reproduce this one.

roblanf commented 9 years ago

The best I can think of doing for now is this.

We fall over when we call spatial.distance.pdist

So for now I'll just run a heap of analyses and check what we pass to that function, i.e.:

                    centroid_array = [sub.centroid, centroid]
                    print centroid_array
                    euclid_dist = spatial.distance.pdist(centroid_array)

If I can figure out when it fails, that's a good start.

roblanf commented 9 years ago

Here's a centroid array that gives us the error. Note the 'none' in the final line:

[[-0.58612193945471691], [24.799193535274632]]
[[-1.0], [24.799193535274632]]
[[-0.49735338396332335], [24.799193535274632]]
[[1.5793039593201781], [24.799193535274632]]
[[2.8262365442513655], [24.799193535274632]]
[[1.3410727258731898], [24.799193535274632]]
[[-0.48584979311026438], [24.799193535274632]]
[[0.63550764497753109], [24.799193535274632]]
[[-1.289706691277938], [24.799193535274632]]
[[0.49178103329441003], [24.799193535274632]]
[[-1.6716222259469751], [24.799193535274632]]
[[0.89895835912967736], [24.799193535274632]]
[[-1.0591706520859276], [24.799193535274632]]
[[0.60825455014161212], [24.799193535274632]]
[[-1.4305875173721956], [24.799193535274632]]
[[1.2801813603340069], [24.799193535274632]]
[[-0.4435925348804895], [24.799193535274632]]
[[2.1026286153331615], [24.799193535274632]]
[[-0.99097354790359327], [24.799193535274632]]
[[-1.1640002715622064], [24.799193535274632]]
[[0.53513758876528017], [24.799193535274632]]
[[-1.497246657928202], [24.799193535274632]]
[[-1.1499609177607493], [24.799193535274632]]
[[2.3772972659886955], [24.799193535274632]]
[[-0.94962884247268708], [24.799193535274632]]
[[1.5653105589129743], [24.799193535274632]]
[[-0.81035569108688943], [24.799193535274632]]
[[0.84560888743079909], [24.799193535274632]]
[[-1.0679285949243082], [24.799193535274632]]
[[0.94612330396208377], [24.799193535274632]]
[[-0.68423624534251248], [24.799193535274632]]
[[1.0195624394520091], [24.799193535274632]]
[[-0.73580072491196113], [24.799193535274632]]
[[-0.54455821167300011], [24.799193535274632]]
[[1.1782259488924929], [24.799193535274632]]
[[0.47522895807972493], [24.799193535274632]]
[[-1.6439951768570358], [24.799193535274632]]
[[-1.0], [24.799193535274632]]
[[3.7723691800736461], [24.799193535274632]]
[[-2.2604494792305467], [24.799193535274632]]
[[1.0], [24.799193535274632]]
[[-1.0], [24.799193535274632]]
[[0.94706680735452753], [24.799193535274632]]
[[0.0], [24.799193535274632]]
[[-1.0], [24.799193535274632]]
[[1.2376044012104674], [24.799193535274632]]
[[1.4197734869510765], [24.799193535274632]]
[[0.53162310582542915], [24.799193535274632]]
[[-1.4101723561215025], [24.799193535274632]]
[[0.62326136233518692], [24.799193535274632]]
[[-0.5523953801949284], [24.799193535274632]]
[[1.3871653956332695], [24.799193535274632]]
[[1.6310904225101077], [24.799193535274632]]
[[-0.57117606853250924], [24.799193535274632]]
[[-0.76975326820651457], [24.799193535274632]]
[[1.1375999627475424], [24.799193535274632]]
[[2.5961232616147978], [24.799193535274632]]
[[-0.3284632837080077], [24.799193535274632]]
[[-0.91963666180302683], [24.799193535274632]]
[[0.96174888665175717], [24.799193535274632]]
[[-1.0], [24.799193535274632]]
[[1.0], [24.799193535274632]]
[[-0.46231912335421083], [24.799193535274632]]
[[-1.7861645722464548], [24.799193535274632]]
[[-0.72838502555423146], [24.799193535274632]]
[[3.5622677593348477], [24.799193535274632]]
[[1.0], [24.799193535274632]]
[[1.0], [24.799193535274632]]
[[-2.5873091453997241], [24.799193535274632]]
[[1.1613644725429884], [24.799193535274632]]
[[0.48438387235931279], [24.799193535274632]]
[[-1.6007122669354434], [24.799193535274632]]
[[-1.0], [24.799193535274632]]
[[1.0], [24.799193535274632]]
[[5.58271522052046], [24.799193535274632]]
[[-4.5208089242972731], [24.799193535274632]]
[[-2.7619309003637462], [24.799193535274632]]
[[0.0], [24.799193535274632]]
[[-1.0], [24.799193535274632]]
[[-0.58658587817688657], [24.799193535274632]]
[[-0.51256881627450379], [24.799193535274632]]
[[1.6493551018733847], [24.799193535274632]]
[[-2.4552791368742377], [24.799193535274632]]
[[-0.87253937117727964], [24.799193535274632]]
[[1.2911918531743534], [24.799193535274632]]
[[-0.64084889771511067], [24.799193535274632]]
[None, [24.799193535274632]]
roblanf commented 9 years ago

So that 'None', refers to a subset with a centroid that's None. I'm getting closer.

roblanf commented 9 years ago

OK, so now I've tracked it down to one massive subset, whose centroid is 'None'.

The subset has 185K sites (!), the first few of which are: 0, 1, 2, 3, 4, 5, 9, 10, 15, 16, 17, 25, 30, 33, 34, 41, 42, 44, 45, 46, 47, 48, 49, 52

What's going on here is that these are sites with no variation. They either have all constant sites, or constant sites and some gaps, or all gaps.

So, I think a sensible workaround here is to just put in a fix where if a subset's centroid is 'None', we call it zero and move on.

Any thoughts @pbfrandsen

roblanf commented 9 years ago

fixed in https://github.com/brettc/partitionfinder/commit/ee7b8c1457d02d03e23b64915750833752a69fd2

pbfrandsen commented 9 years ago

Hi Rob,

This looks like the right solution to me. I just encountered this error on a dataset that I was running overnight, so nice fix!

Paul

On Wednesday, August 12, 2015, roblanf notifications@github.com wrote:

OK, so now I've tracked it down to one massive subset, whose centroid is 'None'.

The subset has 185K sites (!), the first few of which are: 0, 1, 2, 3, 4, 5, 9, 10, 15, 16, 17, 25, 30, 33, 34, 41, 42, 44, 45, 46, 47, 48, 49, 52

What's going on here is that these are sites with no variation. They either have all constant sites, or constant sites and some gaps, or all gaps.

So, I think a sensible workaround here is to just put in a fix where if a subset's centroid is 'None', we call it zero and move on.

Any thoughts @pbfrandsen https://github.com/pbfrandsen

— Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/61#issuecomment-130277067 .

Sent from my iPad