KeyError: 13 when running DESMAN

alexcritschristoph commented 6 years ago

Hi Chris and other DESMAN devs,

I have been obtaining the error below when running DESMAN. Here's an example of how I'm running it:

desman ../outputsel_var.csv -e ../outputtran_df.csv -o ClusterEC_2_1 -r 1000 -i 100 -g 2 -s 1

And attached are my variant output files. Any idea what gives? outputtran_df.txt outputsel_var.txt

PS - What exactly is the outputtran_df file anyways?

Traceback (most recent call last):
  File "/home/alexcc/.pyenv/versions/2.7/bin/desman", line 4, in <module>
    __import__('pkg_resources').run_script('desman==0.1.dev0', 'desman')
  File "/home/alexcc/.pyenv/versions/2.7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 750, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/alexcc/.pyenv/versions/2.7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1527, in run_script
    exec(code, namespace, namespace)
  File "/home/alexcc/.pyenv/versions/2.7/lib/python2.7/site-packages/desman-0.1.dev0-py2.7-linux-x86_64.egg/EGG-INFO/scripts/desman", line 250, in <module>
    main(sys.argv[1:])
  File "/home/alexcc/.pyenv/versions/2.7/lib/python2.7/site-packages/desman-0.1.dev0-py2.7-linux-x86_64.egg/EGG-INFO/scripts/desman", line 165, in main
    output_Results.set_Variant_Filter(variant_Filter)
  File "/home/alexcc/.pyenv/versions/2.7/lib/python2.7/site-packages/desman-0.1.dev0-py2.7-linux-x86_64.egg/desman/Output_Results.py", line 61, in set_Variant_Filter
    self.filtered_position.append(self.position[i])
  File "/home/alexcc/.pyenv/versions/2.7/lib/python2.7/site-packages/pandas/core/series.py", line 623, in __getitem__
    result = self.index.get_value(self, key)
  File "/home/alexcc/.pyenv/versions/2.7/lib/python2.7/site-packages/pandas/core/indexes/base.py", line 2560, in get_value
    tz=getattr(series.dtype, 'tz', None))
  File "pandas/_libs/index.pyx", line 83, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 91, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 134, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 160, in pandas._libs.index.IndexEngine._get_loc_duplicates
  File "pandas/_libs/index_class_helper.pxi", line 168, in pandas._libs.index.Int64Engine._maybe_get_bool_indexer
KeyError: 13

alexcritschristoph commented 6 years ago

I have fixed this problem by changing line 61 of Output_Results.py to: self.filtered_position.append(self.position[i])

and line 141 to: original_position.append(full_position[i])

This seems like a critical bug and I think it could even possibly result in inaccurate results without error messages in some circumstances. Is it due to recent changes in Pandas?

chrisquince commented 6 years ago

Sorry Alex, I am a bit confused those lines look like the original code?

Regarding your original question, the file tran_df file is the starting estimate for the error matrix, yours look sensible as does your variant file. Are you still having problems?

alexcritschristoph commented 6 years ago

Hi Chris, Thanks for the reply! I think I was working with an out-dated copy of the repo. Updating to the latest and I see that those changes are already there. Things work smoothly now.

Everything else has gone smoothly for me - curious about what you think of my results below - I'm working with 38 samples at much lower coverages (min coverage of 5x, median coverage of 15x, max of 50x across samples). The Haplotype inference script says the best fit for this is 4 strains, but I noticed this graph looks quite different from your example in the documentation. Any other metrics I can rely on to get a feel for the quality of this strain # inference? proteo.pdf

alexcritschristoph commented 6 years ago

Oh wait! I realize I still did make changes, but just incorrectly described them above. I wanted to say that I changed line 61 to:

self.filtered_position.append(list(self.position)[i])

and line 141 to:

original_position.append(list(full_position)[i])

The cast to a list seemed important because I think the pandas object that full_position is cannot access items by index (instead they can be accessed by some kind of row name) as is being attempted here. I think this might have to do with how your outputsel_var.csv header is formatted but I tried several formats and always ended up with the same error above before making these changes.

chrisquince commented 6 years ago

Thanks for this, I think it may be to do with a pandas update. I will look into it and change the code accordingly.

Regarding your posterior deviance, there is one run with G = 2 that is a big outlier that makes it look odd. That is not necessarily a problem, just reflects a run that got stuck in a very suboptimal solution. Since you use G = 4 it does not matter.

alexcritschristoph commented 6 years ago

OK! I guess the question that really circulates in my mind is whether or not there is convincing evidence for N=4 strains in my data, compared to a null hypothesis of 1 strain. I guess this is what the resolve haplotype script does, but I'm wondering if there's is a statistic or visuals (e.g., a PCA of the filtered variants' frequencies showing strong separation between variants labeled as haplotypes) so I can get a sense of how convincing the N strains hypothesis is.

One last thing - I know that DESMAN reports the haplotype assigned to each variant for each run and the abundances of that haplotype in each sample, but I'm having trouble figuring out where this data is in the output of each run.

Those are all of the questions I have - but if you'd like, we could move this discussion somewhere away from this Issue so it focuses on the possible bug above.

Thanks! Alex

chrisquince commented 6 years ago

Hi Alex, Sorry for the slow response again, was in Pakistan. Yes this could definitely be a new thread but just briefly. In my experience the resolvenhap script is actually quite conservative. If it reports multiple strains they are probably real but I agree some visualisation would be nice. You could do a SNP PCA based on relative frequencies across samples and then colour by haplotype but then you will need mixed colours for SNPs that are assigned to multiple samples. Actually a heat map on minor variant frequency would do the same thing. One thing that I definitely need to add is an indication of which SNPs are well explained by the model.

These files are contained in the output directory of the selected run. The relative frequencies are in the Gamma files and the haplotypes in the tau files, I tend to use the star files which are the MAP predictions.

Best Wishes, Chris

chrisquince / DESMAN

KeyError: 13 when running DESMAN #20