chrisquince / DESMAN

De novo Extraction of Strains from MetAgeNomes
Other
69 stars 22 forks source link

error with desman #4

Closed fwhelan closed 7 years ago

fwhelan commented 7 years ago

Good morning,

I'm having an issues with desman which I can't solve:

$ desman ../variants/outputsel_var.csv -e ../variants/outputtran_df.csv -o test -r 1000 -i 100 -g 2 > test.out
Up and running. Check /home/whelanfj/metagenomics/LA-T1-CE-MG-paper/concoct/ray_assembly_read_cov/concoct-output/DESMAN/runDesman/test/log_file.txt for progress
Traceback (most recent call last):
  File "/usr/local/bin/desman", line 5, in <module>
    pkg_resources.run_script('desman==0.1dev', 'desman')
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 492, in run_script
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1350, in run_script
  File "/usr/local/lib/python2.7/dist-packages/desman-0.1dev-py2.7-linux-x86_64.egg/EGG-INFO/scripts/desman", line 249, in <module>
    main(sys.argv[1:])
  File "/usr/local/lib/python2.7/dist-packages/desman-0.1dev-py2.7-linux-x86_64.egg/EGG-INFO/scripts/desman", line 155, in main
    haplo_SNP.update()
  File "/usr/local/lib/python2.7/dist-packages/desman-0.1dev-py2.7-linux-x86_64.egg/desman/HaploSNP_Sampler.py", line 345, in update
    nchange = sampletau.sample_tau(self.tau, self.gamma, self.eta, self.variants)
  File "sampletau.pyx", line 38, in sampletau.sample_tau (sampletau/sampletau.c:1510)
ValueError: Buffer dtype mismatch, expected 'long' but got 'double'
$ less test/log_file.txt 
2016-10-22 15:53:56,118:INFO:root:Results created in /home/whelanfj/metagenomics/LA-T1-CE-MG-paper/concoct/ray_assembly_read_cov/concoct-output/DESMAN/runDesman/test
2016-10-22 15:53:56,118:INFO:root:Set fixed seed for random position selection = 238329
2016-10-22 15:53:56,128:INFO:root:Running Desman with 2 samples and 3344 variant positions finding 2 genomes.
2016-10-22 15:53:56,128:INFO:root:Set eta error transition matrix from = <open file '../variants/outputtran_df.csv', mode 'r' at 0x7f55adca4780>
2016-10-22 15:53:56,129:INFO:root:Selected 1000 random variant positions to infer haplotypes from
2016-10-22 15:53:56,130:INFO:root:Set second adjustable random seed = 23724839
2016-10-22 15:53:56,136:INFO:root:Perform NTF initialisation
2016-10-22 15:53:56,181:INFO:root:NTF Iter 0, div = 1534.034925
2016-10-22 15:53:58,331:INFO:root:NTF Iter 100, div = 70.643808
2016-10-22 15:54:00,459:INFO:root:NTF Iter 200, div = 2.005065
2016-10-22 15:54:02,723:INFO:root:NTF Iter 300, div = 0.823276
2016-10-22 15:54:04,798:INFO:root:NTF Iter 400, div = 0.227905
2016-10-22 15:54:06,874:INFO:root:NTF Iter 500, div = 0.131637
2016-10-22 15:54:08,907:INFO:root:NTF Iter 600, div = 0.091090
2016-10-22 15:54:10,851:INFO:root:NTF Iter 700, div = 0.068861
2016-10-22 15:54:12,952:INFO:root:NTF Iter 800, div = 0.054202
2016-10-22 15:54:14,938:INFO:root:NTF Iter 900, div = 0.043632
2016-10-22 15:54:16,927:INFO:root:NTF Iter 1000, div = 0.035562
2016-10-22 15:54:19,093:INFO:root:NTF Iter 1100, div = 0.029331
2016-10-22 15:54:20,975:INFO:root:NTF Iter 1200, div = 0.024467
2016-10-22 15:54:22,823:INFO:root:NTF Iter 1300, div = 0.021022
2016-10-22 15:54:24,963:INFO:root:NTF Iter 1400, div = 0.017717
2016-10-22 15:54:27,026:INFO:root:NTF Iter 1500, div = 0.014345
2016-10-22 15:54:29,049:INFO:root:NTF Iter 1600, div = 0.012047
2016-10-22 15:54:31,156:INFO:root:NTF Iter 1700, div = 0.010288
2016-10-22 15:54:33,326:INFO:root:NTF Iter 1800, div = 0.008536
2016-10-22 15:54:35,273:INFO:root:NTF Iter 1900, div = 0.007159
2016-10-22 15:54:36,254:INFO:root:Start Gibbs sampler burn-in phase
$ ls -lah test.out
-rw-rw-r-- 1 whelanfj whelanfj 0 Oct 22 15:53 test.out

Much appreciated!

chrisquince commented 7 years ago

Hi Fiona,

That is confusing as the program has started fine and run up until the first Gibbs step. Can you put the input files somewhere where I can download them from so I can try and reproduce this?

Thanks, Chris

fwhelan commented 7 years ago

Hi Chris, I can definitely do that if necessary; however, I've perhaps found an issue further up the pipeline. I'm following the workflow in the readme file for DESMAN and in the step prior to desman, I'm getting an error which I think might be causing the above:

$ python ~/software/DESMAN-master/desman/Variant_Filter.py cluster_esc3_scgs.freq 
/home/whelanfj/software/DESMAN-master/desman/Variant_Filter.py:37: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  new_pvalues = np.zeros(n)

I ignored this because it looked to be a warning, but...

$ head -n 1 cluster_esc3_scgs.freq 
Cog,Position,AIA-A,AIA-C,AIA-G,AIA-T,Beef-A,Beef-C,Beef-G,Beef-T,KVLB-A,KVLB-C,KVLB-G,KVLB-T,MAC-A,MAC-C,MAC-G,MAC-T,McKay-A,McKay-C,McKay-G,McKay-T
$ head -n 1 outputsel_var.csv 
,Position,Beef-A,Beef-C,Beef-G,Beef-T,MAC-A,MAC-C,MAC-G,MAC-T

..this doesn't seem right.

chrisquince commented 7 years ago

Hi Fiona,

No that is not right but the issue is not related to the warning from Variant_Filter.py. You have not successfully created the base frequency file. Did bam-readcount work correctly?

Thanks, Chris

fwhelan commented 7 years ago

The output of bam-readcount looks okay, but let me re-generate and see if I can figure out any differences between the files that worked and the files that didn't.

I'll let you know how it goes- Thanks!

fwhelan commented 7 years ago

I re-did everything from the initial bwa mapping and samtools to bam-readcount and now I only get 1 of 5 samples in my outputsel_var.csv. Puzzled..

Are the base frequency files the .cnt's from bam-readcount? These seem to work fine though I do get warnings in the .err output:

WARNING: In read HWI-1KL153:74:HA43VADXX:1:1201:7365:6558: Couldn't find single-end mapping quality. Check to see if the SM tag is in BAM.

I get these types of warnings for all samples, albeit the size of the .err is smaller for the sample output to outputsel_var.csv compared to the rest. I will admit that these warnings don't make much sense to me; do you think they're contributing to this loss of samples?

Thanks again! --Fiona

fwhelan commented 7 years ago

Or maybe I'm misunderstanding the output of Variant_Filter.py; I assumed that all columns should be present regardless but I'm finding that if I play with the parameters that I get different columns. For example,

$ python ~/software/DESMAN-master/desman/Variant_Filter.py ../cluster_esc3_scgs.freq
$ head -n 1 outputsel_var.csv 
,Position,Beef-A,Beef-C,Beef-G,Beef-T
$rm output*
$ python ~/software/DESMAN-master/desman/Variant_Filter.py ../cluster_esc3_scgs.freq -v 0 -m 0 -q 0 -t 0 -sf 0
$ head -n 1 outputsel_var.csv 
,Position,AIA-A,AIA-C,AIA-G,AIA-T,Beef-A,Beef-C,Beef-G,Beef-T,McKay-A,McKay-C,McKay-G,McKay-T

... I know setting all of these defaults to 0 is silly, but just trying to understand the output.

chrisquince commented 7 years ago

Hi Fiona,

The base frequencies are in the .cnt's perhaps have a look at those to make sure they look sensible? The warnings are nothing to worry about. I should look how to get them switched off though as they can take up a lot of disk space.

The reason columns are appearing as you change parameters is due to the -m flag. This is the minimum mean coverage necessary for a sample to be included. It defaults to 5.0 by setting it to zero you include all samples. You probably do not want to do that but -m 1.0 is reasonable in some situations. I would keep the other parametes at there defaults although you may want to try '-p' to turn on the optimisation in the log-ratio test.

Thanks, Chris

fwhelan commented 7 years ago

Perfect- that makes sense. So we're just back to the original issue then:

$ desman ../variants_yay/outputsel_var.csv -e ../variants_yay/outputtran_df.csv -o cluster_2_0 -r 1000 -i 100 -g 2 -s 0 > cluster_2_0.out
Up and running. Check /home/whelanfj/metagenomics/LA-T1-CE-MG-paper/concoct/ray_assembly_read_cov/concoct-output/DESMAN/run_desman/cluster_2_0/log_file.txt for progress
Traceback (most recent call last):
  File "/usr/local/bin/desman", line 5, in <module>
    pkg_resources.run_script('desman==0.1dev', 'desman')
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 492, in run_script
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1350, in run_script
  File "/usr/local/lib/python2.7/dist-packages/desman-0.1dev-py2.7-linux-x86_64.egg/EGG-INFO/scripts/desman", line 249, in <module>
    main(sys.argv[1:])
  File "/usr/local/lib/python2.7/dist-packages/desman-0.1dev-py2.7-linux-x86_64.egg/EGG-INFO/scripts/desman", line 155, in main
    haplo_SNP.update()
  File "/usr/local/lib/python2.7/dist-packages/desman-0.1dev-py2.7-linux-x86_64.egg/desman/HaploSNP_Sampler.py", line 345, in update
    nchange = sampletau.sample_tau(self.tau, self.gamma, self.eta, self.variants)
  File "sampletau.pyx", line 38, in sampletau.sample_tau (sampletau/sampletau.c:1510)
ValueError: Buffer dtype mismatch, expected 'long' but got 'double'
$ less cluster_2_0.out 
$ less cluster_2_0/log_file.txt 
2016-10-31 09:29:34,248:INFO:root:Results created in /home/whelanfj/metagenomics/LA-T1-CE-MG-paper/concoct/ray_assembly_read_cov/concoct-output/DESMAN/run_desman/cluster_2_0
2016-10-31 09:29:34,248:INFO:root:Set fixed seed for random position selection = 238329
2016-10-31 09:29:34,348:INFO:root:Running Desman with 1 samples and 70841 variant positions finding 2 genomes.
2016-10-31 09:29:34,348:INFO:root:Set eta error transition matrix from = <open file '../variants_yay/outputtran_df.csv', mode 'r' at 0x7f77a68ef6f0>
2016-10-31 09:29:34,350:INFO:root:Selected 1000 random variant positions to infer haplotypes from
2016-10-31 09:29:34,373:INFO:root:Set second adjustable random seed = 0
2016-10-31 09:29:34,379:INFO:root:Perform NTF initialisation
2016-10-31 09:29:34,409:INFO:root:NTF Iter 0, div = 468.255157
2016-10-31 09:29:35,117:INFO:root:Start Gibbs sampler burn-in phase

I could reproduce this with making a subset of the first 10 lines of outputsel_var.csv:

,Position,Beef-A,Beef-C,Beef-G,Beef-T
COG0002,39.0,1.0,47.0,0.0,0.0
COG0002,71.0,54.0,1.0,0.0,0.0
COG0002,105.0,2.0,65.0,0.0,0.0
COG0002,120.0,0.0,73.0,0.0,1.0
COG0002,122.0,71.0,0.0,0.0,1.0
COG0002,155.0,0.0,1.0,71.0,0.0
COG0002,168.0,0.0,70.0,0.0,1.0
COG0002,172.0,1.0,0.0,72.0,0.0
COG0002,182.0,74.0,1.0,0.0,0.0

and outputtran_df.csv:

,0,1,2,3
0,0.999985528518,6.21737752731e-06,3.10868876366e-06,5.14541588467e-06
1,3.87447309856e-06,0.999991497684,9.68618274641e-07,3.65922459309e-06
2,4.20981015888e-06,1.09345718413e-06,0.999990541595,4.15513729968e-06
3,2.50252973114e-06,4.24341997889e-06,4.35222561938e-06,0.999988901825
chrisquince commented 7 years ago

Hi Fiona,

So yes I can what the problem is for some reason your variants file has floats in rather than ints. I can easily fix the code to accept floats and round them to ints but it may be good to figure out why this has happened?

I am more concerned that you only have one sample. Desman is not going to work for this data set.

Thanks, Chris

fwhelan commented 7 years ago

Hi Chris,

This is just a subset of the eventual samples that I can put through, but I am concerned that 4 of my samples are being dropped.

I seem to be getting slightly different outputs from bam-readcounts each time; I just re-ran it with 2 identical mappings of the KVLB sample (no differences in the _cov.txt files ) but I get differences in the resulting .cnts. Re-running VariantFilter now on one of these new .cnt files gives me 3 samples in my outputsel and no floats:

,Position,Beef-A,Beef-C,Beef-G,Beef-T,KVLB-A,KVLB-C,KVLB-G,KVLB-T,MAC-A,MAC-C,MAC-G,MAC-T
COG0002,419,59,0,0,0,414,3,0,0,297,3,0,0
COG0002,436,59,2,0,0,423,11,0,2,295,6,0,0
COG0002,440,0,0,0,58,0,4,0,438,0,2,0,293

I realize this makes me sound a bit crazy... I'm in the midsts of trying to reproduce this now, but if this turns out to not be related to DESMAN at all, I'm going to feel terrible!

fwhelan commented 7 years ago

I cannot reproduce this at all so I assume that this was my own error somewhere along the way. I'll reopen if I ever run into this issue again.

Multiple apologies for dragging you through! Thanks so much for your patience and help.