chrisquince / DESMAN

De novo Extraction of Strains from MetAgeNomes
Other
69 stars 22 forks source link

Variant_Filter.py error #17

Closed nicoTR closed 7 years ago

nicoTR commented 7 years ago

Hi Chris,

I have generated the Cluster_scgs.freq file (file size 4109754) and when I tried the variant_filter.py script I got this error, do you have an idea of what is going on? N is null? Thanks for you help on that!

Cheers,

Nico

_~/DESMAN-master/desman/Variant_Filter.py:340: RuntimeWarning: invalid value encountered in divide p = n/N ~/DESMAN-master/desman/Variant_Filter.py:341: RuntimeWarning: invalid value encountered in greater p[p > self.upperP] = self.upperP ~/DESMAN-master/desman/Variant_Filter.py:356: RuntimeWarning: invalid value encountered in less self.filtered = np.logical_or(N < self.Nthreshold,ratioNLL < self.threshold) /usr/local/lib/python2.7/dist-packages/scipy/stats/_distn_infrastructure.py:875: RuntimeWarning: invalid value encountered in greater return (self.a < x) & (x < self.b) /usr/local/lib/python2.7/dist-packages/scipy/stats/_distn_infrastructure.py:875: RuntimeWarning: invalid value encountered in less return (self.a < x) & (x < self.b) /usr/local/lib/python2.7/dist-packages/scipy/stats/_distn_infrastructure.py:1731: RuntimeWarning: invalid value encountered in greater_equal cond2 = (x >= self.b) & cond0 ~/DESMAN-master/desman/Variant_Filter.py:379: RuntimeWarning: invalid value encountered in greater self.filtered = np.logical_or(N < self.Nthreshold,self.qvalue > self.qvalue_cutoff)

chrisquince commented 7 years ago

Hi Nico,

Has the Cluster_scgs.freq file been properly generated? This should have base frequencies for each gene and each position in each sample. Can you attach the first 10,000 lines from it and I can try running?

Best, Chris

nicoTR commented 7 years ago

Cluster_MC_scgs.freq.zip

Hi Chris,

Here we go, I have no idea if the file was properly generated...

Cheers,

Nico

nicoTR commented 7 years ago

Cluster_MC_scgs_2.freq.zip

And I got a similar error with that one (another study).

chrisquince commented 7 years ago

Hi Nico,

The first file has not been generated properly. Have you tried the new script in the updated tutorial?

ExtractCountFreqGenes.py

It is more robust but it does need the count files zipped up though.

You second file looks fine though and was able to run Variant_Filter.py obtaining 246 variants. Are you using the current up to date of DESMAN I did fix some bugs last month.

Best, Chris

chrisquince commented 7 years ago

Ps to be clear they need to be gzipped.

nicoTR commented 7 years ago

Hi Chris,

Thanks! it worked ! The output of the heuristic approach gave me: 5,1,2,0.296568627451,ClusterMC_5_2/Filtered_Tau_star.csv. Do you know if there is a way to extract the core cogs seq for each of these haplotype (to make a phylogenetic tree for example)?

Another quick question, If you have "species" that have a very similar (>95%) genomes within a genus, I wonder if that could induce some issue with Concoct? I know that for the mapping it could be problematic.

Thanks a lot for your help and advice!

Nico

chrisquince commented 7 years ago

I am afraid not Nico, that output is actually predicting just one haplotype that we can be confident in, even though that was in the run with 5 haplotypes. Is this from the second data file? I will run it and check.

If the 'species' are different enough to be assembled onto different contigs then Concoct may still work assuming they are also differentially abundant. Mapping will be hard though.

Chris

nicoTR commented 7 years ago

Cluster_MC_scgs.freq.zip

Dev.csv.zip

Hey Chris,

Well no, it is from the 1st one where I used the last version of the different scripts (this time I did not get any errors, so I assume it worked).

Nico

chrisquince commented 7 years ago

Hi Nico,

Well the bad news is that I am not confident that you have multiple strains in the first genome. That organism is not very abundant across all samples so I had to reduce the minimum coverage threshold in Variant_Filter:

python /mnt/gpfs/chris/repos/DESMAN/desman/Variant_Filter.py Cluster_MC_scgs.freq -p -o Cluster_MC_scgs -m 1.0

Doing this predicted 138 variant positions. I then ran DESMAN and got the following result from resolvenhap with default parameters:

1,1,7,0.0014598540146,Cluster_MC_1_7/Filtered_Tau_star.csv

That means just one haplotype but if we raise the accuracy threshold to 15% for calling a strain then...

4,3,10,0.10097323601,Cluster_MC_4_10/Filtered_Tau_star.csv

i.e. 3 strains with error rates between 10-15%, so they may be valid they may not.

However, from your other sample, using default minimum coverage of 5 we confidently predict 4 haplotypes, personally I would focus on this organism:

4,4,10,0.0284693877551,Cluster_MC2_4_10/Filtered_Tau_star.csv

Tomorrow I will point you in the direction of some code to build a phylogenetic tree for it. I will post up the results for you too. I am just trying this organisms with a minimum coverage of 1,0 as well since that will give you a bit more info.

nicoTR commented 7 years ago

Hey Chris,

Thanks a lot! Well I think I should remove the samples with low abundance for these 'species' I guess. That would improve for sure the results as you showed. I'll try to do that for the 1st dataset.

For the second, that's funny cause I got a diff result (without playing with the parameter). I got 2,2,2,0.00994318181818. I'll redo it just in case!

Thanks for the code too!

Nico

chrisquince commented 7 years ago

Actually that only added one extra sample, so you can download my runs here:

https://nicotest.s3.climb.ac.uk/nicoTest.tar.gz

sorry it is a bit of a mess, the runs of sample 2 are in the top level dir and then sample 1 is in the subdir Cluster_MC1

so in the top level dir run...

python $DESMAN/scripts/resolvenhap.py Cluster_MC2

to get...

4,4,10,0.0284693877551,Cluster_MC2_4_10/Filtered_Tau_star.csv

chrisquince commented 7 years ago

The difference with your runs may be that I did extra iterations and ran 10 replicates. Look at my DESMAN.sh script in the top level dir...

for g in 1 2 3 4 5 6 7 8; do
for r in 0 1 2 3 4 5 6 7 8 9 10; do
desman Cluster_MC_scgs_2sel_var.csv -e Cluster_MC_scgs_2tran_df.csv -o ClusterMC2${g}${r} -i 1000 -g $g -s $r > ClusterMC2${g}_${r}.out&
done; done

That is is probably advisable on real data.

nicoTR commented 7 years ago

Ok will redo that with more replicates and removing low abundant samples! Thanks Chris!

chrisquince commented 7 years ago

Just to be clear I was actually adding low abundance samples the default minimum coverage for a sample is 5.0. I am suggesting running:

-m 1.0

to reduce it one. Actually try that with desman too (I forget that) it may give you better results for the first data set. Remind me if I don't send a link to reconstructing the gene sequences tomorrow.

Best, Chris

nicoTR commented 7 years ago

I have a last question for you Chris, what do you think about these scores for Concoct output, pretty low I think. I was not sure the recall score was enough good. What do you think?

1st analysis

N M TL S K Rec. Prec. NMI Rand AdjRand 218721 3227 1.4142e+07 43 368 0.606240 0.984284 0.729601 0.847015 0.500359

2nd analysis

N M TL S K Rec. Prec. NMI Rand AdjRand 104823 2286 9.8707e+06 29 237 0.729150 0.962161 0.748194 0.853822 0.618765

chrisquince commented 7 years ago

The issue is that only a fraction of the contigs are classified. It is more useful in this case to look at the number of complete genomes you have from the SCG frequencies. Actually I have a complete CONCOCT/DESMAN here. It is not quite finished yet but it has the SCG evaluation...

https://github.com/chrisquince/StrainMetaSim

ntromas commented 7 years ago

Hey!

Oh ok I though the fact to work with higher coverage would give more confidence (about sample with low abundance). For the concoct output, ok I will do that, thanks a lot for all the advices!

nicoTR commented 7 years ago

Hi Chris,

How was the day? I am trying the StrainMetaSim (at least the part to make the SCG evaluation) but I don't find the Sort.pl script in CONCOCT... Whenever you can, I would be super interested by your code for phylogenetic tree, thanks a lot!!!

Cheers,

Nico

chrisquince commented 7 years ago

Hi Nico,

Busy as always. Trying to find the time for this now. Did you install the SpeedUp_Mp branch of CONCOCT i.e:

git clone git@github.com:BinPro/CONCOCT.git cd CONCOCT git fetch git checkout SpeedUp_Mp sudo python ./setup.py install

That should have the Sort.pl script. I am working on adding code to get core gene sequences to the StrainMetaSim repo right now.

Best, Chris

ntromas commented 7 years ago

Hi Chris,

Yes I can imagine, sorry to bother you with that too! You plan to go to STAMPs this year?

I have installed the SpeedUp_Mp branch and get the Sort.pl script. Thanks!

Just got this error after concoct_refine:

concoct_refine clustering_gt1000_R.csv original_data_gt1000.csv clustering_gt1000_scg_sort.csv > concoct_ref.outTraceback (most recent call last):

File "/usr/local/bin/concoct_refine", line 4, in import('pkg_resources').run_script('concoct==0.4.1', 'concoct_refine') File "/usr/local/lib/python2.7/dist-packages/pkg_resources/init.py", line 739, in run_script self.require(requires)[0].run_script(script_name, ns) File "/usr/local/lib/python2.7/dist-packages/pkg_resources/init.py", line 1500, in run_script exec(code, namespace, namespace) File "/usr/local/lib/python2.7/dist-packages/concoct-0.4.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/concoct_refine", line 85, in main(sys.argv[1:]) File "/usr/local/lib/python2.7/dist-packages/concoct-0.4.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/concoct_refine", line 63, in main slice_k = original_data_matrix[select[:,0],:] IndexError: boolean index did not match indexed array along dimension 0; dimension is 218721 but corresponding boolean dimension is 218720

Any idea? Thanks again for your help on this!

clustering_gt1000.zip

chrisquince commented 7 years ago

Try...

concoct_refine clustering_gt1000.csv original_data_gt1000.csv clustering_gt1000_scg_sort.csv

Sure will you be there? It would be nice to put a face to a very long github conversation!

ntromas commented 7 years ago

Would have been great, but I will sample my favorite lake all the summer!

I tried and got: concoct_refine clustering_gt1000.csv original_data_gt1000.csv clustering_gt1000_scg_sort.csv Run CONCOCT for 18with 4.0clusters using 1threads 1063 34 -1201124664 Setting -1201124664 OMP threads Generate input data 0,-54551.614414,52.013563 1,-54535.015938,16.598476 2,-54536.088188,1.072250 3,-54534.873889,1.214299 4,-54530.255643,4.618246 5,-54530.250214,0.005429 6,-54530.250267,0.000053 Error in `/usr/bin/python': double free or corruption (!prev): 0x0000000005401570 (100 lines like ======= Backtrace: ========= /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fb6ba5127e5] /lib/x86_64-linux-gnu/libc.so.6(+0x7fe0a)[0x7fb6ba51ae0a] /lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fb6ba51e98c] /usr/local/lib/python2.7/dist-packages/concoct-0.4.1-py2.7-linux-x86_64.egg/vbgmm.so(destroyCluster+0x21)[0x7fb6b8686cf1] /usr/local/lib/python2.7/dist-packages/concoct-0.4.1-py2.7-linux-x86_64.egg/vbgmm.so(driverMP+0x4c4)[0x7fb6b868ca84] /usr/local/lib/python2.7/dist-packages/concoct-0.4.1-py2.7-linux-x86_64.egg/vbgmm.so(c_vbgmm_fit+0x35)[0x7fb6b868cc85] /usr/local/lib/python2.7/dist-packages/concoct-0.4.1-py2.7-linux-x86_64.egg/vbgmm.so(+0x7f6d)[0x7fb6b8683f6d]

======= Memory map: ======== 00400000-006ea000 r-xp 00000000 08:41 10354700 /usr/bin/python2.7 008e9000-008eb000 r--p 002e9000 08:41 10354700 /usr/bin/python2.7 008eb000-00962000 rw-p 002eb000 08:41 10354700 /usr/bin/python2.7

...

And the last word:

"Aborted"

Memory issue?

Nico

chrisquince commented 7 years ago

Oh dear, you can see the problem is garbled variables passed from python into C. It is just possible that you need to clean out the previous install of CONCOCT but I do not know how to do that. If you post up the three files for concoct_refine then I can test them on my implementation.

chrisquince commented 7 years ago

ps Where is your favourite lake?

chrisquince commented 7 years ago

OK I have added a new script to the repo that gets core gene sequences. It is untested as yet, I will do that tomorrow. Run as follows:

python $DESMAN/scripts/GetVariantsCore.py ../../Annotate/final_contigs_gt1000_c10K.fa Cluster85_core.cogs Filtered_Tau_star.csv coregenes.txt

where ../../Annotate/final_contigs_gt1000_c10K.fa are the contig sequences the core genes were called on.

Cluster85_core.cogs

Are core gene annotations with COGs in format:

COG0016,k141_47503,1599,2616,k141_47503_5,1

Filtered_Tau_star.csv are the DESMAN variant predictions

coregenes.txt is a list of core genes to generate the sequences of

Hope that is clear, I will add a use case to StrainMetaSim tomorrow and test them out, you can reverse sequences ('-r') which are reversed. You will need if you want to compare to reference sequences

Best, Chris

nicoTR commented 7 years ago

Hey!

Lake Champlain :) But we will finally meet next year for sure (in Exeter!).

Fantastic for the script thanks! Will try that!

Ok, I can try for the 3 files but one is pretty big so no idea if via GitHub there is any limit. If that's the case I will send you the file to your mail. Thanks!

Cheers,

Nico

chrisquince commented 7 years ago

Oh yes of course my apologies, it is hard sometimes to figure out who people are from github logins!

I used to spend a lot of time on Lake Opeongo all those Canadians lakes are lovely.

Tomorrow, I will track down that CONCOCT bug and then we can close this huge thread!

ntromas commented 7 years ago

Well I did not present myself properly so my apologies! You are more than welcome if you want to visit us in Mtr after STAMPs. Champlain Lake will probably full of 'wonderful' cyano. The bug I think is linked with Numpy...

Nico

chrisquince commented 7 years ago

Hi Nico,

I love Cyano I just did a paper on them (or at least other oceanic nitrogen fixers) with Tom and Meren. Unfortunately my flight is already booked but next year definitely!

I have pushed a new version of that branch, that may fix your problem, can you try it out?

Best, Chris

nicoTR commented 7 years ago

Hi Chris,

Deal for next year then! I have tried the new branch and the issue is fixed thanks! Do you think I could jump to the DESMAN complete example but using the clustering_refine_scg.tsv ?

Have fun at STAMPs and say Hi to Meren and Amy for me!

Thanks a lot for your help again!

Nico

chrisquince commented 7 years ago

Hi Nico,

Do you mean your clustering_refine_scg.tsv or the one that I have provided as a download? In theory both should work but if any issues come up let me know.

Yes definitely! I would love to visit Lake Champlain. That is great, finally I get to close the thread :)

Best, Chris

ntromas commented 7 years ago

Hi Chris,

How is STAMPs?

Yes I mean the clustering_refine_scg.tsv that I have generated. To be honest I am a little bit confused to how verify that the refine_file is better than the initial concoct output. Is there a way to validate that (recall value?) and then to split the new clusters based on the taxa of interest?

Thanks and hope you're have some fun there!

Cheers,

Nico

2017-08-01 6:11 GMT-04:00 chrisquince notifications@github.com:

Hi Nico,

Do you mean your clustering_refine_scg.tsv or the one that I have provided as a download? In theory both should work but if any issues come up let me know.

Yes definitely! I would love to visit Lake Champlain. That is great, finally I get to close the thread :)

Best, Chris

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/chrisquince/DESMAN/issues/17#issuecomment-319329003, or mute the thread https://github.com/notifications/unsubscribe-auth/AHHR-G66bsLQQq0on8-b4-1J-zee8gD7ks5sTvnXgaJpZM4Oc95M .

--


Nicolas Tromas PhD Université de Montréal Département de sciences biologiques Microbial Evolutionary Genomics Group-Laboratoire de Jesse Shapiro Pavillon Marie-Victorin 90 Vincent-d'Indy, Montréal, Québec, H2V 2S9 Phone: (514) 343 6111 3188 E-mail: tromas.nicolas@gmail.com tromas.nicolas@gmail.com Web: http://www.shapirolab.ca/


chrisquince commented 7 years ago

Hi Nico,

Yes just count up the number of pure and complete genomes, which I arbitrarily define as those with 75% of scgs in single copy. In the StrainMetaSim tutorial I show you how to do that in R but you could write a script to do it:

scg_ref <- read.table("clustering_refine_scg.tsv",header=TRUE,row.names=1) scg_ref <- scg_ref[,-1] scg_ref <- scg_ref[,-1] sum(rowSums(scg_ref==1)/36 > 0.75)

Thanks, Chris