Closed nicoTR closed 7 years ago
Hi Nico,
Has the Cluster_scgs.freq file been properly generated? This should have base frequencies for each gene and each position in each sample. Can you attach the first 10,000 lines from it and I can try running?
Best, Chris
Hi Chris,
Here we go, I have no idea if the file was properly generated...
Cheers,
Nico
And I got a similar error with that one (another study).
Hi Nico,
The first file has not been generated properly. Have you tried the new script in the updated tutorial?
ExtractCountFreqGenes.py
It is more robust, but it does need the count files zipped up.
Your second file looks fine, though, and I was able to run Variant_Filter.py, obtaining 246 variants. Are you using the current, up-to-date version of DESMAN? I did fix some bugs last month.
Best, Chris
Ps to be clear they need to be gzipped.
Hi Chris,
Thanks! It worked! The output of the heuristic approach gave me: 5,1,2,0.296568627451,ClusterMC_5_2/Filtered_Tau_star.csv. Do you know if there is a way to extract the core COG sequences for each of these haplotypes (to make a phylogenetic tree, for example)?
Another quick question: if you have "species" with very similar (>95% identity) genomes within a genus, I wonder if that could cause issues for CONCOCT? I know that for the mapping it could be problematic.
Thanks a lot for your help and advice!
Nico
I am afraid not Nico, that output is actually predicting just one haplotype that we can be confident in, even though that was in the run with 5 haplotypes. Is this from the second data file? I will run it and check.
If the 'species' are different enough to be assembled onto different contigs then Concoct may still work assuming they are also differentially abundant. Mapping will be hard though.
Chris
Hey Chris,
Well no, it is from the 1st one, where I used the latest version of the different scripts (this time I did not get any errors, so I assume it worked).
Nico
Hi Nico,
Well the bad news is that I am not confident that you have multiple strains in the first genome. That organism is not very abundant across all samples so I had to reduce the minimum coverage threshold in Variant_Filter:
python /mnt/gpfs/chris/repos/DESMAN/desman/Variant_Filter.py Cluster_MC_scgs.freq -p -o Cluster_MC_scgs -m 1.0
Doing this predicted 138 variant positions. I then ran DESMAN and got the following result from resolvenhap with default parameters:
1,1,7,0.0014598540146,Cluster_MC_1_7/Filtered_Tau_star.csv
That means just one haplotype but if we raise the accuracy threshold to 15% for calling a strain then...
4,3,10,0.10097323601,Cluster_MC_4_10/Filtered_Tau_star.csv
i.e. 3 strains with error rates between 10-15%, so they may be valid, or they may not.
However, from your other sample, using the default minimum coverage of 5, we confidently predict 4 haplotypes; personally I would focus on this organism:
4,4,10,0.0284693877551,Cluster_MC2_4_10/Filtered_Tau_star.csv
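A sketch of how these resolvenhap.py summary lines can be unpacked. The field meanings here are inferred from this thread (run size, confidently called haplotypes, replicate, error rate, output path), not taken from the DESMAN documentation, so treat them as assumptions:

```python
# Hypothetical parser for a resolvenhap.py summary line such as
# "4,4,10,0.0284693877551,Cluster_MC2_4_10/Filtered_Tau_star.csv".
# Field meanings are assumptions inferred from this conversation.

def parse_resolvenhap(line):
    fields = line.strip().split(",")
    return {
        "run_haplotypes": int(fields[0]),    # haplotypes in the selected run
        "good_haplotypes": int(fields[1]),   # haplotypes called confidently
        "replicate": int(fields[2]),         # replicate / seed of the best run
        "error_rate": float(fields[3]),      # mean error rate of the fit
        "tau_star": fields[4],               # path to filtered predictions
    }

result = parse_resolvenhap(
    "4,4,10,0.0284693877551,Cluster_MC2_4_10/Filtered_Tau_star.csv")
print(result["good_haplotypes"], result["error_rate"])
```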
Tomorrow I will point you in the direction of some code to build a phylogenetic tree for it. I will post up the results for you too. I am just trying this organism with a minimum coverage of 1.0 as well, since that will give you a bit more info.
Hey Chris,
Thanks a lot! Well, I think I should remove the samples where these 'species' have low abundance. That would surely improve the results, as you showed. I'll try that for the 1st dataset.
For the second, that's funny because I got a different result (without playing with the parameters). I got 2,2,2,0.00994318181818. I'll redo it just in case!
Thanks for the code too!
Nico
Actually that only added one extra sample, so you can download my runs here:
https://nicotest.s3.climb.ac.uk/nicoTest.tar.gz
Sorry, it is a bit of a mess: the runs for sample 2 are in the top-level dir and sample 1 is in the subdir Cluster_MC1
so in the top level dir run...
python $DESMAN/scripts/resolvenhap.py Cluster_MC2
to get...
4,4,10,0.0284693877551,Cluster_MC2_4_10/Filtered_Tau_star.csv
The difference with your runs may be that I did extra iterations and ran 10 replicates. Look at my DESMAN.sh script in the top level dir...
for g in 1 2 3 4 5 6 7 8; do
    for r in 0 1 2 3 4 5 6 7 8 9 10; do
        desman Cluster_MC_scgs_2sel_var.csv -e Cluster_MC_scgs_2tran_df.csv -o Cluster_MC2_${g}_${r} -i 1000 -g $g -s $r > Cluster_MC2_${g}_${r}.out &
    done
done
That is probably advisable on real data.
Ok will redo that with more replicates and removing low abundant samples! Thanks Chris!
Just to be clear, I was actually adding low-abundance samples; the default minimum coverage for a sample is 5.0. I am suggesting running:
-m 1.0
to reduce it to one. Actually, try that with desman too (I forgot that); it may give you better results for the first data set. Remind me if I don't send a link for reconstructing the gene sequences tomorrow.
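Before choosing the -m threshold, it can help to see each sample's mean coverage. A rough sketch, where the freq-file layout is an assumption (a header row, some leading metadata columns, then four base-count columns A,C,G,T per sample, comma-separated):

```python
# Hypothetical helper: mean per-sample coverage from a base-frequency file.
# The column layout (n_meta metadata columns, then 4 base counts per sample)
# is an assumption, not the documented DESMAN format.
import csv

def mean_sample_coverage(path, n_meta=2):
    with open(path) as fh:
        reader = csv.reader(fh)
        next(reader)  # skip header row
        totals, n_rows = None, 0
        for row in reader:
            counts = [int(x) for x in row[n_meta:]]
            n_samples = len(counts) // 4
            # coverage at this position = sum of the four base counts
            per_sample = [sum(counts[4 * s:4 * s + 4]) for s in range(n_samples)]
            if totals is None:
                totals = [0] * n_samples
            totals = [t + c for t, c in zip(totals, per_sample)]
            n_rows += 1
    return [t / n_rows for t in totals]
```

Samples whose mean coverage falls well below the threshold are the ones that -m 1.0 keeps and -m 5.0 would drop.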
Best, Chris
I have a last question for you, Chris: what do you think about these scores for the CONCOCT output? Pretty low, I think. I was not sure the recall score was good enough. What do you think?
1st analysis
N       M     TL          S   K    Rec.      Prec.     NMI       Rand      AdjRand
218721  3227  1.4142e+07  43  368  0.606240  0.984284  0.729601  0.847015  0.500359
2nd analysis
N       M     TL          S   K    Rec.      Prec.     NMI       Rand      AdjRand
104823  2286  9.8707e+06  29  237  0.729150  0.962161  0.748194  0.853822  0.618765
The issue is that only a fraction of the contigs are classified. It is more useful in this case to look at the number of complete genomes you have from the SCG frequencies. Actually I have a complete CONCOCT/DESMAN here. It is not quite finished yet but it has the SCG evaluation...
Hey!
Oh ok, I thought that working with higher coverage would give more confidence (about samples with low abundance). For the CONCOCT output, ok, I will do that; thanks a lot for all the advice!
Hi Chris,
How was the day? I am trying StrainMetaSim (at least the part that does the SCG evaluation), but I can't find the Sort.pl script in CONCOCT... Whenever you can, I would be super interested in your code for the phylogenetic tree, thanks a lot!!!
Cheers,
Nico
Hi Nico,
Busy as always. Trying to find the time for this now. Did you install the SpeedUp_Mp branch of CONCOCT i.e:
git clone git@github.com:BinPro/CONCOCT.git
cd CONCOCT
git fetch
git checkout SpeedUp_Mp
sudo python ./setup.py install
That should have the Sort.pl script. I am working on adding code to get core gene sequences to the StrainMetaSim repo right now.
Best, Chris
Hi Chris,
Yes, I can imagine, sorry to bother you with that too! Do you plan to go to STAMPs this year?
I have installed the SpeedUp_Mp branch and got the Sort.pl script. Thanks!
Just got this error after concoct_refine:
concoct_refine clustering_gt1000_R.csv original_data_gt1000.csv clustering_gt1000_scg_sort.csv > concoct_ref.out
Traceback (most recent call last):
File "/usr/local/bin/concoct_refine", line 4, in
Any idea? Thanks again for your help on this!
Try...
concoct_refine clustering_gt1000.csv original_data_gt1000.csv clustering_gt1000_scg_sort.csv
Sure will you be there? It would be nice to put a face to a very long github conversation!
Would have been great, but I will be sampling my favorite lake all summer!
I tried and got:
concoct_refine clustering_gt1000.csv original_data_gt1000.csv clustering_gt1000_scg_sort.csv
Run CONCOCT for 18 with 4.0 clusters using 1 threads
1063 34 -1201124664
Setting -1201124664 OMP threads
Generate input data
0,-54551.614414,52.013563
1,-54535.015938,16.598476
2,-54536.088188,1.072250
3,-54534.873889,1.214299
4,-54530.255643,4.618246
5,-54530.250214,0.005429
6,-54530.250267,0.000053
Error in `/usr/bin/python': double free or corruption (!prev): 0x0000000005401570
[~100 backtrace lines through libc and concoct-0.4.1's vbgmm.so (destroyCluster, driverMP, c_vbgmm_fit), followed by a memory map dump of /usr/bin/python2.7]
And the last word:
"Aborted"
Memory issue?
Nico
Oh dear, you can see the problem is garbled variables passed from python into C. It is just possible that you need to clean out the previous install of CONCOCT but I do not know how to do that. If you post up the three files for concoct_refine then I can test them on my implementation.
ps Where is your favourite lake?
OK I have added a new script to the repo that gets core gene sequences. It is untested as yet, I will do that tomorrow. Run as follows:
python $DESMAN/scripts/GetVariantsCore.py ../../Annotate/final_contigs_gt1000_c10K.fa Cluster85_core.cogs Filtered_Tau_star.csv coregenes.txt
where ../../Annotate/final_contigs_gt1000_c10K.fa are the contig sequences the core genes were called on.
Cluster85_core.cogs
are the core gene annotations with COGs, in the format:
COG0016,k141_47503,1599,2616,k141_47503_5,1
Filtered_Tau_star.csv are the DESMAN variant predictions
coregenes.txt is a list of core genes to generate the sequences of
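For reference, the core.cogs line above can be unpacked like this. The field meanings (COG id, contig, start, end, gene id, strand) are assumptions based on that single example, not on documented format:

```python
# Hypothetical parser for one line of a core.cogs file, e.g.
# "COG0016,k141_47503,1599,2616,k141_47503_5,1".
# Field meanings are inferred from the example in this thread.

def parse_cog_line(line):
    cog, contig, start, end, gene, strand = line.strip().split(",")
    return {
        "cog": cog,          # COG identifier
        "contig": contig,    # contig the gene was called on
        "start": int(start), # gene start coordinate
        "end": int(end),     # gene end coordinate
        "gene": gene,        # gene/ORF identifier
        "strand": int(strand),  # presumably 1 / -1 for orientation
    }

print(parse_cog_line("COG0016,k141_47503,1599,2616,k141_47503_5,1"))
```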
Hope that is clear. I will add a use case to StrainMetaSim tomorrow and test them out. You can reverse sequences ('-r') that are on the reverse strand; you will need this if you want to compare to reference sequences.
Best, Chris
Hey!
Lake Champlain :) But we will finally meet next year for sure (in Exeter!).
Fantastic for the script thanks! Will try that!
Ok, I can try for the 3 files, but one is pretty big, so I have no idea if there is any size limit via GitHub. If that's the case I will send the file to your mail. Thanks!
Cheers,
Nico
Oh yes, of course, my apologies; it is hard sometimes to figure out who people are from github logins!
I used to spend a lot of time on Lake Opeongo; all those Canadian lakes are lovely.
Tomorrow, I will track down that CONCOCT bug and then we can close this huge thread!
Well, I did not introduce myself properly, so my apologies! You are more than welcome if you want to visit us in Mtr after STAMPs. Lake Champlain will probably be full of 'wonderful' cyano. The bug, I think, is linked with Numpy...
Nico
Hi Nico,
I love Cyano; I just did a paper on them (or at least on other oceanic nitrogen fixers) with Tom and Meren. Unfortunately my flight is already booked, but next year definitely!
I have pushed a new version of that branch, that may fix your problem, can you try it out?
Best, Chris
Hi Chris,
Deal for next year then! I have tried the new branch and the issue is fixed, thanks! Do you think I could jump to the DESMAN complete example but using the clustering_refine_scg.tsv?
Have fun at STAMPs and say Hi to Meren and Amy for me!
Thanks a lot for your help again!
Nico
Hi Nico,
Do you mean your clustering_refine_scg.tsv or the one that I have provided as a download? In theory both should work but if any issues come up let me know.
Yes definitely! I would love to visit Lake Champlain. That is great, finally I get to close the thread :)
Best, Chris
Hi Chris,
How is STAMPs?
Yes, I mean the clustering_refine_scg.tsv that I have generated. To be honest, I am a little bit confused about how to verify that the refined file is better than the initial CONCOCT output. Is there a way to validate that (the recall value?) and then to split the new clusters based on the taxa of interest?
Thanks, and I hope you're having some fun there!
Cheers,
Nico
Hi Nico,
Yes, just count up the number of pure and complete genomes, which I arbitrarily define as those with 75% of SCGs in single copy. In the StrainMetaSim tutorial I show how to do that in R, but you could write a script to do it:
scg_ref <- read.table("clustering_refine_scg.tsv", header=TRUE, row.names=1)
scg_ref <- scg_ref[,-1]
scg_ref <- scg_ref[,-1]
sum(rowSums(scg_ref==1)/36 > 0.75)
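The same count can be scripted in Python. This is a stdlib-only sketch that mirrors the R logic above, under the same layout assumptions: a header row, a cluster-name column, two columns to drop, then 36 SCG copy-count columns:

```python
# Stdlib-only equivalent of the R snippet: count clusters where more than
# 75% of the 36 SCGs are present in exactly single copy. Column layout
# (name column + two dropped columns + 36 counts) mirrors the R code.
import csv

def count_complete_genomes(path, n_scgs=36, threshold=0.75):
    complete = 0
    with open(path) as fh:
        reader = csv.reader(fh, delimiter="\t")
        next(reader)  # skip header
        for row in reader:
            counts = [int(x) for x in row[3:]]  # drop name + two columns
            if sum(c == 1 for c in counts) / n_scgs > threshold:
                complete += 1
    return complete
```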
Thanks, Chris
Hi Chris,
I have generated the Cluster_scgs.freq file (file size 4109754) and when I tried the Variant_Filter.py script I got this error; do you have an idea of what is going on? Is N null? Thanks for your help on that!
Cheers,
Nico
~/DESMAN-master/desman/Variant_Filter.py:340: RuntimeWarning: invalid value encountered in divide
  p = n/N
~/DESMAN-master/desman/Variant_Filter.py:341: RuntimeWarning: invalid value encountered in greater
  p[p > self.upperP] = self.upperP
~/DESMAN-master/desman/Variant_Filter.py:356: RuntimeWarning: invalid value encountered in less
  self.filtered = np.logical_or(N < self.Nthreshold,ratioNLL < self.threshold)
/usr/local/lib/python2.7/dist-packages/scipy/stats/_distn_infrastructure.py:875: RuntimeWarning: invalid value encountered in greater
  return (self.a < x) & (x < self.b)
/usr/local/lib/python2.7/dist-packages/scipy/stats/_distn_infrastructure.py:875: RuntimeWarning: invalid value encountered in less
  return (self.a < x) & (x < self.b)
/usr/local/lib/python2.7/dist-packages/scipy/stats/_distn_infrastructure.py:1731: RuntimeWarning: invalid value encountered in greater_equal
  cond2 = (x >= self.b) & cond0
~/DESMAN-master/desman/Variant_Filter.py:379: RuntimeWarning: invalid value encountered in greater
  self.filtered = np.logical_or(N < self.Nthreshold,self.qvalue > self.qvalue_cutoff)
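A minimal sketch of what warnings like these usually mean: if some positions have zero total coverage (N == 0), the element-wise division n/N produces NaN (0/0), and every later comparison against NaN triggers the same "invalid value encountered" RuntimeWarnings. This would be consistent with a freq file containing positions, or whole samples, with no mapped reads:

```python
# Demonstrate NaN propagation from zero total coverage, the likely source
# of the "invalid value encountered in divide" warnings above.
import numpy as np

n = np.array([3.0, 0.0, 5.0])    # variant base counts
N = np.array([10.0, 0.0, 10.0])  # total coverage; second position is zero

with np.errstate(invalid="ignore"):  # silence the warning for this demo
    p = n / N                        # 0/0 at the zero-coverage position -> NaN

print(np.isnan(p))  # the zero-coverage position becomes NaN
```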