Closed timp0 closed 8 years ago
Thanks @timp0. I'll look into it and see if this can be implemented.
No problem - I have some small datasets of plasmid bisulfite sequencing which could be useful if you need a nice test set.
Test data would be great. I would really like that.
Would you prefer to be able to filter out partial coverage reads on a per-sample (chart) basis, or would it be sufficient to have a global setting that would toggle for all charts at once?
Here (https://www.dropbox.com/sh/yx645elb5yny036/AAB0k-P1vbVPsNMagqKFAApIa?dl=0) is a bam file and the reference fasta - it's of a plasmid that we did BS-seq on - I've also included regfile.txt which gives the regions of interest. The resulting html plots from methpat for OT and OB strand are also included
Global setting is fine for my application, but I suppose per sample is probably more flexible - are you thinking of a flag in the region file?
I was thinking of a control in the HTML visualisation which would toggle the display of partial reads.
But that would only affect the visualisation. Does that suit your use case?
Well - what I'm doing now is using an R script to filter the tsv output to get to a similar result. So - honestly I think a global flag for all output (tsv and html) is better, but of course up to you.
Would it be possible to upload the Bismark output to the dropbox folder?
Yup - it's in the TX20 directory in that same folder now
I've had a go at implementing this feature request.
The new code is in a branch called filter_unknowns
https://github.com/bjpop/methpat/tree/filter_unknowns
I've added a new optional command line argument to turn on the feature "--filterpartial".
You can use it like so:
methpat --filterpartial --amplicons regfile.txt --html example.html example_bismark.txt > example.results.txt
I would appreciate it if you have a chance to test it out.
Don't hesitate to ask if you need help getting the new code, or in trying it out.
Note that the order in which some of the patterns displayed in the HTML output might be slightly different when the unknowns are filtered, for patterns which have the same read count. However, the same set of patterns should be present, even if the ordering is slightly different. The ordering is a bit arbitrary when they have the same count.
Tested - results in dropbox as .pf. two issues:
1) Got a UTF-8 encode error from visualize.py, fixed by including a "# -- coding: utf-8 --" line at the top of that code
2) Larger issue, targreg3 in the OB strand doesn't show. I know why - if you look at that in the full reads, you'll see that one of the reads has a CG that no other read has - at 1379. This is not a real CG, it's generated by indels(or something) in the aligned read, and reported by bismark. I've run into this before - Felix says it's intentional because a CG generated by a SNP should be reported. As a result, all reads get filtered, because that read is missing a CG, and the other reads are all missing that 1379 CG.
Thanks @timp0
1) Can you please show the exact command that you used, and if possible, let me know which operating system you are using. I'm curious to see if I can reproduce it.
2) This is an interesting issue. Do you have any thoughts about how we might address it? Nothing simple immediately comes to mind. I'll keep thinking.
1) Yep - I ran as "~/Code/methpat/methpat/methpat.py --filterpartial --amplicons regfile.txt ${outdir}/${samp}/CpGOT${samp}*.txt --webassets online --logfile ${samp}.top.pf.methpat.log \ --html ${samp}.top.pf.methpat.html >${samp}.top.pf.methpat.tsv"
This is on Ubuntu 14.04 LTS with Python 2.7.6 (though I also have python 3 installed, it typically requires a specific call to python3, default python calls python2)
2) Yeah, so, I wrote some R code to filter and it works pretty well. I look for CGs that are present in at least 1% of reads (because this aberrant CG only appears at small frequency due to accuracy of Illumin). I then filter out CGs present at less than this frequency, then filter out reads which don't have all of the >1% covered CGs, and plot.
Thanks @timp0.
Sorry for my slow reply, have been a bit distracted by other work.
Your proposed approach sounds good. I'll have a go at implementing it soon. May be a bit slow due to being busy, but do let me know if this is holding up your work.
Hi @timp0
I've made an attempt at implementing your suggested 2) solution above.
The new code is pushed to the filter_unkowns branch on github.
I've added a new command line argument --min_cpg_percent PERCENTAGE
If you specify this argument then it will only consider CPG sites which appear in at least PERCENTAGE of reads for an amplicon.
This works on its own, or in conjunction with the --filterpartial argument already added.
I think this addresses the problem that you saw in targreg3.
Here is an example command line:
methpat --amplicons regfile.txt --min_cpg_percent 1 --filterpartial CpG_OB_TX20_S12_L001_R1_001_val_1.fq.gz_bismark_bt2_pe.txt > out.tsv
When you have time I would appreciate it if you could test it out and let me know how it goes.
Cheers!
hi @bjpop
There's a syntax error on my system "SyntaxError: invalid syntax File "/home/timp/Code/methpat/methpat/methpat.py", line 237
I think because you have a comment in the middle of the conditional, which my version of python (2.7.6 on ubuntu) doesn't seem to like. When I altered that, worked perfectly.
Sorry about that.
I've fixed in c84d10cda40ba4fef8874ea90aaec3071b534547
If that's all good with you I will merge to master and update the user docs.
Cheers!
It's great - thanks so much for working on this issue - I think it makes the code that much stronger personally, and definitely helps me!
No worries. I appreciate all the input you have provided. It was very helpful. I agree that it makes the code stronger.
I'll work towards merging it to master and fixing the docs.
Documents updated and merged to master.
Would it be possible to filter out reads with partial coverage - remove the reads that have "unknown" methylation status? It'd be useful for high-coverage BS-seq data sets where we want to plot specific regional patterns.