bjpop / methpat

A program for summarising CpG methylation patterns
Other
19 stars 5 forks source link

Possible to filter reads with partial coverage #24

Closed timp0 closed 8 years ago

timp0 commented 8 years ago

Would it be possible to filter out reads with partial coverage - remove the reads that have "unknown" methylation status? It'd be useful for high-coverage BS-seq data sets where we want to plot specific regional patterns.

bjpop commented 8 years ago

Thanks @timp0. I'll look into it and see if this can be implemented.

timp0 commented 8 years ago

No problem - I have some small datasets of plasmid bisulfite sequencing which could be useful if you need a nice test set.

bjpop commented 8 years ago

Test data would be great. I would really like that.

Would you prefer to be able to filter out partial coverage reads on a per-sample (chart) basis, or would it be sufficient to have a global setting that would toggle for all charts at once?

timp0 commented 8 years ago

Here (https://www.dropbox.com/sh/yx645elb5yny036/AAB0k-P1vbVPsNMagqKFAApIa?dl=0) is a bam file and the reference fasta - it's of a plasmid that we did BS-seq on - I've also included regfile.txt which gives the regions of interest. The resulting html plots from methpat for OT and OB strand are also included

Global setting is fine for my application, but I suppose per sample is probably more flexible - are you thinking of a flag in the region file?

bjpop commented 8 years ago

I was thinking of a control in the HTML visualisation which would toggle the display of partial reads.

But that would only affect the visualisation. Does that suit your use case?

timp0 commented 8 years ago

Well - what I'm doing now is using an R script to filter the tsv output to get to a similar result. So - honestly I think a global flag for all output (tsv and html) is better, but of course up to you.

bjpop commented 8 years ago

Would it be possible to upload the Bismark output to the dropbox folder?

timp0 commented 8 years ago

Yup - it's in the TX20 directory in that same folder now

bjpop commented 8 years ago

I've had a go at implementing this feature request.

The new code is in a branch called filter_unknowns

https://github.com/bjpop/methpat/tree/filter_unknowns

I've added a new optional command line argument to turn on the feature "--filterpartial".

You can use it like so:

methpat --filterpartial --amplicons regfile.txt --html example.html example_bismark.txt > example.results.txt

I would appreciate it if you have a chance to test it out.

Don't hesitate to ask if you need help getting the new code, or in trying it out.

Note that the order in which some of the patterns displayed in the HTML output might be slightly different when the unknowns are filtered, for patterns which have the same read count. However, the same set of patterns should be present, even if the ordering is slightly different. The ordering is a bit arbitrary when they have the same count.

timp0 commented 8 years ago

Tested - results in dropbox as .pf. two issues:

1) Got a UTF-8 encode error from visualize.py, fixed by including a "# -- coding: utf-8 --" line at the top of that code

2) Larger issue, targreg3 in the OB strand doesn't show. I know why - if you look at that in the full reads, you'll see that one of the reads has a CG that no other read has - at 1379. This is not a real CG, it's generated by indels(or something) in the aligned read, and reported by bismark. I've run into this before - Felix says it's intentional because a CG generated by a SNP should be reported. As a result, all reads get filtered, because that read is missing a CG, and the other reads are all missing that 1379 CG.

bjpop commented 8 years ago

Thanks @timp0

1) Can you please show the exact command that you used, and if possible, let me know which operating system you are using. I'm curious to see if I can reproduce it.

2) This is an interesting issue. Do you have any thoughts about how we might address it? Nothing simple immediately comes to mind. I'll keep thinking.

timp0 commented 8 years ago

1) Yep - I ran as "~/Code/methpat/methpat/methpat.py --filterpartial --amplicons regfile.txt ${outdir}/${samp}/CpGOT${samp}*.txt --webassets online --logfile ${samp}.top.pf.methpat.log \ --html ${samp}.top.pf.methpat.html >${samp}.top.pf.methpat.tsv"

This is on Ubuntu 14.04 LTS with Python 2.7.6 (though I also have python 3 installed, it typically requires a specific call to python3, default python calls python2)

2) Yeah, so, I wrote some R code to filter and it works pretty well. I look for CGs that are present in at least 1% of reads (because this aberrant CG only appears at small frequency due to accuracy of Illumin). I then filter out CGs present at less than this frequency, then filter out reads which don't have all of the >1% covered CGs, and plot.

bjpop commented 8 years ago

Thanks @timp0.

Sorry for my slow reply, have been a bit distracted by other work.

Your proposed approach sounds good. I'll have a go at implementing it soon. May be a bit slow due to being busy, but do let me know if this is holding up your work.

bjpop commented 8 years ago

Hi @timp0

I've made an attempt at implementing your suggested 2) solution above.

The new code is pushed to the filter_unkowns branch on github.

I've added a new command line argument --min_cpg_percent PERCENTAGE

If you specify this argument then it will only consider CPG sites which appear in at least PERCENTAGE of reads for an amplicon.

This works on its own, or in conjunction with the --filterpartial argument already added.

I think this addresses the problem that you saw in targreg3.

Here is an example command line:

methpat --amplicons regfile.txt --min_cpg_percent 1 --filterpartial CpG_OB_TX20_S12_L001_R1_001_val_1.fq.gz_bismark_bt2_pe.txt > out.tsv

When you have time I would appreciate it if you could test it out and let me know how it goes.

Cheers!

timp0 commented 8 years ago

hi @bjpop

There's a syntax error on my system "SyntaxError: invalid syntax File "/home/timp/Code/methpat/methpat/methpat.py", line 237

If args.min_cpg_percent is set, only consider CPG sites which appear in"

I think because you have a comment in the middle of the conditional, which my version of python (2.7.6 on ubuntu) doesn't seem to like. When I altered that, worked perfectly.

bjpop commented 8 years ago

Sorry about that.

I've fixed in c84d10cda40ba4fef8874ea90aaec3071b534547

If that's all good with you I will merge to master and update the user docs.

Cheers!

timp0 commented 8 years ago

It's great - thanks so much for working on this issue - I think it makes the code that much stronger personally, and definitely helps me!

bjpop commented 8 years ago

No worries. I appreciate all the input you have provided. It was very helpful. I agree that it makes the code stronger.

I'll work towards merging it to master and fixing the docs.

bjpop commented 8 years ago

Documents updated and merged to master.