FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
382 stars 101 forks source link

Creating methylation file #611

Closed husain223 closed 1 year ago

husain223 commented 1 year ago

Hello, I have a question about generating a methylation file in a specific format. I've completed all the necessary steps, starting from genome preparation, methylation extraction, and using bismarkTobedgraph. Now, I'm looking to transform the methylation calls into the desired file format:

Screenshot 2023-08-01 at 2 58 35 PM

Column 1 to Column 3: Genomic coordinates. Column 4: Indicating the methylation. Column 5: The corresponding methylated positions on the read. I am seeking guidance on which specific arguments I should provide at a certain step in order to obtain the desired output file. If necessary, I am open to writing a script to generate the file, and in this case, I would appreciate any suggestions on which file would be most suitable for this purpose.

FelixKrueger commented 1 year ago

So you would like to create a file with one line per read, where the methylation is shown as 0 or 1 (in a comma separated form in Col 4, with the relative position within the read in Col 5?

I am afraid you would have to write a script to this. You might be able to use a trick that it already part of the methylation extraction process, followed by something similar to the NOMe-filtering step (https://github.com/FelixKrueger/Bismark/blob/7288cb6e99db92ceb82d2143bce588affe7335ff/NOMe_filtering).

In short, you would have to re-run the methylation extractor using the option --yacht option, which will add a few additional columns to the CpG etc. output files, in this format:

  <seq-ID>  <methylation state*>  <chromosome>  <start position (= end position)>  <methylation call>  <read start>  <read end>  <read orientation>

Importantly, in this way each methylation call in the CpG file retains the read information of the entire read (start, end, orientation), which will then allow you so derive the relative position of the C for the methylation call (column 5 in your example).

You will need to modify the NOMe-filtering script to not filter the NOMe-seq context, but rather just right out the positions into a comma-separated variable. All reads from a given read ID are already read in, it should really be fairly easy to accomplish what you need.

All the best! Felix

husain223 commented 1 year ago

Thanks i tried to give --yacht argument but it was complaining that the data should be in single end not in paired end

FelixKrueger commented 1 year ago

Yes, I am afraid this was only ever implemented in single-end mode. I suppose you could use the option -s to force single-end mode to see if that works in the first instance?

husain223 commented 1 year ago

Yep, it works, but using the "-s" option for paired-end data might loss some methylation details or end up with multiple methylation signals for regions that overlap.

husain223 commented 1 year ago

Yep, it works, but using the "-s" option for paired-end data might loss some methylation details or end up with multiple methylation signals for regions that overlap.

FelixKrueger commented 1 year ago

True that, I found this in the help-text:

...Its intended use is single-cell NOMe-Seq data, and thus this option works only in single-end mode (paired-end reads often suffer rom chimaera problems...)

If this is essential for your project I am afraid you would need to find a way to extend this to paired-end coordinates somewhow.