FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
367 stars 101 forks source link

cytosine_report only with data? #143

Closed DiegoZavallo closed 6 years ago

DiegoZavallo commented 6 years ago

Hi Felix, I wonder if there is a way to have the cytosine_report file from the coverage2cytosine function but only with the relevant information. I have amplicon sequencing data, therefore I only have a couple of amplicon disperse around the genome with real data. I'm using the DMRCaller software to analyze DMRs and the bismark file that it use is the cytosine_report file (CX) but as you know the files are very large (~1Gb) and my R session crash every time I start uploading the data. Is there a way to only have it with only the relevant data?

Best

Diego

FelixKrueger commented 6 years ago

Hi Diego,

As it stands the report is genome-wide so that you can have a matrix of all files that will share a common set of positions between them even if they not all covered in every sample.

I guess you have two options here.

  1. You write a quick script (or use a clever awk command) that will browse through the coverage report and discard positions that have 0 calls for both methylated or unmethylated cytosines.
  2. You change the coverage2cytosine code yourself so that it does that for you. I have done that some weeks ago for the output of NOMe-Seq data which really only reports positions if they were covered. If you locate the following lines:
else{
       if ($nome){ # added 20 Sept 2017
              unless ($meth + $nonmeth == 0){ # if the position was not covered it won't be reported
                     my $percentage = sprintf ("%.6f",$meth/ ($meth + $nonmeth) *100);
                     print CYT join ("\t",$last_chr,$pos,$strand,$meth,$nonmeth,$context,$tri_nt),"\n";
                     print CYTCOV join ("\t",$last_chr,$pos,$pos,$percentage,$meth,$nonmeth),"\n";
              }
       }
       else{
               print CYT join ("\t",$last_chr,$pos,$strand,$meth,$nonmeth,$context,$tri_nt),"\n";
       }
}

and add the same unless command to the default (i.e. not $nome) option you should get what you are looking for. Something like this:

else{
       unless ($meth + $nonmeth == 0){ # if the position was not covered it won't be reported
             print CYT join ("\t",$last_chr,$pos,$strand,$meth,$nonmeth,$context,$tri_nt),"\n";
       }
}

If you want to use C in CX context you might have to do the same thing a bit further down as well.

Let me know how you get on, I guess I could help you with the edits if that would help.

All the best, Felix

DiegoZavallo commented 6 years ago

Hi Felix, Thanks for your dedicated response. One of the options I was thinking is to parsed the output with awk, but changing the code of the coverage2cytosine function sounds really good. Unfortunately my bioinformatics skills are not that good, and I'm afraid to screw up. And even if I add the code that you wrote, how do I use it? Do I have to copy the script, rename it and copy into the PATH, or write the line and send the request to modify it? This is a screenshot of the changes...are there ok? screenshot from 2017-11-02 10-13-49

Do I sent the request? Do I have to wait for you to make the enhancement and then download the new version? It will came with an option to print only the covered positions?

In the meantime I will try to parsed the output with some scripting... Thanks again for all your help Best

Diego

FelixKrueger commented 6 years ago

Hi Diego,

Can you download the attached file, unzip it and try it on one of your files? I just tried it the following command:

./coverage2cytosine_covered_only --genome /bi/scratch/Genomes/E_coli/ ecoli.sar99-2016ORi12_S6_L001_R1_001_trimmed_bismark_bt2.bismark.cov.gz -o zzz_out

and it seemed to work just fine. coverage2cytosine_covered_only.zip

DiegoZavallo commented 6 years ago

Hi Felix!! It worked perfectly fine!! The only thing is that when you don't have any coverage in one of the chromosome (in my case in the ChrM) it will write all the non covered for that chromosome, but I just can delete those lines from the file. Almost at the same time I could manage to parse the original file, but I will definitely use your script for further analysis. Thanks so much!!! Diego

Storing all covered cytosine positions for chromosome: ChrC Writing cytosine report for chromosome ChrC (stored 83 different covered positions) Writing cytosine report for chromosome Chr5 (stored 142 different covered positions) Writing cytosine report for chromosome Chr3 (stored 774 different covered positions) Writing cytosine report for chromosome Chr4 (stored 89 different covered positions) Writing cytosine report for chromosome Chr1 (stored 1449 different covered positions) Writing cytosine report for last chromosome Chr2 (stored 80 different covered positions) Finished writing out cytosine report for covered chromosomes (processed 6 chromosomes/scaffolds in total)

Now processing chromosomes that were not covered by any methylation calls in the coverage file... Writing cytosine report for not covered chromosome ChrM Finished writing out cytosine report (processed 7 chromosomes/scaffolds in total). coverage2cytosine processing complete.

FelixKrueger commented 6 years ago

Oh yea, I completely forgot about those. If you locate the line that contains

   process_unprocessed_chromosomes($chr);    

and simply comment it out with a # it should be fine.

yongchao commented 5 years ago

Just to relive this issue. Would it be nice to modify the code base so that command coverage2cytosine can take an option like "--covergedonly"? We can always modify the code at our own local version, but the modification is going to be lost when we upgraded the version and we have to start over again.

FelixKrueger commented 5 years ago

I have tried to add a new option --coverage_threshold here: https://github.com/FelixKrueger/Bismark/issues/269

Can you give it a whirl and let me know if it works as expected?

DiegoZavallo commented 5 years ago

I didn't know that this option was added. I remember that Felix gave me a modified script called coverage2cytosine_covered_only which do the trick. Good to know that now is part of the options.

yongchao commented 5 years ago

I did the test with my own data and confirmed that It is working as expected. Thanks for adding this option.