FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
GNU General Public License v3.0

More than 29 million rows in the coverage file #635

Closed demis001 closed 8 months ago

demis001 commented 8 months ago


Is there an easy way to represent each CpG as a single row in the "*.cov.gz" file for paired-end data?

Best, @demis001

FelixKrueger commented 8 months ago

Each row in the coverage file is a single position that was called as a CpG (single-C resolution). If you want to relate the calls back to the genome and merge the top- and bottom-strand Cs of each CpG dinucleotide, you can run coverage2cytosine --merge_CpG ...

demis001 commented 8 months ago

Does this output information similar to the "*cov.gz" file, with counts of methylated and unmethylated reads, but merged across both strands? What I am looking for is a count that summarizes each CpG in a single row, instead of separate rows for the C and the G.

FelixKrueger commented 8 months ago

Yes, it will (use --help for more details):

  genome-wide CpG report (old)
  gi|9626372|ref|NC_001422.1|     157     +       313     156     CG
  gi|9626372|ref|NC_001422.1|     158     -       335     156     CG
  merged CpG evidence coverage file (new)
  gi|9626372|ref|NC_001422.1|     157     158     67.500000       648     312
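
The arithmetic behind the merged row in the example above can be checked by hand: the methylated and unmethylated counts of the two strands are summed, and the percentage is recomputed from the totals. The following is only a minimal sketch of that calculation, not Bismark's implementation; the function name `merge_cpg` and the tuple layout are hypothetical.

```python
def merge_cpg(top, bottom):
    """Merge the top- and bottom-strand rows of one CpG dinucleotide.

    Each row is a tuple (chrom, pos, strand, meth_count, unmeth_count).
    This tuple layout is an assumption made for illustration only.
    """
    chrom, pos_top, strand_top, meth_top, unmeth_top = top
    chrom_bot, pos_bot, strand_bot, meth_bot, unmeth_bot = bottom

    # The two rows must be the + strand C and the - strand C one base later.
    if (chrom != chrom_bot or strand_top != "+" or strand_bot != "-"
            or pos_bot != pos_top + 1):
        raise ValueError("rows are not the two strands of the same CpG")

    meth = meth_top + meth_bot
    unmeth = unmeth_top + unmeth_bot
    total = meth + unmeth
    # Percentage is undefined when neither strand has any coverage.
    pct = 100.0 * meth / total if total else None
    return (chrom, pos_top, pos_bot, pct, meth, unmeth)


# The two "old" report rows quoted in the thread.
top = ("gi|9626372|ref|NC_001422.1|", 157, "+", 313, 156)
bottom = ("gi|9626372|ref|NC_001422.1|", 158, "-", 335, 156)

merged = merge_cpg(top, bottom)
print(merged)
# ('gi|9626372|ref|NC_001422.1|', 157, 158, 67.5, 648, 312)
```

The result matches the "new" row: 313 + 335 = 648 methylated, 156 + 156 = 312 unmethylated, and 648 / 960 = 67.5%.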

demis001 commented 8 months ago

I will let you know after the test run is complete. The idea is to run multivariate analysis in packages like DSS and bsseq.

mkdir merged_coverage
coverage2cytosine --merge_CpG --gzip --output merged_coverage --genome_folder /datamain/genome/hg38_r109/bismarkindx 184_S52_L003_R1_001_val_1_bismark_bt2_pe.deduplicated.bam


demis001 commented 8 months ago

I don't see a multi-thread option. Is this single-threaded? I have 100 samples.

demis001 commented 8 months ago

It also shows a lot of errors while running:

Use of uninitialized value within %chromosomes in pattern match (m//) at /home/ddjimamain/bin/Bismark-0.24.1/coverage2cytosine line 239, line 6840.

FelixKrueger commented 8 months ago

The input for coverage2cytosine is a coverage file (cov.gz), not a BAM file.
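
A possible way to act on this for many samples is sketched below. It builds the command with a coverage file as input, refuses BAM files up front, and runs several samples side by side as separate processes. Everything here is an assumption for illustration: the helper names `build_merge_command` and `run_all`, the example coverage file name, and the use of `--dir` for the output directory with `-o` for the output file name (please confirm both against `coverage2cytosine --help` for your Bismark version).

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_merge_command(cov_file, genome_folder, out_dir, out_name):
    """Return the argument list for one coverage2cytosine --merge_CpG run."""
    cov = Path(cov_file)
    # coverage2cytosine expects a coverage file, not an alignment (BAM) file.
    if not cov.name.endswith((".cov", ".cov.gz")):
        raise ValueError(f"expected a .cov or .cov.gz file, got {cov.name}")
    return [
        "coverage2cytosine",
        "--merge_CpG",
        "--gzip",
        "--genome_folder", str(genome_folder),
        "--dir", str(out_dir),   # assumed: output directory option
        "-o", out_name,          # assumed: output file name option
        str(cov),
    ]


def run_all(commands, workers=4):
    """Run the commands as parallel processes and return their exit codes."""
    def run_one(cmd):
        return subprocess.run(cmd, check=False).returncode

    # Threads only wait on child processes, so a thread pool is sufficient.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, commands))


# Hypothetical coverage file name; use the file your extraction step produced.
cmd = build_merge_command(
    "184_S52_L003_R1_001_val_1_bismark_bt2_pe.deduplicated.bismark.cov.gz",
    "/datamain/genome/hg38_r109/bismarkindx",
    "merged_coverage",
    "184_S52_merged",
)
print(" ".join(cmd))
```

Calling `run_all([cmd1, cmd2, ...], workers=4)` would then process the samples in batches. Each run needs access to the genome sequence, so the number of workers is likely to be limited by available memory rather than by CPU cores.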