FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
394 stars 103 forks source link

Generating cytosine report #655

Closed desmodus1984 closed 8 months ago

desmodus1984 commented 9 months ago

Hi,

I am studying differential methylation in bumblebees, and after using methylKit to find differentially methylated sites it seems better to find differentially methylated regions. I wanted to find DMR using the software swDMR (unless you suggest a different/better one) and it needs as input the cytosine report with strand and counts.

I am contacting you because for methylKit I have the bam files for methylation calling but I did pre-process them in a different way, not with Bismark. I trimmed the reads based on the suggested EM-seq parameters, and I checked for methylation-bias with Methyldackel with regular trimming parameters. After confirming best trimming parameters, I aligned the trimmed reads with bwa-meth, sam-to-bam conversion, and deduplication/elimination with Picard.

I wanted to ask if I can use the bam files that I have already generated to make the cytosine report or if I should re-processed my files all over again with Bismark.

Thank you again.

FelixKrueger commented 9 months ago

I am afraid I don't think the BAM files are compatible :(, and you would indeed have to re-start from the alignment step...

desmodus1984 commented 9 months ago

Good morning,

I installed Bismark as mentioned. When I tried doing the genome indexing it failed

home/juaguila/appz/Bismark-0.24.2/bismark_genome_preparation ../Bismark/ Writing bisulfite genomes out into a single MFA (multi FastA) file

Bisulfite Genome Indexer version v0.24.2 (last modified: 19 May 2022)

Step I - Prepare genome folders - completed

Total number of conversions performed: C->T: 52215405 G->A: 52297904

Step II - Genome bisulfite conversions - completed

Bismark Genome Preparation - Step III: Launching the Bowtie 2 indexer Please be aware that this process can - depending on genome size - take several hours! /home/applications/bowtie2/2.5.3/bowtie2-build-s: /lib64/libstdc++.so.6: version GLIBCXX_3.4.32' not found (required by /home/applications/bowtie2/2.5.3/bowtie2-build-s) /home/applications/bowtie2/2.5.3/bowtie2-build-s: /lib64/libstdc++.so.6: versionGLIBCXX_3.4.20' not found (required by /home/applications/bowtie2/2.5.3/bowtie2-build-s) /home/applications/bowtie2/2.5.3/bowtie2-build-s: /lib64/libstdc++.so.6: version CXXABI_1.3.8' not found (required by /home/applications/bowtie2/2.5.3/bowtie2-build-s) /home/applications/bowtie2/2.5.3/bowtie2-build-s: /lib64/libstdc++.so.6: versionGLIBCXX_3.4.26' not found (required by /home/applications/bowtie2/2.5.3/bowtie2-build-s) /home/applications/bowtie2/2.5.3/bowtie2-build-s: /lib64/libstdc++.so.6: version GLIBCXX_3.4.30' not found (required by /home/applications/bowtie2/2.5.3/bowtie2-build-s) /home/applications/bowtie2/2.5.3/bowtie2-build-s: /lib64/libstdc++.so.6: versionGLIBCXX_3.4.22' not found (required by /home/applications/bowtie2/2.5.3/bowtie2-build-s) /home/applications/bowtie2/2.5.3/bowtie2-build-s: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/applications/bowtie2/2.5.3/bowtie2-build-s) Traceback (most recent call last): File "/home/applications/bowtie2/2.5.3/bowtie2-build", line 134, in main() File "/home/applications/bowtie2/2.5.3/bowtie2-build", line 68, in main if sais_enabled(build_bin_spec): File "/home/applications/bowtie2/2.5.3/bowtie2-build", line 42, in sais_enabled output = subprocess.check_output([build_bin_spec, "--version"], universal_newlines=True) File "/usr/lib64/python3.6/subprocess.pyhttp://subprocess.py/", line 356, in check_output **kwargs).stdout File "/usr/lib64/python3.6/subprocess.pyhttp://subprocess.py/", line 438, in run output=stdout, stderr=stderr) subprocess.CalledProcessError: Command '['/home/applications/bowtie2/2.5.3/bowtie2-build-s', '--version']' returned non-zero exit status 1. Parent process: Failed to build index


One option that I am doubtful is that bowtie2 is installed in the HPC as a module and I loaded. Do I have to figure out the path of bowtie2 or it is obtained automatically?

Thanks,

Juan Pablo Aguilar


From: Felix Krueger @.> Sent: Tuesday, February 20, 2024 5:12:21 AM To: FelixKrueger/Bismark @.> Cc: Aguilar Cabezas, Juan Pablo @.>; Author @.> Subject: [External] Re: [FelixKrueger/Bismark] Generating cytosine report (Issue #655)

Use caution with links and attachments.

I am afraid I don't think the BAM files are compatible :(, and you would indeed have to re-start from the alignment step...

— Reply to this email directly, view it on GitHubhttps://github.com/FelixKrueger/Bismark/issues/655#issuecomment-1953881646, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AJWD2VJMA5CRM7P4JVOH74DYURZILAVCNFSM6AAAAABDP7LTP2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNJTHA4DCNRUGY. You are receiving this because you authored the thread.Message ID: @.***>

FelixKrueger commented 9 months ago

Hmm, this does indeed look like a Bowtie2 installation issue. If you don't specify the path yourself it will look in the PATH environment variable. It seems to use it from /home/applications/bowtie2/2.5.3/bowtie2-build.

You can find out the path with the command which bowtie2.

When you run /home/applications/bowtie2/2.5.3/bowtie2-build --help you should be able to see the help text. Otherwise, I would recommend you reintall Bowtie2 (or install Bismark using conda install bismark, which will also satisfy all other requirements)

desmodus1984 commented 9 months ago

I did actually think about that but the version in conda is 0.22 while the last one is 0.24.

Since I was told to best use the latest version due to bugs and updates.


From: Felix Krueger @.> Sent: Wednesday, February 21, 2024 10:20:32 AM To: FelixKrueger/Bismark @.> Cc: Aguilar Cabezas, Juan Pablo @.>; Author @.> Subject: [External] Re: [FelixKrueger/Bismark] Generating cytosine report (Issue #655)

Use caution with links and attachments.

Hmm, this does indeed look like a Bowtie2 installation issue. If you don't specify the path yourself it will look in the PATH environment variable. It seems to use it from /home/applications/bowtie2/2.5.3/bowtie2-build.

You can find out the path with the command which bowtie2.

When you run /home/applications/bowtie2/2.5.3/bowtie2-build --help you should be able to see the help text. Otherwise, I would recommend you reintall Bowtie2 (or install Bismark using conda install bismark, which will also satisfy all other requirements)

— Reply to this email directly, view it on GitHubhttps://github.com/FelixKrueger/Bismark/issues/655#issuecomment-1956921278, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AJWD2VIZAZWLUMOZO6BR2HTYUYGEBAVCNFSM6AAAAABDP7LTP2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNJWHEZDCMRXHA. You are receiving this because you authored the thread.Message ID: @.***>

FelixKrueger commented 9 months ago

The latest version on bioconda is 0.24.2, the instructions are here: https://bioconda.github.io/recipes/bismark/README.html

desmodus1984 commented 9 months ago

Hi,

I Installed Bismark in conda, and I am trying to run the workflow and I would appreciate if you could check my code, and tell me what I am writing wrong or missing.

I loaded the environment, and I am running the code in the folder, as an .sh file, and it is not creating any bam output. Here is my code which worked with bs_seeker2.

I start with a loop to grab the names of the files, but then alignment doesn't happen.

`for i in *_R1_val_1.fq.gz do base=$(basename $i "_R1_val_1.fq.gz")

    bismark -q --parallel 8  -B ${base} \
    --genome /DataDrive/juaguila/Bvos-trimmed/BM-lot/ \
     -1 ${base}_R1_val_1.fq.gz -2 ${base}_R2_val_2.fq.gz 

    deduplicate_bismark --bam ${base}.bam

    bismark_methylation_extractor -p --gzip --genome_folder /DataDrive/juaguila/Bvos-trimmed/BM-lot/ \
    --cytosine_report --multicore 8  ${base}.deduplicated.bam

    done

`

Which I run this: nohup sh BS2.sh > test.bismark.log

The genome was created in the genome directory.

The log is as follows:

Bowtie 2 seems to be working fine (tested command 'bowtie2 --version' [2.4.1]) Output format is BAM (default) Alignments will be written out in BAM format. Samtools found here: '/home/juaguila/miniconda3/envs/bismark/bin/samtools' Reference genome folder provided is /DataDrive/juaguila/Bvos-trimmed/BM-lot/ (absolute path is '/DataDrive/juaguila/Bvos-trimmed/BM-lot/)' FastQ format specified

Input files to be analysed (in current folder '/DataDrive/juaguila/Bvos-trimmed/BM-lot'): V00001_R1_val_1.fq.gz V00001_R2_val_2.fq.gz Library is assumed to be strand-specific (directional), alignments to strands complementary to the original top or bottom strands will be ignored (i.e. not performed!) Specifying --basename in conjuction with --multicore is currently not supported (but we are aiming to fix this soon). Please lose either --basename or --multicore to proceed

Neither -s (single-end) nor -p (paired-end) selected for deduplication. Trying to extract this information for each file separately from the @PG line of the SAM/BAM file Processing single-end Bismark output file(s) (SAM format): V00001.bam

If there are several alignments to a single position in the genome the first alignment will be chosen. Since the input files are not in any way sorted this is a near-enough random selection of reads.

Checking file >>V00001.bam<< for signs of file truncation... Captured error message: '[E::hts_open_format] Failed to open file V00001.bam'

[ERROR] The file appears to be truncated, please ensure that there were no errors while copying the file!!! Exiting...

I would appreciate if you could tell me what parameter I am missing

FelixKrueger commented 9 months ago

The error here is this:

Specifying --basename in conjuction with --multicore is currently not supported (but we are aiming to fix this soon). Please lose either --basename or --multicore to proceed

So the Bismark could simply be:

bismark --parallel 8 --genome /DataDrive/juaguila/Bvos-trimmed/BM-lot/ -1 ${base}_R1_val_1.fq.gz -2 ${base}_R2_val_2.fq.gz 

This is fine:

deduplicate_bismark ${base}.bam

The following command will need --bedGraph I believe:

bismark_methylation_extractor --gzip --bedGraph --genome_folder /DataDrive/juaguila/Bvos-trimmed/BM-lot/ --cytosine_report --multicore 8  ${base}.deduplicated.bam
desmodus1984 commented 9 months ago

Thanks, I ran the code and I made one change, in addition to eliminating the base (-B) parameter, I changed multicore to parallel. Could you explain what is the difference? I thought they are the same, but maybe I am wrong.

Here is the code

for i in *_R1_val_1.fq.gz
        do
        base=$(basename $i "_R1_val_1.fq.gz")
        bismark -q --parallel 8 \
        --genome /home/juaguila/Bismark/ \
         -1 ${base}_R1_val_1.fq.gz -2 ${base}_R2_val_2.fq.gz
        deduplicate_bismark --bam ${base}.bam
        bismark_methylation_extractor -p --gzip --genome_folder /DataDrive/juaguila/Bvos-trimmed/BM-lot/ \
        --cytosine_report --multicore 8  ${base}.deduplicated.bam
        done

I have a question too. I thought that the basename (${base}) would work well, and I would have a file called whatever is before *_R1_val_1.fq.gz like V00001.bam, as it did when I ran bs-seeker and cgmaptools. Contrarily, iInstead it created files with the entire XXXX_R1_val_1.fq.gz suffix

[juaguila@v001 BSMK-1]$ cd ../BSMK-0/ [juaguila@v001 BSMK-0]$ ls -ltr total 103335076 -rw-r--r-- 1 juaguila hpc_jfierst 3999335475 Jun 8 2023 V00001_R2_val_2.fq.gz -rw-r--r-- 1 juaguila hpc_jfierst 3675649845 Jun 8 2023 V00001_R1_val_1.fq.gz -rw-r--r-- 1 juaguila hpc_jfierst 3436679091 Jun 8 2023 V00021_R2_val_2.fq.gz -rw-r--r-- 1 juaguila hpc_jfierst 3104862452 Jun 8 2023 V00021_R1_val_1.fq.gz -rw-r--r-- 1 juaguila hpc_jfierst 3079178155 Jun 8 2023 V00294_R2_val_2.fq.gz -rw-r--r-- 1 juaguila hpc_jfierst 2823807574 Jun 8 2023 V00294_R1_val_1.fq.gz -rw-r--r-- 1 juaguila hpc_jfierst 457 Mar 3 21:49 BSMK-test.sh -rw-r--r-- 1 juaguila hpc_jfierst 7654589585 Mar 4 06:55 V00001_R1_val_1_bismark_bt2_pe.bam -rw-r--r-- 1 juaguila hpc_jfierst 1867 Mar 4 06:55 V00001_R1_val_1_bismark_bt2_PE_report.txt -rw-r--r-- 1 juaguila hpc_jfierst 6195867738 Mar 4 11:14 V00021_R1_val_1_bismark_bt2_pe.bam -rw-r--r-- 1 juaguila hpc_jfierst 1868 Mar 4 11:14 V00021_R1_val_1_bismark_bt2_PE_report.txt

Did I write something wrong? I thought that the code was specifying just the XXX part; that is why I used the -B base parameter to specify the file name, and then create the loop/sequence for the other codes. I thought that the alignment would write a bam file called XXX.bam and not a XXX_R1_val_1.fq.gz file.

FelixKrueger commented 8 months ago

The options --parallel/--multicore are identical.

I believe for the output files, only the ending is stripped off (e.g. .fq.gz). You can always rename files afterwards so they are to your liking, e.g.

rename s/_R1_val_1_bismark_/_/ *

The file XXX_R1_val_1.fq.gz is the input file (and is not created by Bismark...).

desmodus1984 commented 8 months ago

Good evening,

I followed your code for the methylation extractor, and I didn't get the bedgraph ouput nor the genome-wide methylation report files, which I require to do the detection of differentially methylated regions.

The code was:

for i in *_R1_val_1.fq.gz
        do
        base=$(basename $i "_R1_val_1.fq.gz")

        bismark -q --parallel 10 \
        --genome /home/juaguila/Bismark/ \
         -1 ${base}_R1_val_1.fq.gz -2 ${base}_R2_val_2.fq.gz

        deduplicate_bismark -p --bam ${base}_R1_val_1_bismark_bt2_pe.bam -o ${base}

        bismark_methylation_extractor -p --gzip --bedGraph --buffer_size 20G --genome_folder /home/juaguila/Bismark/ \
        --cytosine_report --parallel 10  ${base}.deduplicated.bam

done

The log tells this error:

Methylation information will now be written into a bedGraph and coverage file

Using the following files as Input: /home/juaguila/BombusMethylSeq/Rec-5/BSMK-2/CpG_OT_V00747.deduplicated.txt.gz /home/juaguila/BombusMethylSeq/Rec-5/BSMK-2/CpG_OB_V00747.deduplicated.txt.gz

Writing bedGraph to file: V00747.deduplicated.bedGraph.gz Also writing out a coverage file including counts methylated and unmethylated residues to file: V00747.deduplicated.bismark.cov.gz

Now writing methylation information for file >>CpG_OT_V00747.deduplicated.txt.gz<< to individual files for each chromosome Failed to open filehandle: Too many open files at /home/juaguila/.conda/envs/bismark/bin/bismark2bedGraph line 279, line 276304. Finished BedGraph conversion ...


And for the cytosine report.

Methylation information will now be written into a genome-wide cytosine report

Adding context-specific methylation summaries

Writing genome-wide cytosine report to: V00314.deduplicated.CpG_report.txt.gz <<<

Writing all cytosine context summary file to: V00314.deduplicated.cytosine_context_summary.txt <<<

No last chromosome was defined, something must have gone wrong while reading the data in (e.g. specified wrong file path for a gzipped coverage file?). Please check your command!


I also wanted to inquire about this error. While I looked at the log file, I saw this very often: Processed 3000000 sequence pairs so far Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2313:3495:29340_1:N:0:GTCGTTAC+GCTTAGCT NW_022883904.1 2 Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2265:16071:7435_1:N:0:GTCGTTAC+GCTTAGCT NW_022883908.1 29954 Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2214:25635:36370_1:N:0:GTCGTTAC+GCTTAGCT NW_022883462.1 19111 Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2233:28782:9596_1:N:0:GTCGTTAC+GCTTAGCT NW_022884241.1 19018 Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2238:29441:32095_1:N:0:GTCGTTAC+GCTTAGCT NW_022883649.1 13623 Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2305:16034:34053_1:N:0:GTCGTTAC+GCTTAGCT NW_022883202.1 20994 Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2234:16080:20760_1:N:0:GTCGTTAC+GCTTAGCT NW_022883657.1 29689 Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2234:16098:20823_1:N:0:GTCGTTAC+GCTTAGCT NW_022883657.1 29689 Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2307:14525:10019_1:N:0:GTCGTTAC+GCTTAGCT NW_022883927.1 29210

Is this something that I have to worry about? I indexed the reference genome the standard way, my methylation reads are PE-150.

Could you please tell what is wrong with my code that it did not generate the output files?

Lastly, while reading the log I found this line about directional library in the cytosine report:


Finished generating genome-wide cytosine report

Bowtie 2 seems to be working fine (tested command 'bowtie2 --version' [2.5.3]) Output format is BAM (default) Alignments will be written out in BAM format. Samtools found here: '/home/juaguila/.conda/envs/bismark/bin/samtools' Reference genome folder provided is /home/juaguila/Bismark/ (absolute path is '/home/juaguila/Bismark/)' FastQ format specified

Input files to be analysed (in current folder '/home/juaguila/BombusMethylSeq/Rec-5/BSMK-2'): V00750_R1_val_1.fq.gz V00750_R2_val_2.fq.gz Library is assumed to be strand-specific (directional), alignments to strands complementary to the original top or bottom strands will be ignored (i.e. not performed!)

The library is PE-150. At first I was confused because it was done RNA-seq, which the library itself was directional, but then I realized that in this case, it is not. Is there an argument to state the library as non-directional, or am I misunderstanding the log?

Thank you.

FelixKrueger commented 8 months ago

You need to use the option --scaffolds form the methylation extractor as you seem to have a genom with too many contigsscaffolds.

desmodus1984 commented 8 months ago

Can I rerun the methylation extractor code or should I eliminate any of the previous files (X.methXtractor.temp), or better move the deduplicated files to a new directory?

Thanks

FelixKrueger commented 8 months ago

I would probably remove the .temp files, but it might work even leaving them there...

desmodus1984 commented 8 months ago

I ran the methylator and it worked well now. By curiosity, I checked the MBias files, which I expected to be "good" or "fine" as the ones from methyldackel and I got surprised that they weren't as flat as those from methyldackel. The libraries were made with the enzymatic-methyl-kit from NEB. The R1 look okay, but there is a huge drop at the end, while the R2 have the drop in methylation level, until, similarly, reaching the end of the read, and the huge drop out. V00001 deduplicated M-bias_R2 V00001 deduplicated M-bias_R1 V00315 deduplicated M-bias_R2 V00315 deduplicated M-bias_R1

Do you think that I should add any extra parameters to the methylation extractor? Do you recommend trimming bases from the 3'-end? R1 seems to need more trimming than the R2.

Thanks and sorry for the many questions.

FelixKrueger commented 8 months ago

A couple of comments: