desmodus1984 commented 4 months ago

Hi,

I am studying differential methylation in bumblebees, and after using methylKit to find differentially methylated sites it seems better to find differentially methylated regions. I wanted to find DMR using the software swDMR (unless you suggest a different/better one) and it needs as input the cytosine report with strand and counts.

I am contacting you because for methylKit I have the bam files for methylation calling but I did pre-process them in a different way, not with Bismark. I trimmed the reads based on the suggested EM-seq parameters, and I checked for methylation-bias with Methyldackel with regular trimming parameters. After confirming best trimming parameters, I aligned the trimmed reads with bwa-meth, sam-to-bam conversion, and deduplication/elimination with Picard.

I wanted to ask if I can use the bam files that I have already generated to make the cytosine report or if I should re-processed my files all over again with Bismark.

Thank you again.

FelixKrueger commented 4 months ago

I am afraid I don't think the BAM files are compatible :(, and you would indeed have to re-start from the alignment step...

desmodus1984 commented 4 months ago

Good morning,

I installed Bismark as mentioned. When I tried doing the genome indexing it failed

home/juaguila/appz/Bismark-0.24.2/bismark_genome_preparation ../Bismark/ Writing bisulfite genomes out into a single MFA (multi FastA) file

Bisulfite Genome Indexer version v0.24.2 (last modified: 19 May 2022)

Step I - Prepare genome folders - completed

Total number of conversions performed: C->T: 52215405 G->A: 52297904

Step II - Genome bisulfite conversions - completed

Bismark Genome Preparation - Step III: Launching the Bowtie 2 indexer Please be aware that this process can - depending on genome size - take several hours! /home/applications/bowtie2/2.5.3/bowtie2-build-s: /lib64/libstdc++.so.6: version GLIBCXX_3.4.32' not found (required by /home/applications/bowtie2/2.5.3/bowtie2-build-s) /home/applications/bowtie2/2.5.3/bowtie2-build-s: /lib64/libstdc++.so.6: versionGLIBCXX_3.4.20' not found (required by /home/applications/bowtie2/2.5.3/bowtie2-build-s) /home/applications/bowtie2/2.5.3/bowtie2-build-s: /lib64/libstdc++.so.6: version CXXABI_1.3.8' not found (required by /home/applications/bowtie2/2.5.3/bowtie2-build-s) /home/applications/bowtie2/2.5.3/bowtie2-build-s: /lib64/libstdc++.so.6: versionGLIBCXX_3.4.26' not found (required by /home/applications/bowtie2/2.5.3/bowtie2-build-s) /home/applications/bowtie2/2.5.3/bowtie2-build-s: /lib64/libstdc++.so.6: version GLIBCXX_3.4.30' not found (required by /home/applications/bowtie2/2.5.3/bowtie2-build-s) /home/applications/bowtie2/2.5.3/bowtie2-build-s: /lib64/libstdc++.so.6: versionGLIBCXX_3.4.22' not found (required by /home/applications/bowtie2/2.5.3/bowtie2-build-s) /home/applications/bowtie2/2.5.3/bowtie2-build-s: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/applications/bowtie2/2.5.3/bowtie2-build-s) Traceback (most recent call last): File "/home/applications/bowtie2/2.5.3/bowtie2-build", line 134, in main() File "/home/applications/bowtie2/2.5.3/bowtie2-build", line 68, in main if sais_enabled(build_bin_spec): File "/home/applications/bowtie2/2.5.3/bowtie2-build", line 42, in sais_enabled output = subprocess.check_output([build_bin_spec, "--version"], universal_newlines=True) File "/usr/lib64/python3.6/subprocess.pyhttp://subprocess.py/", line 356, in check_output **kwargs).stdout File "/usr/lib64/python3.6/subprocess.pyhttp://subprocess.py/", line 438, in run output=stdout, stderr=stderr) subprocess.CalledProcessError: Command '['/home/applications/bowtie2/2.5.3/bowtie2-build-s', '--version']' returned non-zero exit status 1. Parent process: Failed to build index

One option that I am doubtful is that bowtie2 is installed in the HPC as a module and I loaded. Do I have to figure out the path of bowtie2 or it is obtained automatically?

Thanks,

Juan Pablo Aguilar

From: Felix Krueger @.> Sent: Tuesday, February 20, 2024 5:12:21 AM To: FelixKrueger/Bismark @.> Cc: Aguilar Cabezas, Juan Pablo @.>; Author @.> Subject: [External] Re: [FelixKrueger/Bismark] Generating cytosine report (Issue #655)

Use caution with links and attachments.

I am afraid I don't think the BAM files are compatible :(, and you would indeed have to re-start from the alignment step...

— Reply to this email directly, view it on GitHubhttps://github.com/FelixKrueger/Bismark/issues/655#issuecomment-1953881646, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AJWD2VJMA5CRM7P4JVOH74DYURZILAVCNFSM6AAAAABDP7LTP2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNJTHA4DCNRUGY. You are receiving this because you authored the thread.Message ID: @.***>

FelixKrueger commented 4 months ago

Hmm, this does indeed look like a Bowtie2 installation issue. If you don't specify the path yourself it will look in the PATH environment variable. It seems to use it from /home/applications/bowtie2/2.5.3/bowtie2-build.

You can find out the path with the command which bowtie2.

When you run /home/applications/bowtie2/2.5.3/bowtie2-build --help you should be able to see the help text. Otherwise, I would recommend you reintall Bowtie2 (or install Bismark using conda install bismark, which will also satisfy all other requirements)

desmodus1984 commented 4 months ago

I did actually think about that but the version in conda is 0.22 while the last one is 0.24.

Since I was told to best use the latest version due to bugs and updates.

From: Felix Krueger @.> Sent: Wednesday, February 21, 2024 10:20:32 AM To: FelixKrueger/Bismark @.> Cc: Aguilar Cabezas, Juan Pablo @.>; Author @.> Subject: [External] Re: [FelixKrueger/Bismark] Generating cytosine report (Issue #655)

Use caution with links and attachments.

Hmm, this does indeed look like a Bowtie2 installation issue. If you don't specify the path yourself it will look in the PATH environment variable. It seems to use it from /home/applications/bowtie2/2.5.3/bowtie2-build.

You can find out the path with the command which bowtie2.

When you run /home/applications/bowtie2/2.5.3/bowtie2-build --help you should be able to see the help text. Otherwise, I would recommend you reintall Bowtie2 (or install Bismark using conda install bismark, which will also satisfy all other requirements)

— Reply to this email directly, view it on GitHubhttps://github.com/FelixKrueger/Bismark/issues/655#issuecomment-1956921278, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AJWD2VIZAZWLUMOZO6BR2HTYUYGEBAVCNFSM6AAAAABDP7LTP2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNJWHEZDCMRXHA. You are receiving this because you authored the thread.Message ID: @.***>

FelixKrueger commented 4 months ago

The latest version on bioconda is 0.24.2, the instructions are here: https://bioconda.github.io/recipes/bismark/README.html

desmodus1984 commented 4 months ago

Hi,

I Installed Bismark in conda, and I am trying to run the workflow and I would appreciate if you could check my code, and tell me what I am writing wrong or missing.

I loaded the environment, and I am running the code in the folder, as an .sh file, and it is not creating any bam output. Here is my code which worked with bs_seeker2.

I start with a loop to grab the names of the files, but then alignment doesn't happen.

`for i in *_R1_val_1.fq.gz do base=$(basename $i "_R1_val_1.fq.gz")

    bismark -q --parallel 8  -B ${base} \
    --genome /DataDrive/juaguila/Bvos-trimmed/BM-lot/ \
     -1 ${base}_R1_val_1.fq.gz -2 ${base}_R2_val_2.fq.gz 

    deduplicate_bismark --bam ${base}.bam

    bismark_methylation_extractor -p --gzip --genome_folder /DataDrive/juaguila/Bvos-trimmed/BM-lot/ \
    --cytosine_report --multicore 8  ${base}.deduplicated.bam

    done

`

Which I run this: nohup sh BS2.sh > test.bismark.log

The genome was created in the genome directory.

The log is as follows:

Bowtie 2 seems to be working fine (tested command 'bowtie2 --version' [2.4.1]) Output format is BAM (default) Alignments will be written out in BAM format. Samtools found here: '/home/juaguila/miniconda3/envs/bismark/bin/samtools' Reference genome folder provided is /DataDrive/juaguila/Bvos-trimmed/BM-lot/ (absolute path is '/DataDrive/juaguila/Bvos-trimmed/BM-lot/)' FastQ format specified

Input files to be analysed (in current folder '/DataDrive/juaguila/Bvos-trimmed/BM-lot'): V00001_R1_val_1.fq.gz V00001_R2_val_2.fq.gz Library is assumed to be strand-specific (directional), alignments to strands complementary to the original top or bottom strands will be ignored (i.e. not performed!) Specifying --basename in conjuction with --multicore is currently not supported (but we are aiming to fix this soon). Please lose either --basename or --multicore to proceed

Neither -s (single-end) nor -p (paired-end) selected for deduplication. Trying to extract this information for each file separately from the @PG line of the SAM/BAM file Processing single-end Bismark output file(s) (SAM format): V00001.bam

If there are several alignments to a single position in the genome the first alignment will be chosen. Since the input files are not in any way sorted this is a near-enough random selection of reads.

Checking file >>V00001.bam<< for signs of file truncation... Captured error message: '[E::hts_open_format] Failed to open file V00001.bam'

[ERROR] The file appears to be truncated, please ensure that there were no errors while copying the file!!! Exiting...

I would appreciate if you could tell me what parameter I am missing

FelixKrueger commented 4 months ago

The error here is this:

Specifying --basename in conjuction with --multicore is currently not supported (but we are aiming to fix this soon). Please lose either --basename or --multicore to proceed

So the Bismark could simply be:

bismark --parallel 8 --genome /DataDrive/juaguila/Bvos-trimmed/BM-lot/ -1 ${base}_R1_val_1.fq.gz -2 ${base}_R2_val_2.fq.gz

This is fine:

deduplicate_bismark ${base}.bam

The following command will need --bedGraph I believe:

bismark_methylation_extractor --gzip --bedGraph --genome_folder /DataDrive/juaguila/Bvos-trimmed/BM-lot/ --cytosine_report --multicore 8  ${base}.deduplicated.bam

desmodus1984 commented 4 months ago

Thanks, I ran the code and I made one change, in addition to eliminating the base (-B) parameter, I changed multicore to parallel. Could you explain what is the difference? I thought they are the same, but maybe I am wrong.

Here is the code

for i in *_R1_val_1.fq.gz
        do
        base=$(basename $i "_R1_val_1.fq.gz")
        bismark -q --parallel 8 \
        --genome /home/juaguila/Bismark/ \
         -1 ${base}_R1_val_1.fq.gz -2 ${base}_R2_val_2.fq.gz
        deduplicate_bismark --bam ${base}.bam
        bismark_methylation_extractor -p --gzip --genome_folder /DataDrive/juaguila/Bvos-trimmed/BM-lot/ \
        --cytosine_report --multicore 8  ${base}.deduplicated.bam
        done

I have a question too. I thought that the basename (${base}) would work well, and I would have a file called whatever is before *_R1_val_1.fq.gz like V00001.bam, as it did when I ran bs-seeker and cgmaptools. Contrarily, iInstead it created files with the entire XXXX_R1_val_1.fq.gz suffix

[juaguila@v001 BSMK-1]$ cd ../BSMK-0/ [juaguila@v001 BSMK-0]$ ls -ltr total 103335076 -rw-r--r-- 1 juaguila hpc_jfierst 3999335475 Jun 8 2023 V00001_R2_val_2.fq.gz -rw-r--r-- 1 juaguila hpc_jfierst 3675649845 Jun 8 2023 V00001_R1_val_1.fq.gz -rw-r--r-- 1 juaguila hpc_jfierst 3436679091 Jun 8 2023 V00021_R2_val_2.fq.gz -rw-r--r-- 1 juaguila hpc_jfierst 3104862452 Jun 8 2023 V00021_R1_val_1.fq.gz -rw-r--r-- 1 juaguila hpc_jfierst 3079178155 Jun 8 2023 V00294_R2_val_2.fq.gz -rw-r--r-- 1 juaguila hpc_jfierst 2823807574 Jun 8 2023 V00294_R1_val_1.fq.gz -rw-r--r-- 1 juaguila hpc_jfierst 457 Mar 3 21:49 BSMK-test.sh -rw-r--r-- 1 juaguila hpc_jfierst 7654589585 Mar 4 06:55 V00001_R1_val_1_bismark_bt2_pe.bam -rw-r--r-- 1 juaguila hpc_jfierst 1867 Mar 4 06:55 V00001_R1_val_1_bismark_bt2_PE_report.txt -rw-r--r-- 1 juaguila hpc_jfierst 6195867738 Mar 4 11:14 V00021_R1_val_1_bismark_bt2_pe.bam -rw-r--r-- 1 juaguila hpc_jfierst 1868 Mar 4 11:14 V00021_R1_val_1_bismark_bt2_PE_report.txt

Did I write something wrong? I thought that the code was specifying just the XXX part; that is why I used the -B base parameter to specify the file name, and then create the loop/sequence for the other codes. I thought that the alignment would write a bam file called XXX.bam and not a XXX_R1_val_1.fq.gz file.

FelixKrueger commented 4 months ago

The options --parallel/--multicore are identical.

I believe for the output files, only the ending is stripped off (e.g. .fq.gz). You can always rename files afterwards so they are to your liking, e.g.

rename s/_R1_val_1_bismark_/_/ *

The file XXX_R1_val_1.fq.gz is the input file (and is not created by Bismark...).

desmodus1984 commented 3 months ago

Good evening,

I followed your code for the methylation extractor, and I didn't get the bedgraph ouput nor the genome-wide methylation report files, which I require to do the detection of differentially methylated regions.

The code was:

for i in *_R1_val_1.fq.gz
        do
        base=$(basename $i "_R1_val_1.fq.gz")

        bismark -q --parallel 10 \
        --genome /home/juaguila/Bismark/ \
         -1 ${base}_R1_val_1.fq.gz -2 ${base}_R2_val_2.fq.gz

        deduplicate_bismark -p --bam ${base}_R1_val_1_bismark_bt2_pe.bam -o ${base}

        bismark_methylation_extractor -p --gzip --bedGraph --buffer_size 20G --genome_folder /home/juaguila/Bismark/ \
        --cytosine_report --parallel 10  ${base}.deduplicated.bam

done

The log tells this error:

Methylation information will now be written into a bedGraph and coverage file

Using the following files as Input: /home/juaguila/BombusMethylSeq/Rec-5/BSMK-2/CpG_OT_V00747.deduplicated.txt.gz /home/juaguila/BombusMethylSeq/Rec-5/BSMK-2/CpG_OB_V00747.deduplicated.txt.gz

Writing bedGraph to file: V00747.deduplicated.bedGraph.gz Also writing out a coverage file including counts methylated and unmethylated residues to file: V00747.deduplicated.bismark.cov.gz

Now writing methylation information for file >>CpG_OT_V00747.deduplicated.txt.gz<< to individual files for each chromosome Failed to open filehandle: Too many open files at /home/juaguila/.conda/envs/bismark/bin/bismark2bedGraph line 279, line 276304. Finished BedGraph conversion ...

And for the cytosine report.

Methylation information will now be written into a genome-wide cytosine report

Adding context-specific methylation summaries

Writing genome-wide cytosine report to: V00314.deduplicated.CpG_report.txt.gz <<<

Writing all cytosine context summary file to: V00314.deduplicated.cytosine_context_summary.txt <<<

No last chromosome was defined, something must have gone wrong while reading the data in (e.g. specified wrong file path for a gzipped coverage file?). Please check your command!

I also wanted to inquire about this error. While I looked at the log file, I saw this very often: Processed 3000000 sequence pairs so far Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2313:3495:29340_1:N:0:GTCGTTAC+GCTTAGCT NW_022883904.1 2 Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2265:16071:7435_1:N:0:GTCGTTAC+GCTTAGCT NW_022883908.1 29954 Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2214:25635:36370_1:N:0:GTCGTTAC+GCTTAGCT NW_022883462.1 19111 Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2233:28782:9596_1:N:0:GTCGTTAC+GCTTAGCT NW_022884241.1 19018 Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2238:29441:32095_1:N:0:GTCGTTAC+GCTTAGCT NW_022883649.1 13623 Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2305:16034:34053_1:N:0:GTCGTTAC+GCTTAGCT NW_022883202.1 20994 Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2234:16080:20760_1:N:0:GTCGTTAC+GCTTAGCT NW_022883657.1 29689 Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2234:16098:20823_1:N:0:GTCGTTAC+GCTTAGCT NW_022883657.1 29689 Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2307:14525:10019_1:N:0:GTCGTTAC+GCTTAGCT NW_022883927.1 29210

Is this something that I have to worry about? I indexed the reference genome the standard way, my methylation reads are PE-150.

Could you please tell what is wrong with my code that it did not generate the output files?

Lastly, while reading the log I found this line about directional library in the cytosine report:

Finished generating genome-wide cytosine report

Bowtie 2 seems to be working fine (tested command 'bowtie2 --version' [2.5.3]) Output format is BAM (default) Alignments will be written out in BAM format. Samtools found here: '/home/juaguila/.conda/envs/bismark/bin/samtools' Reference genome folder provided is /home/juaguila/Bismark/ (absolute path is '/home/juaguila/Bismark/)' FastQ format specified

Input files to be analysed (in current folder '/home/juaguila/BombusMethylSeq/Rec-5/BSMK-2'): V00750_R1_val_1.fq.gz V00750_R2_val_2.fq.gz Library is assumed to be strand-specific (directional), alignments to strands complementary to the original top or bottom strands will be ignored (i.e. not performed!)

The library is PE-150. At first I was confused because it was done RNA-seq, which the library itself was directional, but then I realized that in this case, it is not. Is there an argument to state the library as non-directional, or am I misunderstanding the log?

Thank you.

FelixKrueger commented 3 months ago

You need to use the option --scaffolds form the methylation extractor as you seem to have a genom with too many contigsscaffolds.

desmodus1984 commented 3 months ago

Can I rerun the methylation extractor code or should I eliminate any of the previous files (X.methXtractor.temp), or better move the deduplicated files to a new directory?

Thanks

FelixKrueger commented 3 months ago

I would probably remove the .temp files, but it might work even leaving them there...

desmodus1984 commented 3 months ago

I ran the methylator and it worked well now. By curiosity, I checked the MBias files, which I expected to be "good" or "fine" as the ones from methyldackel and I got surprised that they weren't as flat as those from methyldackel. The libraries were made with the enzymatic-methyl-kit from NEB. The R1 look okay, but there is a huge drop at the end, while the R2 have the drop in methylation level, until, similarly, reaching the end of the read, and the huge drop out. V00001 deduplicated M-bias_R2 V00001 deduplicated M-bias_R1 V00315 deduplicated M-bias_R2 V00315 deduplicated M-bias_R1

Do you think that I should add any extra parameters to the methylation extractor? Do you recommend trimming bases from the 3'-end? R1 seems to need more trimming than the R2.

Thanks and sorry for the many questions.

FelixKrueger commented 3 months ago

A couple of comments:

it might be easier to run bismark2report once the run is complete, as this give you an interactive HTML version of all reports, including the M-bias plots
as far as I can see, almost all reads have <1% of total methylation in any context, is that correct? This is super constant across all reads, and there are neither spikes nor drop-offs
what you are probably referring to is the number of calls per position for each context. For R1, this is fairly constant throughout. For R2 you see a steady decline for calls in all contexts, the reason for which is the overlap detection and removal (as soon as R2 starts overlapping R1, the extraction stops and proceeds to the next read
steep 'drops' in the number of calls at the very end are typically caused by adapter clipping and/or quality trimming. Both phenomena are expected, and there is nothing you need to do about it.

FelixKrueger / Bismark

Generating cytosine report #655

Methylation information will now be written into a bedGraph and coverage file

Methylation information will now be written into a genome-wide cytosine report