Closed mo716 closed 3 years ago
Hi @mo716 ,
that's quite a lot to digest... Do you think it would be possible to send me one of the C* context calls so I can run a test over here? If so, can you just send me an email (felix.krueger@babraham.ac.uk) so I can send you some FTP details (and don't worry, I won't publish your data before you had a chance :P).
Hi @FelixKrueger ! Sure! I will write you immediately :) And thank you for not publishing my data too :D
Hi @FelixKrueger !
Sorry for the late reply. I followed the suggestions you gave me (to merge the C*txt at the bed2graph step, since for some reason no new lines were being generated within parts of the files before). I actually repeated all the pipeline again (aligning, deduplication, methylation extraction) separately for each library before merging them at the bed2graph step (Note that the --gzip option would return the error "Unknown option: gzip Please respecify command line options" in bed2Graph so I did not use that).
The command I ran was:
bismark2bedGraph --dir /PATH/TO/TEMP/ -o G17.BBCH65 --cutoff 3 --CX --buffer_size 20G --scaffolds /PATH/TO/TEMP/C*.deduplicated.txt
Unfortunately, I still got the error show here at the end:
Summary of parameters for bismark2bedGraph conversion:
bedGraph output: G17.BBCH65.gz output directory: >/PATH/TO/TEMP/< remove whitespaces: no CX context: yes No-header selected: no Sorting method: Unix sort-based (smaller memory footprint, but slower) Sort buffer size: 20G Coverage threshold: 3
Methylation information will now be written into a bedGraph and coverage file
Using the following files as Input: Writing bedGraph to file: G17.BBCH65.gz Also writing out a coverage file including counts methylated and unmethylated residues to file: G17.BBCH65.gz.bismark.cov.gz
Changed directory to /PATH/TO/TEMP/ The genome of interest was specified to contain gazillions of chromosomes or scaffolds. Merging all input files and sorting everything in memory instead of writing out individual chromosome files... Writing all merged methylation calls to temp file G17.BBCH65.gz.methylation_calls.merged
Finished writing methylation calls from CHG_OB_G17_BBCH65_1.deduplicated.txt to merged temp file Finished writing methylation calls from CHG_OB_G17_BBCH65_2.deduplicated.txt to merged temp file Finished writing methylation calls from CHG_OB_G17_BBCH65_3.deduplicated.txt to merged temp file Finished writing methylation calls from CHG_OB_G17_BBCH65_4.deduplicated.txt to merged temp file Finished writing methylation calls from CHG_OB_G17_BBCH65_5.deduplicated.txt to merged temp file Finished writing methylation calls from CHG_OB_G17_BBCH65_6.deduplicated.txt to merged temp file Finished writing methylation calls from CHG_OT_G17_BBCH65_1.deduplicated.txt to merged temp file Finished writing methylation calls from CHG_OT_G17_BBCH65_2.deduplicated.txt to merged temp file Finished writing methylation calls from CHG_OT_G17_BBCH65_3.deduplicated.txt to merged temp file Finished writing methylation calls from CHG_OT_G17_BBCH65_4.deduplicated.txt to merged temp file Finished writing methylation calls from CHG_OT_G17_BBCH65_5.deduplicated.txt to merged temp file Finished writing methylation calls from CHG_OT_G17_BBCH65_6.deduplicated.txt to merged temp file Finished writing methylation calls from CHH_OB_G17_BBCH65_1.deduplicated.txt to merged temp file Finished writing methylation calls from CHH_OB_G17_BBCH65_2.deduplicated.txt to merged temp file Finished writing methylation calls from CHH_OB_G17_BBCH65_3.deduplicated.txt to merged temp file Finished writing methylation calls from CHH_OB_G17_BBCH65_4.deduplicated.txt to merged temp file Finished writing methylation calls from CHH_OB_G17_BBCH65_5.deduplicated.txt to merged temp file Finished writing methylation calls from CHH_OB_G17_BBCH65_6.deduplicated.txt to merged temp file Finished writing methylation calls from CHH_OT_G17_BBCH65_1.deduplicated.txt to merged temp file Finished writing methylation calls from CHH_OT_G17_BBCH65_2.deduplicated.txt to merged temp file Finished writing methylation calls from CHH_OT_G17_BBCH65_3.deduplicated.txt to merged temp file Finished writing methylation calls from CHH_OT_G17_BBCH65_4.deduplicated.txt to merged temp file Finished writing methylation calls from CHH_OT_G17_BBCH65_5.deduplicated.txt to merged temp file Finished writing methylation calls from CHH_OT_G17_BBCH65_6.deduplicated.txt to merged temp file Finished writing methylation calls from CpG_OB_G17_BBCH65_1.deduplicated.txt to merged temp file Finished writing methylation calls from CpG_OB_G17_BBCH65_2.deduplicated.txt to merged temp file Finished writing methylation calls from CpG_OB_G17_BBCH65_3.deduplicated.txt to merged temp file Finished writing methylation calls from CpG_OB_G17_BBCH65_4.deduplicated.txt to merged temp file Finished writing methylation calls from CpG_OB_G17_BBCH65_5.deduplicated.txt to merged temp file Finished writing methylation calls from CpG_OB_G17_BBCH65_6.deduplicated.txt to merged temp file Finished writing methylation calls from CpG_OT_G17_BBCH65_1.deduplicated.txt to merged temp file Finished writing methylation calls from CpG_OT_G17_BBCH65_2.deduplicated.txt to merged temp file Finished writing methylation calls from CpG_OT_G17_BBCH65_3.deduplicated.txt to merged temp file Finished writing methylation calls from CpG_OT_G17_BBCH65_4.deduplicated.txt to merged temp file Finished writing methylation calls from CpG_OT_G17_BBCH65_5.deduplicated.txt to merged temp file Finished writing methylation calls from CpG_OT_G17_BBCH65_6.deduplicated.txt to merged temp file Sorting input file G17.BBCH65.gz.methylation_calls.merged by positions (using -S of 20G) Use of uninitialized value $pos in string ne at ~/programs/bismark/Bismark-0.23.0/bismark2bedGraph line 461, <$ifh> line 1. Missing alphabetical methylation call at ~/programs/bismark/Bismark-0.23.0/bismark2bedGraph line 562, <$ifh> line 1. main::validate_methylation_call("h", undef) called at ~/programs/bismark/Bismark-0.23.0/bismark2bedGraph line 466
Then, I ran the bed2Graph function without merging in order to check whether all libraries indeed produced the same error. It turned out that 5 out 6 libraries did produced the same error (and hence, they were all empty).
20 May 7 16:44 G17_1.BBCH65.gz.bismark.cov.gz 40 May 7 16:44 G17_1.BBCH65.gz 40 May 7 16:45 G17_6.BBCH65.gz 20 May 7 16:45 G17_6.BBCH65.gz.bismark.cov.gz 20 May 7 16:46 G17_2.BBCH65.gz.bismark.cov.gz 40 May 7 16:46 G17_2.BBCH65.gz 9.5M May 7 16:54 G17_5.BBCH65.gz.bismark.cov.gz 12M May 7 16:54 G17_5.BBCH65.gz 20 May 7 22:01 G17_4.BBCH65.gz.bismark.cov.gz 40 May 7 22:01 G17_4.BBCH65.gz 20 May 7 22:21 G17_3.BBCH65.gz.bismark.cov.gz 40 May 7 22:21 G17_3.BBCH65.gz
Finally, I re-attempted merging at the bed2Graph step while increasing the buffer size to 100 G (also indicating it as -l virtual_free=100G when submitting to the cluster). But still the error would come out:
Sorting input file G17.BBCH65_100G.gz.methylation_calls.merged by positions (using -S of 100G) Use of uninitialized value $pos in string ne at ~/programs/bismark/Bismark-0.23.0/bismark2bedGraph line 461, <$ifh> line 1. Missing alphabetical methylation call at ~/programs/bismark/Bismark-0.23.0/bismark2bedGraph line 562, <$ifh> line 1. main::validate_methylation_call("h", undef) called at ~/programs/bismark/Bismark-0.23.0/bismark2bedGraph line 466
My guesses: Could this problem be linked with the reads having a nomenclature that is problematic within the bed2Graph script? Could this be a memory issue (my reference genome has many contigs)? The other 16 biological samples were run without any problems, so I am not very sure what could be going on. Indeed, any tips or suggestions are super appreciated! Maybe I could send you some files that you consider relevant, if it is fine with you, so that you could try on your side ? Thank you for any response :)
This is very weird indeed. The error message you are seeing seems to indicate that one of the C context files is truncated, and this might not have anything to do with the coverage file generation whatsoever. Could you maybe upload the C context files (gzipped) to the server? Is it still active?
Hi! I think it is not active anymore. I tried to connect via FileZilla but got:
Response: 331 Please specify the password. Command: PASS **** Response: 530 Login incorrect. Error: Critical error: Could not connect to server
Is there any way I could regain access to it to send you the files? I will gzip them in the meantime and send them as binary. Thank you!
Just sent you some new details, let me know if you didn't receive them.
Hi @mo716
Some of the bismark2bedGraph
runs have now completed, and there do indeed seem to be issues with the input files.
Here are some lines from a file where I only used CpG context:
Finished writing methylation calls from CpG_OB_E23_BBCH16.multiple.deduplicated.txt.gz to merged temp file
Finished writing methylation calls from CpG_OT_E23_BBCH16.multiple.deduplicated.txt.gz to merged temp file
Sorting input file E23_BBCH16_CpG_cutoff3.gz.methylation_calls.merged by positions (using -S of 45G)
Methylation state of sequence (A00202:647:HCHJVDSXY:4:2678:25644:22169_1:N:0:AGTGTCTC+GACCATAG) in file (E23_BBCH16_CpG_cutoff3.gz.methylation_calls.merged) on line 1 is inconsistent (meth_state is -, met
h_state2 = chrA10)
Methylation state of sequence (A00202:647:HCHJVDSXY:4:1151:18213:20635_1:N:0:AGTGTCTC+GACCATAG) in file (E23_BBCH16_CpG_cutoff3.gz.methylation_calls.merged) on line 2 is inconsistent (meth_state is +, met
h_state2 = chrC02)
So I tried to look at just one read in one file of its own:
zcat CpG_OT_E23_BBCH16.multiple.deduplicated.txt.gz | grep A00202:647:HCHJVDSXY:4:2678:25644:22169
A00202:647:HCHJVDSXY:4:2678:25644:22169_1:N:0:AGTGTCTC+GACCATAG + chrC05 22452866 Z
A00202:647:HCHJVDSXY:4:2678:25644:22169_1:N:0:AGTGTCTC+GACCATAG + chrC05 22452885 Z
A00202:647:HCHJVDSXY:4:2678:25644:22169_1:N:0:AGTGTCTC+GACCATAG + chrC05 22452902 Z
A00202:647:HCHJVDSXY:4:2678:25644:22169_1:N:0:AGTGTCTC+GACCATAG + chrC05 22452921 Z
A00202:647:HCHJVDSXY:4:2678:25644:22169_1:N:0:AGTGTCTC+GACCATAG - chrC05 22453234 z
A00202:647:HCHJVDSXY:4:2678:25644:22169_1:N:0:AGTGTCTC+GACCATAG - chrC05 22453218 z
A00202:647:HCHJVDSXY:4:2678:25644:22169_1:N:0:AGTGTCTC+GACCATAG - chrC05 22453195 z
A00202:647:HCHJVDSXY:4:2678:25644:22169_1:N:0:AGTGTCTC+GACCATAG - A00202:647:HCHJVDSXY:4:1104:13584:14309_1:N:0:AGTGTCTC+GACCATAG - chrA10 11273344 z
or here:
zcat CpG_OT_E23_BBCH16.multiple.deduplicated.txt.gz | grep A00202:647:HCHJVDSXY:4:1242:16514:33599
...
A00202:647:HCHJVDSXY:4:1242:16514:33599_1:N:0:CCGTGATT+TGTCACCA - chrA09 43133934 z
A00202:647:HCHJVDSXY:4:1242:16514:33599_1:N:0:CCGTGATT+TGTCACCA - A00202:647:HCHJVDSXY:4:1224:8061:20901_1:N:0:CCGTGATT+TGTCACCA + chrA08 5070130 Z
So for some reason, the C* context files sometimes have lines that appear to be corrupted somehow, or at least do not adhere to the required format specification.
I could imagine two reasons for this:
a) not sure if you, or someone else, performed any kind or merging or operation on these files (or possibly copy them in between platforms such as Linux, OSX and Windows) - this would need fixing
b) another scenario that I could think of might be that you did the methylation extraction using multiple cores (--parallel ...
), and possibly these multiple threads did weird things when writing to the output file, so that one thread injected a line right while another line had been written. Such buffering issues might also depend on your file system, the nodes you used etc, which might explain why it happens only in some cases but not in all of them.
As a solution to b) you should just be able to run the methylation extraction and sorting using a single core. I appreciate that you will have to wait for it for a fairly long time, but on the upside - it should work! (and you would only have to do it once...).
Final tip: If I may give you one final recommendation, I would personally not specify a --cutoff 3
for this step, because the process runs for quite a while. Removing every position with fewer than 3 calls will mean that other positions won't even show up in the coverage file, so there is no going back if you will.
On the other hand, if you include every position (the default behaviour), you can choose to look at all positions, or apply a threshold filter afterwards using a quick post-processing step with sed
or awk
. This would probably only take a few minutes to apply. Does that make sense?
Hi @FelixKrueger ! Thank you very much for the help! I thought about the two possible reasons you mentioned which could be leading to this issue. I would maybe discard option (a) since I haven't transfer the files between operating systems. I have tried merging libraries of the same biological sample etiher at the deduplication or the bed2Graph steps for the 2 problematic samples, but I would still get empty files.
I am now running your suggestions for option (b), so I removed the --parallel
option in the methylation extraction function. The cluster is a bit busy so the job is still queing, but I will write back as soon as I have the output with all the important details of my input, output and command files. Thank you the --cutoff
3 tip too. Hope to update you soon :)
Hi @FelixKrueger ! All is working smoothly now after following your solution to (b). That means, that I ran the command without the --parallel
option for the two problematic samples. I submitted the job in a computer cluster requesting 75G of virtual free memory and a single core. The ran command was:
~/programs/bismark/Bismark-0.23.0/bismark_methylation_extractor --ignore 2 --ignore_r2 2 -o /PATH/TO/TEMP ${INPUT}
This produced the coverage and bed2Graphs files without any error as shown below:
Sorting input file E23.BBCH16.gz.methylation_calls.merged by positions (using -S of 20G) Successfully deleted the temporary input file E23.BBCH16.gz.methylation_calls.merged
(END)
Sorting input file G17.BBCH65.gz.methylation_calls.merged by positions (using -S of 20G) Successfully deleted the temporary input file G17.BBCH65.gz.methylation_calls.merged
(END)
So as you mentioned, maybe the multiple threads were doing something weird ("one thread injected a line right while another line had been written"). If helpful, I would like to say that, the single core methylation lasted around 7 hours for sample G17 (26 GB deduplicated BAM file) and 15 hours for sample E23 (30 GB deduplicated BAM file). Please note that these results are based on a deduplicated BAM file containing merged libraries (i.e. I trimmed and aligned technical replicates separately, and merged them at the deduplication step).
The generated C* context files had altogether a size of 314 GB and 366 GB for the G17 and E23 sample respectively. The bed2Graph would last about 1-2 hours and produced coverage and bed2Graphs files of ca. 1 G when using a buffer size of 20G and cutoff of 3. Overall, it was quite fast :)
Please feel free to close this issue. Again, thank you very much! :)
Excellent, I am glad it has now all worked!
Dear Felix,
I hope you are doing fine. I am currently having an issue with certain libraries when running the bismark2bedGraph function. This is the command I have used:
~/programs/bismark/Bismark-0.23.0/bismark2bedGraph -dir /TEMP/bismark -o $OUTPUT.NAME --cutoff 3 --CX --buffer_size 20G --scaffolds C*txt
I am using the scaffolds option because my genome reference has around 1400 contigs. The command above generates the bedGraph and coverage files in 16 out of 18 biological samples. With "successfully", I am referring to both files not being empty. That means, that the files do not have any information in them for the other 2 samples Moreover, the temporary "methylation_calls.merged" file is not deleted from these 2 samples after the command finishes; so maybe the whole command is interrupted somewhere, and thus, the files are empty. This is how the output directory for the two samples looks after the command finishes:
The content of the files when open is empty as shown below:
zcat G17.gz | less
zcat G17.gz.bismark.cov.gz | less
zcat E23.gz | less
zcat E23.gz.bismark.cov.gz | less
I read about a similar issue (https://github.com/FelixKrueger/Bismark/issues/279) that was solved by changing the buffer size. I am also running the job in a computer cluster using SGE, so I changed the command to increase the virtual memory as follows:
#!/bin/bash
#$ -S /bin/bash
#$ -N bismark2bedgraph
#$ -cwd
#$ -t 1
#$ -pe multislot 1
#$ -q all.q
#$ -l virtual_free=100G
~/programs/bismark/Bismark-0.23.0/bismark2bedGraph --dir /TEMP/bismark -o $OUTPUT.NAME --cutoff 3 --CX --buffer_size 100G --scaffolds C*txt
Nevertheless, the bedGraph and coverage files are still empty :/ . These are the standard errors I have from the two files (when adjusting buffer size to 100G)
Sample E23
Sample G17
I also checked whether all of the output files from the methylation extraction were truncated at the end or not
tail C*E23.multiple.deduplicated.txt
tail C*G17.multiple.deduplicated.txt
All of them had the information regarding methylation on the last lines based on tail. I also check if the merged files were truncated at the end:
tail E23.gz.methylation_calls.merged
tail G17.gz.methylation_calls.merged
The end of both files looks fine. I re-ran the methylation extraction step as shown below:
~/programs/bismark/Bismark-0.23.0/bismark_methylation_extractor --ignore 2 --ignore_r2 2 --parallel 5 -o /ceph/sge-tmp/bismark ${INPUT}
But I still had the same issues at the bed2Graph step. I would really appreciate any help or suggestions in how to proceed with these two samples. As a background: I had multiple libraries made with the Accel Swift kit, so I followed your suggestions on trimming (https://github.com/FelixKrueger/Bismark/issues/422), and then I aligned against the reference with Bismark-0.23.0. Afterwards, I joined the technical replicates from each of the 18 biological replicates during the deduplication step before continuing with the bed2Graph part:
~/programs/bismark/Bismark-0.23.0/deduplicate_bismark --multiple --output_dir ./testing -o ${OUTPUT} ${INPUT}
Thank you once again for your patience and for any answer in advance :)