FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
386 stars 101 forks source link

Bismark mapping problem #169

Closed Holly0123 closed 6 years ago

Holly0123 commented 6 years ago

I tried to use Bismark to analyze BS-seq(paired end), but it stopped at mapping step (using Bowtie2). The command line I used was: bismark -u 10000 --bowtie2 /ref_path/ -1 file_R1.fq -2 file_R2.fq Please check the log file below. Could you let me know the way to solve the problem. Thanks a lot.

########################################################################## Path to Bowtie 2 specified as: bowtie2 Output format is BAM (default) Alignments will be written out in BAM format. Samtools found here: '/mnt/lustre/groups/groupso/bin/anaconda3/bin/samtools' Reference genome folder provided is /mnt/lustre/groups/groupso/REFERENCES/human38/ (absolute path is '/mnt/lustre/groups/groupso/REFERENCES/human38/)' FastQ format assumed (by default) Processing sequences up to read no. 10000 from the input file Library is assumed to be strand-specific (directional), alignments to strands complementary to the original top or bottom strands will be ignored (i.e. not performed!) Setting parallelization to single-threaded (default)

Single-core mode: setting pid to 1

Paired-end alignments will be performed

The provided filenames for paired-end alignments are B00832PL_S15_L002_R1_001_val_1.fq and B00832PL_S15_L002_R2_001_val_2.fq Input files are in FastQ format Processing reads up to sequence no. 10000 from B00832PL_S15_L002_R1_001_val_1.fq Writing a C -> T converted version of the input file B00832PL_S15_L002_R1_001_val_1.fq to B00832PL_S15_L002_R1_001_val_1.fq_C_to_T.fastq

Created C -> T converted version of the FastQ file B00832PL_S15_L002_R1_001_val_1.fq (10001 sequences in total)

Processing reads up to sequence no. 10000 from B00832PL_S15_L002_R2_001_val_2.fq Writing a G -> A converted version of the input file B00832PL_S15_L002_R2_001_val_2.fq to B00832PL_S15_L002_R2_001_val_2.fq_G_to_A.fastq

Created G -> A converted version of the FastQ file B00832PL_S15_L002_R2_001_val_2.fq (10001 sequences in total)

Input files are B00832PL_S15_L002_R1_001_val_1.fq_C_to_T.fastq and B00832PL_S15_L002_R2_001_val_2.fq_G_to_A.fastq (FastQ) Now running 2 instances of Bowtie 2 against the bisulfite genome of /mnt/lustre/groups/groupso/REFERENCES/human38/ with the specified options: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --maxins 500

Now starting a Bowtie 2 paired-end alignment for CTread1GAread2CTgenome (reading in sequences from B00832PL_S15_L002_R1_001_val_1.fq_C_to_T.fastq and B00832PL_S15_L002_R2_001_val_2.fq_G_to_A.fastq, with the options: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --maxins 500 --norc)) Found first alignment: D00505:255:H3HYJBCX2:2:1101:1249:2223_1:N:0:AGGCAGAA/1 77 0 0 0 0 GATATGTTGAAGTTTTAATTTTTAGTATGTGTGAATGTGATTTTGTTTGGAAATGGTTTTTGTAAGTATATTTATGTGAAGATGAGGTAGTATAGGGTAAG GGGGGGGGIGIGGIIIIGIIIIIIIIIIIIIIIGGIIGGGIGIGGIIIIIIIIIIIIIIGIGIIIGGGIIIIIIIIIIIIIGGIIIIIIIIGGIIIGIIIG YT:Z:UP D00505:255:H3HYJBCX2:2:1101:1249:2223_2:N:0:AGGCAGAA/2 141 0 0 0 0 TCCTTCATATATATATATATACATATATATACATATATATATCTAAATCTTCACATAACATCCTTCTCTCTATATATCCATATCCATATTTATCCATATTT AAGGGIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIGIIIGGIIGGGGIIIGIIIIGGGGIIIIIIIIGIIIIIIIGGIIIIIIIGIIIGIIIIGGIGIIG YT:Z:UP Now starting a Bowtie 2 paired-end alignment for CTread1GAread2GAgenome (reading in sequences from B00832PL_S15_L002_R1_001_val_1.fq_C_to_T.fastq and B00832PL_S15_L002_R2_001_val_2.fq_G_to_A.fastq, with the options: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --maxins 500 --nofw))

FelixKrueger commented 6 years ago

It seems that the FastQ files have been bisulfite transcribed alright, and the first instance of Bowtie2 has returned a result already, however the second one hasn't. Did the program die or is it still running? Are there any further error messages than that? If it hangs at this point I could only guess that you are running this on a machine with a quite low amount of RAM, is that a possibilty?

Holly0123 commented 6 years ago

I used 4 cores, 16Gb per core, and I killed the job after 2 hours running (because it should not take 2 hours to run 10000 sequences). Before I killed the job it was still running, and no other error was shown in the log file. Is this too low for Bismark? Thank you so much.

FelixKrueger commented 6 years ago

The latest human genome is quite huge, but at this point it should not have been using all that much memory to be honest. And no, this command shouldn't take more than a minute or so... Which version of Bismark were you using? You could of course try to use more RAM and see if that completes...

Holly0123 commented 6 years ago

The Bismark version I used was v0.14.3. Do you think it matters?

FelixKrueger commented 6 years ago

Not sure to be honest, but it is certainly worthwhile upgrading to 0.19.1 anyway which I have just released a few minutes ago. Just give it a go.

Holly0123 commented 6 years ago

I will update it, and give more RAM. Thanks very much.

Holly0123 commented 6 years ago

Hi, I updated the version and used the same reference genome, but got "Exiting because chromosome name already exists. Please make sure all chromosomes have a unique name!" error. Could you help? Thanks.

FelixKrueger commented 6 years ago

The error message Exiting because chromosome name already exists. Please make sure all chromosomes have a unique name! is normally caused by either

In this case, entire_genome.fa would contain the exact same chromosome names again and kill the Bismark run.

Holly0123 commented 6 years ago

Hi Felix, I had weird number of chromosomes, please check the problem below. Could you let me know how to solve it, thanks a lot.

############################################################ Now reading in and storing sequence information of the genome specified in: /mnt/lustre/groups/groupso/REFERENCES/human38/part/

chr 0 (1657 bp) chr 1 (632 bp) chr 2 (1351 bp) chr 3 (68 bp) chr 4 (712 bp) chr 5 (535 bp) chr 6 (138 bp) chr 7 (1187 bp) chr 8 (590 bp) chr 9 (840 bp) chr 10 (940 bp) chr 11 (918 bp) chr 12 (2748 bp) chr 13 (491 bp) chr 14 (629 bp) chr 15 (723 bp) chr 16 (336 bp) chr 17 (1319 bp) chr 18 (3812 bp) chr 19 (755 bp) chr 20 (284 bp) chr 21 (323 bp) chr 22 (4860 bp) chr 23 (518 bp) . . . chr 199231 (144 bp) chr 199232 (831 bp) chr 199233 (510 bp)

FelixKrueger commented 6 years ago

To be honest I am not sure what I am supposed to be looking at? The human genome should have some 21 chromosomes, plus MT, X and Y. What are all these differently sized fragments? It might be worth downloading the genome from Ensembl (or NCBI) again?

FelixKrueger commented 6 years ago

Hmm, did you manage to figure it out?

Holly0123 commented 6 years ago

Hi Felix,

Thanks for your asking. Yes, it works for me although I can't tell the reason. I just downloaded a new reference genome and did mapping again.

Thanks for your time.

Best, Haoli

Sent from my iPhone

On Jul 19, 2018, at 5:16 PM, FelixKrueger notifications@github.com<mailto:notifications@github.com> wrote:

Hmm, did you manage to figure it out?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FFelixKrueger%2FBismark%2Fissues%2F169%23issuecomment-406211229&data=01%7C01%7Chaoli.li%40kcl.ac.uk%7C184d18c87b18416194d208d5ed584ca5%7C8370cf1416f34c16b83c724071654356%7C0&sdata=aOijBvLTQKW6P5XMBWBOKDwwtTb%2FTFxXFkn2VUJRlJU%3D&reserved=0, or mute the threadhttps://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAj9XIPK9LWaC4urqtcbhTMQb1BKag2yBks5uIE5lgaJpZM4TlQ07&data=01%7C01%7Chaoli.li%40kcl.ac.uk%7C184d18c87b18416194d208d5ed584ca5%7C8370cf1416f34c16b83c724071654356%7C0&sdata=FuwRrlTBrWBycdqs%2F4ngl1K5gPKkuX9%2Fp996tuML%2B3I%3D&reserved=0.

FelixKrueger commented 6 years ago

Excellent, I'll close the issue then. Best, Felix

jasonsaunderswilliams commented 5 years ago

Hi Felix, I wonder if you can help, I had the same error message at the end of the genome preparation step: docker run -ti --rm -v /data/genomes/:/data/genomes/ -v /data/Jason/:/inputfiles aryeelab/bismark bismark_genome_preparation --verbose /data/genomes/ Path to genome folder specified as: /data/genomes/ Aligner to be used: Bowtie 2 (default) Writing bisulfite genomes out into a single MFA (multi FastA) file

Bismark Genome Preparation - Step I: Preparing folders

Bisulfite Genome Indexer version v0.18.2 (last modified 07 November 2016) Created Bisulfite Genome folder /data/genomes/Bisulfite_Genome/ Created Bisulfite Genome folder /data/genomes/Bisulfite_Genome/CT_conversion/ Created Bisulfite Genome folder /data/genomes/Bisulfite_Genome/GA_conversion/

Step I - Prepare genome folders - completed

Bismark Genome Preparation - Step II: Bisulfite converting reference genome

conversions performed: chromosome C->T G->A chr1 47024412 47016562 chr2 47915465 47947042 chr3 38653197 38670110 chr4 35885806 35890822 chr5 35089383 35129186 chr6 33143287 33163423 chr7 31671670 31636979 chrX 29813353 29865831 chr8 28703983 28702621 chr9 24826212 24813259 chr10 27308648 27298423 chr11 27236798 27268038 chr12 26634995 26617050 chr13 18412698 18414776 chr14 18027132 18071947 chr15 17247582 17228387 chr16 17630040 17701988 chr17 17727956 17700340 chr18 14838685 14863671 chr20 13107828 13149412 chrY 5099171 5153288 chr19 13478255 13511145 chr22 8375984 8369235 chr21 7160212 7174721 chr6_ssto_hap7 918220 926944 chr6_mcf_hap5 858581 861135 chr6_cox_hap2 1069806 1072759 chr6_mann_hap4 903022 907442 chr6_apd_hap1 509167 510390 chr6_qbl_hap6 967043 969902 chr6_dbb_hap3 943550 945266 chr17_ctg5_hap1 355909 362783 chr4_ctg9_hap1 107556 107212 chr1_gl000192_random 112730 114441 chrUn_gl000225 48931 51700 chr4_gl000194_random 41521 41306 chr4_gl000193_random 40616 40608 chr9_gl000200_random 37202 37514 chrUn_gl000222 40866 41304 chrUn_gl000212 42454 40482 chr7_gl000195_random 37021 37349 chrUn_gl000223 38849 39107 chrUn_gl000224 37430 40355 chrUn_gl000219 35501 36108 chr17_gl000205_random 36716 36146 chrUn_gl000215 36250 36223 chrUn_gl000216 46717 25602 chrUn_gl000217 32578 32131 chr9_gl000199_random 34981 29426 chrUn_gl000211 31968 32507 chrUn_gl000213 33831 33346 chrUn_gl000220 40720 37697 chrUn_gl000218 33296 33774 chr19_gl000209_random 35443 38565 chrUn_gl000221 29886 30152 chrUn_gl000214 27484 29698 chrUn_gl000228 35495 34146 chrUn_gl000227 26250 26396 chr1_gl000191_random 23785 23413 chr19_gl000208_random 18369 16539 chr9_gl000198_random 15617 18485 chr17_gl000204_random 22701 21585 chrUn_gl000233 9504 9985 chrUn_gl000237 10241 11162 chrUn_gl000230 9338 8886 chrUn_gl000242 10033 11030 chrUn_gl000243 10304 9636 chrUn_gl000241 7864 7853 chrUn_gl000236 8432 9026 chrUn_gl000240 8962 8880 chr17_gl000206_random 10483 11285 chrUn_gl000232 8490 8512 chrUn_gl000234 8725 8732 chr11_gl000202_random 11254 10645 chrUn_gl000238 7805 8171 chrUn_gl000244 8564 8857 chrUn_gl000248 9265 8900 chr8_gl000196_random 7910 7519 chrUn_gl000249 8978 9033 chrUn_gl000246 7354 7392 chr17_gl000203_random 6074 6408 chr8_gl000197_random 9883 10140 chrUn_gl000245 6707 6602 chrUn_gl000247 7794 8086 chr9_gl000201_random 10373 11114 chrUn_gl000235 6585 6514 chrUn_gl000239 7268 8089 chr21_gl000210_random 7726 7251 chrUn_gl000231 6203 6029 chrUn_gl000229 5385 4601 chrM 5192 2180 chrUn_gl000226 2626 3231 chr18_gl000207_random 588 1285 Exiting because chromosome name already exists. Please make sure all chromosomes have a unique name!

FelixKrueger commented 5 years ago

Hi Jason,

The error message is quite literally the problem here:

Exiting because chromosome name already exists. Please make sure all chromosomes have a unique name!

You will need to go to the genome folder, and take a look why you've got a duplicate chromosomes in there. I noticed that you are using a fairly old version of Bismark (we are currently at v0.22.1), but the current version will also tell you the name of the chromosome that was found duplicated.

What often happens is that people have several different chromosome files in a folder, and in addition to that also keep a single file that contains all chromsome in a single file, e.g. human_genome.fa. I hope that updating Bismark will give you the necessary clue to identify the culprit.

Cheers, Felix

jasonsaunderswilliams commented 5 years ago

Cheers Felix! I think the problem was that we had two genomes in our genome folder within our server. Bismark completed genome preparation and bisulfite conversion. I now however have a new problem when Bismark is indexing of the CT converted genome, I wonder if you can help: Bismark Genome Preparation - Step III: Launching the Bowtie 2 indexer Please be aware that this process can - depending on genome size - take several hours!

Preparing indexing of CT converted genome in /data/Jason/FASTQ_and_processed_data/genomes/Bisulfite_Genome/CT_conversion/ Parent process: Starting to index C->T converted genome with the following command:

bowtie2-build -f genome_mfa.CT_conversion.fa BS_CT

Settings: Output files: "BS_CT..bt2" Line rate: 6 (line is 64 bytes) Lines per side: 1 (side is 64 bytes) Offset rate: 4 (one in 16) FTable chars: 10 Strings: unpacked Max bucket size: default Max bucket size, sqrt multiplier: default Max bucket size, len divisor: 4 Difference-cover sample period: 1024 Endianness: little Actual local endianness: little Sanity checking: disabled Assertions: disabled Random seed: 0 Sizeofs: void:8, int:4, long:8, size_t:8 Input files DNA, FASTA: genome_mfa.CT_conversion.fa Building a SMALL index Reading reference sizes Preparing indexing of GA converted genome in /data/Jason/FASTQ_and_processed_data/genomes/Bisulfite_Genome/GA_conversion/ Child process: Starting to index G->A converted genome with the following command:

bowtie2-build -f genome_mfa.GA_conversion.fa BS_GA

Settings: Output files: "BS_GA..bt2" Line rate: 6 (line is 64 bytes) Lines per side: 1 (side is 64 bytes) Offset rate: 4 (one in 16) FTable chars: 10 Strings: unpacked Max bucket size: default Max bucket size, sqrt multiplier: default Max bucket size, len divisor: 4 Difference-cover sample period: 1024 Endianness: little Actual local endianness: little Sanity checking: disabled Assertions: disabled Random seed: 0 Sizeofs: void:8, int:4, long:8, size_t:8 Input files DNA, FASTA: genome_mfa.GA_conversion.fa Building a SMALL index Reading reference sizes Time reading reference sizes: 00:00:46 Calculating joined length Writing header Reserving space for joined string Joining reference sequences Time reading reference sizes: 00:00:46 Calculating joined length Writing header Reserving space for joined string Joining reference sequences Time to join reference sequences: 00:00:29 bmax according to bmaxDivN setting: 724327615 Using parameters --bmax 543245712 --dcv 1024 Doing ahead-of-time memory usage test Passed! Constructing with these parameters: --bmax 543245712 --dcv 1024 Constructing suffix-array element generator Building DifferenceCoverSample Building sPrime Building sPrimeOrder V-Sorting samples Time to join reference sequences: 00:00:33 bmax according to bmaxDivN setting: 724327615 Using parameters --bmax 543245712 --dcv 1024 Doing ahead-of-time memory usage test Passed! Constructing with these parameters: --bmax 543245712 --dcv 1024 Constructing suffix-array element generator Building DifferenceCoverSample Building sPrime Building sPrimeOrder V-Sorting samples V-Sorting samples time: 00:01:49 Allocating rank array Ranking v-sort output V-Sorting samples time: 00:01:53 Allocating rank array Ranking v-sort output Ranking v-sort output time: 00:00:23 Invoking Larsson-Sadakane on ranks Ranking v-sort output time: 00:00:22 Invoking Larsson-Sadakane on ranks Invoking Larsson-Sadakane on ranks time: 00:00:34 Sanity-checking and returning Building samples Reserving space for 12 sample suffixes Generating random suffixes QSorting 12 sample offsets, eliminating duplicates QSorting sample offsets, eliminating duplicates time: 00:00:00 Multikey QSorting 12 samples (Using difference cover) Multikey QSorting samples time: 00:00:00 Calculating bucket sizes Splitting and merging Splitting and merging time: 00:00:00 Avg bucket size: 2.89731e+09 (target: 543245711) Converting suffix-array elements to index image Allocating ftab, absorbFtab Entering Ebwt loop Getting block 1 of 1 No samples; assembling all-inclusive block Invoking Larsson-Sadakane on ranks time: 00:00:33 Sanity-checking and returning Building samples Reserving space for 12 sample suffixes Generating random suffixes QSorting 12 sample offsets, eliminating duplicates QSorting sample offsets, eliminating duplicates time: 00:00:00 Multikey QSorting 12 samples (Using difference cover) Multikey QSorting samples time: 00:00:00 Calculating bucket sizes Splitting and merging Splitting and merging time: 00:00:00 Avg bucket size: 2.89731e+09 (target: 543245711) Converting suffix-array elements to index image Allocating ftab, absorbFtab Entering Ebwt loop Getting block 1 of 1 No samples; assembling all-inclusive block Sorting block of length 2897310462 for bucket 1 (Using difference cover) Sorting block of length 2897310462 for bucket 1 (Using difference cover) Parent process: Failed to build index

FelixKrueger commented 5 years ago

Hmm, to be honest I don't exactly know what the problem is there. Which version of Bowtie2 are you using? We recently had some issues with SMALL indexes and huge genomes (https://github.com/FelixKrueger/Bismark/issues/251), but this doesn't seem to be the problem here.

Is there a chance that your computing enviroment prevents you from launching sub-processes at all? How much memory do you have available, and which genome would you like to index?

In theory you should also be able to move to the CT and GA conversion sub-folders, and then run the commands in there yourself:

bowtie2-build -f genome_mfa.GA_conversion.fa BS_CT bowtie2-build -f genome_mfa.GA_conversion.fa BS_GA

jasonsaunderswilliams commented 5 years ago

Dear Felix,

Just to let you know it was a memory issue. Bowtie within the Bismark container I used failed to index. Fortunately I have access to a server with more RAM which handled the conversions. Thank you very much for your help.

Cheers,

Jason

FelixKrueger commented 5 years ago

Excellent, thanks for the feedback! I'm glad you've got it to work now.

jasonsaunderswilliams commented 5 years ago

Hi Felix, I hope you don't mind me continuing to use this thread for help with using Bismark. I do apologise if my questions are amateurish :) wet lab scientist transitioning. So last week I managed to perform the Bismark genome_preparation step using a container through a docker hub. Little confused about how the command line should look with the docker, but I have put the bisulfite converted files and the original genome in the folder I am specifying. However it comes up with the following error message... sudo docker run -ti --rm -v :/home/ubuntu/work/Jason/Bisulfite_Genomes_A -v /home/ubuntu/work/Jason/Bisulfite_Genomes_A:/inputfiles greatfireball/ime_bismark --bowtie2 -n 1 -l 50 /home/ubuntu/work/Jason/Bisulfite_Genomes_A/hg19.fa docker: Error response from daemon: invalid volume spec ":/home/ubuntu/work/Jason/Bisulfite_Genomes_A": invalid volume specification: ':/home/ubuntu/work/Jason/Bisulfite_Genomes_A'. See 'docker run --help'.

Everything up until bismark in the command relates to the docker. Any help would be very much appreciated.

Cheers,

Jason

jasonsaunderswilliams commented 5 years ago

I mean like this sorry, was missing "Bismark"

sudo docker run -ti --rm -v :/home/ubuntu/work/Jason/Bisulfite_Genomes_A -v /home/ubuntu/work/Jason/Bisulfite_Genomes_A:/inputfiles greatfireball/ime_bismark bismark --bowtie2 -n 1 -l 50 /home/ubuntu/work/Jason/Bisulfite_Genomes_A/hg19.fa docker: Error response from daemon: invalid volume spec ":/home/ubuntu/work/Jason/Bisulfite_Genomes_A": invalid volume specification: ':/home/ubuntu/work/Jason/Bisulfite_Genomes_A'. See 'docker run --help'.

FelixKrueger commented 5 years ago

Hi Jason,

I a,m sorry but I have never used Docker myself, so you might want to get the help in a Docker forum, or from the guys who supplied the Docker file?

Just to quickly mention that the options -n and -l were Bowtie 1 specific (whose support has been dropped in v0.21.0). The equivalent options in Bowtie2 mode are -N and -L, but I would simply use the defaults in a first run. Also, when you specify the genome for Bismark you do not give it a FastA file, but only the folder that contains the .fa file(s) as well as the folder Bisulfite_Genome, so in your case probably --genome /home/ubuntu/work/Jason/Bisulfite_Genomes_A/.

jasonsaunderswilliams commented 5 years ago

Thanks again Felix. I will feedback once I find the solution.

jasonsaunderswilliams commented 5 years ago

Hi Felix, all sorted now. The successful code using this docker in case it's any use for anyone using bismark through a container: sudo docker run -ti --rm -v /home/ubuntu/work/Jason/:/inputfiles greatfireball/ime_bismark bismark --genome /inputfiles/3944Swan_CS001_hg38.fastq.gz