FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0

Fail to make genome ref for bismark alignment #164

Closed gbnci closed 6 years ago

gbnci commented 6 years ago

I downloaded the opossum genome sequence from Broad ("Monodelphis_domestica.monDom5.dna.toplevel.fa") and ran the genome preparation command "bismark_genome_preparation --bowtie2 ." with the "fa" file in the current directory. In the resulting "Bisulfite_Genome" folder, two directories were created ("CT_conversion" and "GA_conversion"), but only five files were generated under each one:

CT_conversion: BS_CT.1.bt2 BS_CT.2.bt2 BS_CT.3.bt2 BS_CT.4.bt2 genome_mfa.CT_conversion.fa
GA_conversion: BS_GA.1.bt2 BS_GA.2.bt2 BS_GA.3.bt2 BS_GA.4.bt2 genome_mfa.GA_conversion.fa

While running the Bismark alignment, I got the following error:

Alignments will be written out in BAM format. Samtools found here: '/usr/local/apps/samtools/1.6/bin/samtools'
Reference genome folder provided is /scratch/wangyong/possum/ (absolute path is '/spin1/scratch/wangyong/possum/)'
The Bowtie 2 index of the C->T converted genome seems to be faulty or non-existant ('BS_CT.rev.1.bt2'). Please run the bismark_genome_preparation before running Bismark
The Bowtie 2 index of the C->T converted genome seems to be faulty or non-existant ('BS_CT.rev.2.bt2'). Please run the bismark_genome_preparation before running Bismark
The Bowtie 2 index of the G->A converted genome seems to be faulty or non-existant ('BS_GA.rev.1.bt2'). Please run bismark_genome_preparation before running Bismark
The Bowtie 2 index of the G->A converted genome seems to be faulty or non-existant ('BS_GA.rev.2.bt2'). Please run bismark_genome_preparation before running Bismark

Couldn't find a traditional small Bowtie 2 index for the genome specified (ending in .bt2). Now searching for a large index instead......

It seems I failed to create the genome index needed for the alignment. Could you please give me any suggestions? I am using Bismark 0.19.0 and Bowtie 2 version 2.3.4. Thanks

FelixKrueger commented 6 years ago

Hi @gbnci,

I just looked, and I had already downloaded the Opossum genome in 2015 (build BROAD05). It produced the following 7 files in both the CT and GA genome conversion folders:

-rw-rw-r-- 1 fkrueger bioinf 1172288552 Jun 16  2015 BS_CT.1.bt2
-rw-rw-r-- 1 fkrueger bioinf  875415080 Jun 16  2015 BS_CT.2.bt2
-rw-rw-r-- 1 fkrueger bioinf     655316 Jun 16  2015 BS_CT.3.bt2
-rw-rw-r-- 1 fkrueger bioinf  875415075 Jun 16  2015 BS_CT.4.bt2
-rw-rw-r-- 1 fkrueger bioinf 1172288552 Jun 16  2015 BS_CT.rev.1.bt2
-rw-rw-r-- 1 fkrueger bioinf  875415080 Jun 16  2015 BS_CT.rev.2.bt2
-rw-rw-r-- 1 fkrueger bioinf 3665725775 Jun 16  2015 genome_mfa.CT_conversion.fa

So you definitely need 6 files ending in .bt2. If I had to guess, I would suspect that either a) you didn't wait long enough for the indexing to complete (judging by the genome size of 3.6GB I would expect it to take between 2 and 4 hours), or b) you didn't give it enough memory to work with. Parallel indexing will probably need 9GB of RAM or more.

Is either of those possible?
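
A quick way to confirm that an index build really finished is to check for the six .bt2 files in each conversion folder and to flag any zero-byte files. The following is a minimal Python sketch, not part of Bismark: the function name `check_bismark_index` is made up for illustration, and it assumes a small Bowtie 2 index laid out as in the listing above (a large index uses a different extension and would not be matched).

```python
from pathlib import Path

# Suffixes of a small Bowtie 2 index, taken from the file listing in this thread.
BT2_SUFFIXES = ("1.bt2", "2.bt2", "3.bt2", "4.bt2", "rev.1.bt2", "rev.2.bt2")


def check_bismark_index(genome_dir):
    """Return a list of problems found in <genome_dir>/Bisulfite_Genome."""
    problems = []
    root = Path(genome_dir) / "Bisulfite_Genome"
    # One folder per bisulfite conversion, each with its own index prefix.
    for subdir, prefix in (("CT_conversion", "BS_CT"), ("GA_conversion", "BS_GA")):
        for suffix in BT2_SUFFIXES:
            path = root / subdir / f"{prefix}.{suffix}"
            if not path.is_file():
                problems.append(f"missing: {path}")
            elif path.stat().st_size == 0:
                problems.append(f"empty: {path}")
    return problems
```

Calling `check_bismark_index("/path/to/genome")` returns an empty list when all twelve index files exist and are non-empty. It cannot tell whether a non-empty file is complete, so it is only a first sanity check.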

I just downloaded the genome you mentioned (Monodelphis_domestica.monDom5.dna.toplevel.fa) from Ensembl (couldn't find that very file from the Broad website), and am indexing as we speak. I'll update tomorrow if it didn't complete successfully for any reason. So far it seems to have started fine:

Bismark Genome Preparation - Step II: Bisulfite converting reference genome

conversions performed:
chromosome  C->T        G->A
1           138636207   138674540
2           100127924   100132406
3           95991212    96049612
4           79638925    79580201
5           55208532    55150288
6           54020638    54008284
7           46697639    46721056
8           57496204    57378070
X           14957212    14986487
MT          3776        2201
Un          19434538    19520964

Total number of conversions performed:
C->T:   662212807
G->A:   662204109

gbnci commented 6 years ago

Hi FelixKrueger, thanks for your suggestion. I think you are right. When I generated the genome, it took only a few minutes and appeared to have finished, and two of the five files have zero size. I used only 8 GB of RAM for the processing. I am trying again right now and will post an update here tomorrow. Thanks, Y Wang

gbnci commented 6 years ago

It finished in a minute again, even though I used 32 GB of RAM, with the same output as before. Here is the output:

Writing bisulfite genomes out into a single MFA (multi FastA) file

Bisulfite Genome Indexer version v0.19.0 (last modified 07 November 2016)

Step I - Prepare genome folders - completed

Total number of conversions performed:
C->T: 662212807
G->A: 662204109

Step II - Genome bisulfite conversions - completed

Bismark Genome Preparation - Step III: Launching the Bowtie 2 indexer
Please be aware that this process can - depending on genome size - take several hours!
Settings:
  Output files: "BS_CT.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  genome_mfa.CT_conversion.fa
Building a SMALL index
Reading reference sizes
Settings:
  Output files: "BS_GA.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  genome_mfa.GA_conversion.fa
Building a SMALL index
Reading reference sizes
  Time reading reference sizes: 00:00:59
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time reading reference sizes: 00:00:59
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences

I am guessing that the genome FASTA file I downloaded may be causing the problem. I will try downloading the file from Ensembl, as you suggested, to see whether that works. Thanks

FelixKrueger commented 6 years ago

It seems that your process still hadn't finished yesterday. Over here it took ~13GB of memory and close to 6 hours for the indexing:

 System Time      = 01:06:54
 Wallclock Time   = 05:47:21
 CPU              = 10:36:16
 Max vmem         = 12.963G
 Exit Status      = 0

The files are identical (to the byte) to the older build from the Broad:

-rw-r--r-- 1 fkrueger bioinf 1172288552 Mar 21 01:07 BS_CT.1.bt2
-rw-r--r-- 1 fkrueger bioinf  875415080 Mar 21 01:07 BS_CT.2.bt2
-rw-r--r-- 1 fkrueger bioinf     655316 Mar 20 21:54 BS_CT.3.bt2
-rw-r--r-- 1 fkrueger bioinf  875415075 Mar 20 21:54 BS_CT.4.bt2
-rw-r--r-- 1 fkrueger bioinf 1172288552 Mar 21 03:35 BS_CT.rev.1.bt2
-rw-r--r-- 1 fkrueger bioinf  875415080 Mar 21 03:35 BS_CT.rev.2.bt2
-rw-r--r-- 1 fkrueger bioinf 3665725775 Mar 20 21:53 genome_mfa.CT_conversion.fa

So I am hoping that this morning everything should just work for you.

Cheers, Felix
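
The size comparison between the two index builds above can also be scripted. This is a rough Python sketch under stated assumptions: `compare_file_sizes` is a hypothetical helper, not a Bismark or Bowtie 2 command, and the two directory arguments stand in for wherever the two builds were written. Matching sizes do not prove the contents are identical, so a checksum would be needed for a stronger claim.

```python
from pathlib import Path


def compare_file_sizes(dir_a, dir_b, pattern="*.bt2"):
    """Return {filename: (size_a, size_b)} for files whose sizes differ.

    A size of None means the file exists in only one of the two folders.
    """
    sizes_a = {p.name: p.stat().st_size for p in Path(dir_a).glob(pattern)}
    sizes_b = {p.name: p.stat().st_size for p in Path(dir_b).glob(pattern)}
    differences = {}
    for name in sorted(sizes_a.keys() | sizes_b.keys()):
        size_a, size_b = sizes_a.get(name), sizes_b.get(name)
        if size_a != size_b:
            differences[name] = (size_a, size_b)
    return differences
```

An empty result means every matching file has the same size in both folders. A file reported with a size of 0 on one side would point to an interrupted build like the one described earlier in this thread.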

gbnci commented 6 years ago

Hi FelixKrueger, thanks for your troubleshooting. I think there must be some kind of settings issue on my side that prevents the run from finishing, as it always stops after about a minute. While I am still troubleshooting here, I am wondering whether I could get the files you just generated to help with my analysis. If that is OK, I can send you a link so you can upload the files to me. Thanks again for the help. Best regards, Y Wang

FelixKrueger commented 6 years ago

This is certainly possible, here are the details for the files (active for 3 days):

Connection Details

Hostname ftp2.babraham.ac.uk Username ftpusr41 Password p7n8GbKA FTP URL ftp://ftpusr41:p7n8GbKA@ftp2.babraham.ac.uk

Cheers, Felix

gbnci commented 6 years ago

Thanks. Will do it as soon as possible.

FelixKrueger commented 6 years ago

It turned out that the culprit was an additional .fa file in the folder that led to the apparent duplication of file names. All seems to be working now.
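
For anyone hitting the same symptom, a stray FASTA file in the genome folder can be spotted before indexing. The sketch below is an illustration only, written in Python: `find_duplicate_sequence_names` is a made-up helper, and it rests on two assumptions that the thread does not confirm. The first is that the duplication involved sequence names repeated across the FASTA files. The second is that a sequence name is the first whitespace-delimited word of a header line. It scans only uncompressed .fa and .fasta files in the top level of the folder.

```python
from collections import defaultdict
from pathlib import Path

# Extensions treated as FASTA here; compressed files are not handled.
FASTA_EXTENSIONS = (".fa", ".fasta")


def find_duplicate_sequence_names(genome_dir):
    """Map each sequence name that occurs more than once to the files containing it."""
    seen = defaultdict(list)
    for path in sorted(Path(genome_dir).iterdir()):
        if not path.is_file() or not path.name.endswith(FASTA_EXTENSIONS):
            continue
        with path.open() as handle:
            for line in handle:
                if line.startswith(">"):
                    fields = line[1:].split()
                    # Assumption: the name is the first word after '>'.
                    name = fields[0] if fields else ""
                    seen[name].append(path.name)
    return {name: files for name, files in seen.items() if len(files) > 1}
```

If the returned dictionary is non-empty, the folder probably holds more than one copy of the same sequences, for example the original download next to a renamed copy, and the extra file should be moved elsewhere before running the genome preparation again.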