FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0

Bismark genome preparation: Parent process: Failed to build index #198

Closed — m3lorra closed this issue 6 years ago

m3lorra commented 6 years ago

Hi, I'm using bismark_genome_preparation to bisulfite-convert the reference genome in silico.

wget ftp://ftp.ensembl.org/pub/release-90/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
gunzip Homo_sapiens.GRCh38.dna.toplevel.fa.gz
bismark_genome_preparation --bowtie1 ../data/GRCh38

but I'm getting this message: Parent process: Failed to build index

Any clue as to what might be wrong? Bismark v0.20.0

Thank you

Writing bisulfite genomes out into a single MFA (multi FastA) file

Bisulfite Genome Indexer version v0.20.0 (last modified 26 April 2018)

A directory called /home/steph/Datasets/data/GRCh38/Bisulfite_Genome/ already exists. Bisulfite converted sequences and/or already existing Bowtie (1 or 2) indices will be overwritten!

Step I - Prepare genome folders - completed

Total number of conversions performed:
C->T:   628657194
G->A:   631423128

Step II - Genome bisulfite conversions - completed

Bismark Genome Preparation - Step III: Launching the Bowtie (1) indexer
Please be aware that this process can - depending on genome size - take several hours!
Settings:
  Output files: "BS_CT.*.ebwtl"
  Line rate: 7 (line is 128 bytes)
  Lines per side: 1 (side is 128 bytes)
  Offset rate: 5 (one in 32)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  genome_mfa.CT_conversion.fa
Reading reference sizes
Settings:
  Output files: "BS_GA.*.ebwtl"
  Line rate: 7 (line is 128 bytes)
  Lines per side: 1 (side is 128 bytes)
  Offset rate: 5 (one in 32)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  genome_mfa.GA_conversion.fa
Reading reference sizes
  Time reading reference sizes: 00:09:31
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time reading reference sizes: 00:09:33
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:13:28
bmax according to bmaxDivN setting: 768176187
Using parameters --bmax 576132141 --dcv 1024
  Doing ahead-of-time memory usage test
  Time to join reference sequences: 00:13:32
bmax according to bmaxDivN setting: 768176187
Using parameters --bmax 576132141 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 576132141 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
  Building sPrime
  Building sPrimeOrder
  Passed!  Constructing with these parameters: --bmax 576132141 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
  Building sPrime
  V-Sorting samples
  Building sPrimeOrder
  V-Sorting samples
  V-Sorting samples time: 00:04:27
  Allocating rank array
  Ranking v-sort output
  V-Sorting samples time: 00:04:27
  Allocating rank array
  Ranking v-sort output
  Ranking v-sort output time: 00:00:41
  Invoking Larsson-Sadakane on ranks
  Ranking v-sort output time: 00:00:41
  Invoking Larsson-Sadakane on ranks
  Invoking Larsson-Sadakane on ranks time: 00:01:12
  Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
  (Using difference cover)
  Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
  Splitting and merging time: 00:00:00
Avg bucket size: 3.0727e+09 (target: 576132140)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering Ebwt loop
Getting block 1 of 1
  No samples; assembling all-inclusive block
  Invoking Larsson-Sadakane on ranks time: 00:01:11
  Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
  (Using difference cover)
  Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
  Splitting and merging time: 00:00:00
Avg bucket size: 3.0727e+09 (target: 576132140)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering Ebwt loop
Getting block 1 of 1
  No samples; assembling all-inclusive block
  Sorting block of length 3072704749 for bucket 1
  (Using difference cover)
Parent process: Failed to build index
FelixKrueger commented 6 years ago

Hmm, from the on-screen text I can't exactly see what the reason might be. Do you have enough RAM available on the machine you are using? Just using the assembled human chromosomes you will need close to 10GB of memory, but I think the top-level data files contain a LOT more sequences (I think it might even need in excess of 20GB...).

Could you try to wget only the chromosomal (and possibly non-chromosomal) sequences and try with those? Also, you could try to index with Bowtie2 instead (the default) and see if that works. Is there any reason why you want to use Bowtie 1 instead? These days it is very much recommended to use Bowtie2 because of its ability to perform gapped alignments and because it scales better with longer read lengths. Cheers, Felix
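
For reference, a minimal sketch of that suggestion (the Ensembl release number and the ../data/GRCh38 genome folder are carried over from the commands above purely for illustration; leaving out --bowtie1 makes the genome preparation build Bowtie 2 indices, which is the default):

# primary assembly = chromosomes 1-22, X, Y, MT plus unplaced scaffolds, no patches/haplotypes
wget ftp://ftp.ensembl.org/pub/release-90/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# point the genome preparation at the folder containing the FastA file; Bowtie 2 is used by default
bismark_genome_preparation ../data/GRCh38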

m3lorra commented 6 years ago

Hi Felix, thanks a lot, I'll try what you suggest and I'll let you know. I'm using bowtie1 just because I'm doing RRBS analysis and every protocol that I have found uses it. I'll try with bowtie2.

Thank you!

sunshine-lp0 commented 6 years ago

Hi, what is the impact of choosing a different reference genome? Which of these Ensembl files should I choose?

  1. Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz?
  2. Homo_sapiens.GRCh38.dna_rm.primary_assembly.fa?
  3. Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz?
FelixKrueger commented 6 years ago

The genome README file explains the different options:

<sequence type>:
  * 'dna' - unmasked genomic DNA sequences.
  * 'dna_rm' - masked genomic DNA.  Interspersed repeats and low
     complexity regions are detected with the RepeatMasker tool and masked
     by replacing repeats with 'N's.
  * 'dna_sm' - soft-masked genomic DNA. All repeats and low complexity regions
    have been replaced with lowercased versions of their nucleic base

You should not use the hard-masked genome version (dna_rm) since this will convert half the genome into Ns. Soft-masking should not affect the process because, as far as I know, all sequence is converted to upper case during the genome preparation anyway.
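
If you want to double-check what you have downloaded, here is a rough sketch (the file name is just an example, and the FastA is assumed to be uncompressed) that reports how much of a genome file is hard-masked (N) versus soft-masked (lower case):

awk '!/^>/ { seq = $0; total += length(seq); hard += gsub(/N/, "", seq); soft += gsub(/[a-z]/, "", seq) }
     END   { printf "total=%d bp  hard-masked (N): %.1f%%  soft-masked (lower case): %.1f%%\n", total, 100*hard/total, 100*soft/total }' Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa

A dna_rm file will show a very large N fraction, whereas dna and dna_sm only contain Ns at assembly gaps and genuinely ambiguous positions.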

sunshine-lp0 commented 6 years ago

Thanks very much!!!!! I'll try Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz.

sunshine-lp0 commented 6 years ago

Dear Felix Krueger, I'm sorry to bother you again! After running bismark, there are some “extra chromosomes” in the report (below). Should I remove these sequences from the genome reference? I don't know how to remove them either. I'm sorry, I don't want to take up too much of your time. Thanks.

FastQ format specified chr 1 (248956422 bp) chr 10 (133797422 bp) chr 11 (135086622 bp) chr 12 (133275309 bp) chr 13 (114364328 bp) chr 14 (107043718 bp) chr 15 (101991189 bp) chr 16 (90338345 bp) chr 17 (83257441 bp) chr 18 (80373285 bp) chr 19 (58617616 bp) chr 2 (242193529 bp) chr 20 (64444167 bp) chr 21 (46709983 bp) chr 22 (50818468 bp) chr 3 (198295559 bp) chr 4 (190214555 bp) chr 5 (181538259 bp) chr 6 (170805979 bp) chr 7 (159345973 bp) chr 8 (145138636 bp) chr 9 (138394717 bp) chr MT (16569 bp) chr X (156040895 bp) chr Y (57227415 bp) chr KI270728.1 (1872759 bp) chr KI270727.1 (448248 bp) chr KI270442.1 (392061 bp) chr KI270729.1 (280839 bp) chr GL000225.1 (211173 bp) chr KI270743.1 (210658 bp) chr GL000008.2 (209709 bp) chr GL000009.2 (201709 bp) chr KI270747.1 (198735 bp) chr KI270722.1 (194050 bp) chr GL000194.1 (191469 bp) chr KI270742.1 (186739 bp) chr GL000205.2 (185591 bp) chr GL000195.1 (182896 bp) chr KI270736.1 (181920 bp) chr KI270733.1 (179772 bp) chr GL000224.1 (179693 bp) chr GL000219.1 (179198 bp) chr KI270719.1 (176845 bp) chr GL000216.2 (176608 bp) chr KI270712.1 (176043 bp) chr KI270706.1 (175055 bp) chr KI270725.1 (172810 bp) chr KI270744.1 (168472 bp) chr KI270734.1 (165050 bp) chr GL000213.1 (164239 bp) chr GL000220.1 (161802 bp) chr KI270715.1 (161471 bp) chr GL000218.1 (161147 bp) chr KI270749.1 (158759 bp) chr KI270741.1 (157432 bp) chr GL000221.1 (155397 bp) chr KI270716.1 (153799 bp) chr KI270731.1 (150754 bp) chr KI270751.1 (150742 bp) chr KI270750.1 (148850 bp) chr KI270519.1 (138126 bp) chr GL000214.1 (137718 bp) chr KI270708.1 (127682 bp) chr KI270730.1 (112551 bp) chr KI270438.1 (112505 bp) chr KI270737.1 (103838 bp) chr KI270721.1 (100316 bp) chr KI270738.1 (99375 bp) chr KI270748.1 (93321 bp) chr KI270435.1 (92983 bp) chr GL000208.1 (92689 bp) chr KI270538.1 (91309 bp) chr KI270756.1 (79590 bp) chr KI270739.1 (73985 bp) chr KI270757.1 (71251 bp) chr KI270709.1 (66860 bp) chr KI270746.1 (66486 bp) chr KI270753.1 (62944 bp) chr KI270589.1 (44474 bp) chr KI270726.1 (43739 bp)

FelixKrueger commented 6 years ago

These sequences are contigs that could not be placed on individual chromosomes. I wouldn't necessarily remove them, but maybe you can simply focus on analysing the data from chromosomes 1-22, X, Y and MT? It is good to include these sequences for the mapping process, but when we import data into SeqMonk afterwards it simply doesn't import data from unplaced contigs at all.
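
In case it is useful, a hedged sketch of restricting a downstream file to the main chromosomes only (the coverage file name is hypothetical; Bismark coverage files carry the chromosome name in column 1, and Ensembl names them 1-22, X, Y and MT):

awk 'BEGIN { for (i = 1; i <= 22; i++) keep[i] = 1; keep["X"] = keep["Y"] = keep["MT"] = 1 } keep[$1]' sample.bismark.cov > sample.main_chromosomes.cov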

sunshine-lp0 commented 6 years ago

Thank you for your patience and help!!!!

Lesky47 commented 2 years ago

> Hi Felix, thanks a lot, I'll try what you suggest and I'll let you know. I'm using bowtie1 just because I'm doing RRBS analysis and every protocol that I have found uses it. I'll try with bowtie2.
>
> Thank you!

Hello m3lorra,

I'm experiencing the same problem as you. Would you mind sharing how you solved it, please? Thank you for your kind reply!

xiaofeiwei1 commented 2 years ago

> > Hi Felix, thanks a lot, I'll try what you suggest and I'll let you know. I'm using bowtie1 just because I'm doing RRBS analysis and every protocol that I have found uses it. I'll try with bowtie2. Thank you!
>
> Hello m3lorra,
>
> I'm experiencing the same problem as you. Would you mind sharing how you solved it, please? Thank you for your kind reply!

Hello Lesky47,

I am having the same problem. Did you figure out what caused it?

Regards, xiaofei