Closed jws777 closed 3 years ago
There are some weird things going on. Maybe first of all, I am not sure if the TOPLEVEL file is generally a good choice, as it may contain haplotype and patch regions (which may lead to ambiguous alignments, and therefore result in sequences getting excluded).
---------
TOPLEVEL
---------
These files contain all sequence regions flagged as toplevel in an Ensembl
schema. This includes chromosomes, regions not assembled into chromosomes and
N padded haplotype/patch regions.
-----------------
PRIMARY ASSEMBLY
-----------------
Primary assembly contains all toplevel sequence regions excluding haplotypes
and patches. This file is best used for performing sequence similarity searches
where patch and haplotype sequences would confuse analysis. If the primary
assembly file is not present, that indicates that there are no haplotype/patch
regions, and the 'toplevel' file is equivalent.
For the mouse genome, I think they are both equivalent though, and both should work equally well.
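If you want to see for yourself which sequence regions a given FASTA file contains, listing its header lines is a quick way to do it. This is only a minimal sketch: `list_fasta_headers` is a made-up helper name, and it assumes a `gzip` that passes uncompressed input through unchanged when called with `-cdf` (GNU gzip does).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the sequence name from every FASTA header line.
# Accepts plain or gzip-compressed input (assumes gzip -cdf passes plain
# files through unchanged, as GNU gzip does).
list_fasta_headers() {
    gzip -cdf "$1" | awk '/^>/ { sub(/^>/, "", $1); print $1 }'
}
```

Running `list_fasta_headers Mus_musculus.GRCm39.dna.toplevel.fa.gz` should print one name per sequence (chromosomes, MT and the unplaced scaffolds), which you can then compare against the names in the conversion table.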
On my end, using Bismark v0.23.0, Bowtie2 v2.4.2, and Samtools v1.11, the genome prep works just fine:
Bismark Genome Preparation - Step II: Bisulfite converting reference genome
conversions performed:
chromosome C->T G->A
1 39505587 39475953
2 37509143 37549551
3 31632630 31689758
4 32273771 32299184
5 31474546 31480386
6 30324207 30303520
7 30522166 30486722
8 26636031 26613974
9 25887273 25865951
10 26297538 26349838
11 26029159 26024038
12 24373800 24425370
13 24404130 24399021
14 24926453 24936665
15 21183326 21165928
16 19408269 19424734
17 19612158 19591811
18 18129221 18175890
19 12452755 12424366
X 32224667 32254982
Y 17244325 17161729
MT 3976 2013
JH584299.1 196679 194475
GL456233.2 96662 95939
JH584301.1 49630 52220
GL456211.1 53784 52384
GL456221.1 44248 45898
JH584297.1 42059 42163
JH584296.1 41022 40705
GL456354.1 40330 40258
JH584298.1 37547 38061
JH584300.1 35488 36307
GL456219.1 34883 35727
GL456210.1 37533 37162
JH584303.1 31726 31235
JH584302.1 30031 30742
GL456212.1 33862 33317
JH584304.1 26005 22257
GL456379.1 13182 13573
GL456366.1 8411 8579
GL456367.1 7202 8000
GL456239.1 8413 8616
GL456383.1 6954 4475
GL456385.1 6854 6900
GL456360.1 6402 6234
GL456378.1 6106 6240
MU069435.1 6707 6277
GL456389.1 5331 5344
GL456372.1 5466 5079
GL456370.1 3287 2675
GL456381.1 4370 5498
GL456387.1 4791 5863
GL456390.1 1930 2970
GL456394.1 4156 4651
GL456392.1 3685 4929
GL456382.1 4209 4285
GL456359.1 4277 4578
GL456396.1 3855 4663
GL456368.1 3755 3723
MU069434.1 2091 1960
JH584295.1 611 611
Total number of conversions performed:
C->T: 553008665
G->A: 553055957
So I suspect that it is something really odd, such as the version of gunzip on your system, or maybe the incoming FastA file is somehow corrupt? Which platform are you running this on? Here are the commands I used:
wget http://ftp.ensembl.org/pub/release-104/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.toplevel.fa.gz
file Mus_musculus.GRCm39.dna.toplevel.fa.gz
Mus_musculus.GRCm39.dna.toplevel.fa.gz: gzip compressed data
(maybe you have some really odd line endings, and require dos2unix or mac2unix before running the command again?) and this was my Bismark indexing command:
bismark_genome_preparation . --verbose
Maybe it would be worth looking at the line endings first, and then updating any potentially outdated software packages?
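The checks mentioned above (is the file really gzip data, does it start like a FASTA file, are the line endings Unix-style) could be bundled into a small shell function. This is a rough sketch and not part of Bismark: `check_fasta` is a hypothetical helper, it only inspects the first 64 KB for carriage returns, and it again assumes that `gzip -cdf` passes plain files through unchanged.

```shell
#!/usr/bin/env bash
# Hypothetical helper: three quick sanity checks on a reference FASTA file.
check_fasta() {
    local f="$1" magic first
    # 1. gzip files start with the two magic bytes 0x1f 0x8b.
    magic=$(head -c 2 "$f" | od -An -tx1 | tr -d ' \n')
    if [ "$magic" = "1f8b" ]; then
        echo "compression: gzip"
    else
        echo "compression: none"
    fi
    # 2. After decompression, a FASTA file must start with '>'.
    first=$(gzip -cdf "$f" 2>/dev/null | head -c 1)
    if [ "$first" = ">" ]; then
        echo "first character: ok"
    else
        echo "first character: not >"
    fi
    # 3. Carriage returns in the first 64 KB point to DOS/Mac line endings.
    if gzip -cdf "$f" 2>/dev/null | head -c 65536 | tr -cd '\r' | grep -q .; then
        echo "line endings: CR found"
    else
        echo "line endings: unix"
    fi
}
```

If the second check fails on a file ending in .fa, the file is probably still compressed or got damaged during download or unzipping. If the third check reports CR characters, dos2unix or mac2unix would be the next thing to try.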
Thank you Felix - I had manually downloaded the genome file, unzipped it, and added it to the folder. I don't know why it would be different but when I did it using your commands then it worked perfectly and I have the same results as you showed above. Thanks so much for your quick response!
Excellent, I am glad it's all fine now. All the best
I have been trying to align some files to the mouse genome but when I come to the genome preparation step, I get this error (sorry there is so much nonsense):
Bismark Genome Preparation - Step II: Bisulfite converting reference genome
conversions performed:
chromosome C->T G->A
The specified chromosome ( ˎ,ɶ6l*NB@2{4 $:/Qi;<soo4<VU͚w;E??O???_/g000000 [long run of binary output truncated]
Bowtie 2 seems to be working fine (tested command 'bowtie2 --version' [2.3.5])
Output format is BAM (default)
The genome file I downloaded from ENSEMBL is Mus_musculus.GRCm39.dna.toplevel.fa
This is the code I am running:
# Verify which packages are available
which bowtie2
which samtools
which fastqc
which bismark
# Run FastQC on .fastq files
fastqc SRR691421_RRBS.fastq
# Quality and adapter trimming
trim_galore SRR691421_RRBS.fastq
# Preparation for alignment with Bismark
bismark_genome_preparation --verbose /mnt/c/Users/my_lab/Documents/my_name/Beerman_et_al_raw_data/genome/
(And the .fa file is definitely in the folder specified.) The QC and trimming all run fine before this step.
Any help very much appreciated, thank you very much!