Faulty converted genome and sam merger issues

Jasmine0198 commented 6 years ago

Hi Felix, I have a large and very fragmented genome that I'm trying to use to analyse wgbs data, which I suspect might be the sole reason I am having so much trouble running bismark. I thought I would ask about two specific problems I'm having just in case you may be able to help me with them. The first is that when I run bismark_genome_preparation I get no error message, and I have GA_conversion folder and CT_version folder that have 7 files of the same size (as expected) but when I run bismark I always get the error message;

The Bowtie 2 index of the C->T converted genome seems to be faulty or non-existant ('BS_CT.1.bt2'). Please run the bismark_genome_preparation before running Bismark The Bowtie 2 index of the C->T converted genome seems to be faulty or non-existant ('BS_CT.2.bt2'). Please run the bismark_genome_preparation before running Bismark etc.

I think ultimately this is something to do with how fragmented my genome is but I still get an output from Bismark, just not sure if I can trust it based on these errors.

The second issue is that when I run the command on a subset of wgbs reads with 13 million sequences with the below command, I get two sam.gz files that haven't merged and an empty bam file.

/iridis4-home/js4u13/src/bismark/Bismark_v0.19.0/bismark -L 20 -N 1 --score_min L,0,-0.6 -p 4 --parallel 2 --path_to_bowtie /local/software/biobuilds/2017.11/bin /scratch/js4u13/WGS_23_10_17/CA9-3 -q /scratch/js4u13/WGS_23_10_17/CA9-3/1CA9_3_S5_L004_R1_001.tagged_filter.trimmomatic_out.fastq 2>&1 | tee -a 1CA9_3_S5_L004_R1_001.tagged_filter.trimmomatic_out.log

The message at the end of the log file is:

Now merging BAM files 1CA9_3_S5_L004_R1_001.tagged_filter.trimmomatic_out.fastq.temp.1_bismark_bt2.bam 1CA9_3_S5_L004_R1_001.tagged_filter.trimmomatic_out.fastq.temp.2_bismark_bt2.bam into >>> 1CA9_3_S5_L004_R1_001.tagged_filter.trimmomatic_out_bismark_bt2.bam <<< Use of uninitialized value $samtools_path in concatenation (.) or string at /iridis4-home/js4u13/src/bismark/Bismark_v0.19.0/bismark line 1354, line 60390008. Merging from file >> 1CA9_3_S5_L004_R1_001.tagged_filter.trimmomatic_out.fastq.temp.1_bismark_bt2.bam << Use of uninitialized value $samtools_path in concatenation (.) or string at /iridis4-home/js4u13/src/bismark/Bismark_v0.19.0/bismark line 1368, line 60390008. Merging from file >> 1CA9_3_S5_L004_R1_001.tagged_filter.trimmomatic_out.fastq.temp.2_bismark_bt2.bam << Use of uninitialized value $samtools_path in concatenation (.) or string at /iridis4-home/js4u13/src/bismark/Bismark_v0.19.0/bismark line 1365.

Is there a way that I can merge this sam files manually or do you think I can't trust this output? Any guidance would be very much appreciated, Jasmine

FelixKrueger commented 6 years ago

Hi @Jasmine0198 ,

Regarding the indexing part of your question:

Can you please link the commands you used for the genome preparation and the subsequent Bismark command so we can spot potential errors in there. You said you the genome preparation completed without errors, so I assume that you allowed enough time for it to complete?

When you get the messages

The Bowtie 2 index of the C->T converted genome seems to be faulty or non-existant ('BS_CT.1.bt2'). Please run the bismark_genome_preparation before running Bismark

Bismark should normally die straight away and not produce any output. Can you please list the files in the directory you use as reference folder for the Bismark run? Just as a reminder, you the reference folder has to contain the .fa files of your genome, and Bismark will use the Bisulfite_Genome structure automatically.

bismark --genome reference_folder file.fastq.gz

reference_folder
        Bisulfite_Genome
                CT_conversion
                GA_conversion
chr1.fa
chr2.fa
...

Regarding the second part of your question

I think you all you need to do is to give Bismark the path to your Samtools installation to fix this. Please either install Samtools and put it in your PATH environment, or specify the path with the option --samtools_path. Regarding the command you are using I would probably drop -N 1 and set score_min to --score_min L,0,-0.4 to not be overly relaxed for the mapping parameters, at least in the first instance unless you have a mapping efficiency problem.

Please let me know how you are getting on.

Jasmine0198 commented 6 years ago

Hi Felix, Thanks very much for your quick reply. The command that I’m using for genome preparation is:

/iridis4-home/js4u13/src/bismark/Bismark_v0.19.0/bismark_genome_preparation -p 32 --bowtie2 --path_to_bowtie /scratch/js4u13/src/bowtie2-2.3.4 --verbose /scratch/js4u13/WGS_23_10_17/CA9-3

Within the folder /scratch/js4u13/WGS_23_10_17/CA9-3 I then have: -1CA9_3_S5_L004_R1_001.tagged_filter.trimmomatic_out.fastq -CA9-3.fa -CA9-3.bam -CA9-3.sam -CA9-3_alignment.log -Bisulfite_Genome -CT_conversion -BS_CT.rev.1.bt2l -BS_CT.rev.2.bt2l -BS_CT.1.bt2l -BS_CT.2.bt2l -BS_CT.3.bt2l -BS_CT.4.bt2l -genome_mfa.CT_conversion.fa -GA_conversion -BS_GA.rev.1.bt2l -BS_GA.rev.2.bt2l -BS_GA.1.bt2l -BS_GA.2.bt2l -BS_GA.3.bt2l -BS_GA.4.bt2l -genome_mfa.GA_conversion.fa

The fastq file is my wgbs reads. I have whole genome sequencing reads for each sample as well as whole genome bisulfite sequencing reads, so I’ve been aligning the wgs reads to the fragmented genome I have with bowtie2 and then converting the sam output file to an fa file and using that for the bismark_genome_preparation script. I’m guessing the problem is here, perhaps I’m just going about it all wrong? Yes the genome preparation script completes without errors and I don’t set a time limit so it should just run to completion unless it runs out of memory (but then I expect I would get an error?) Thanks for your advice regarding the second part of my question, I’ll implement your suggestions and let you know when that completes. Many thanks again, Jasmine

FelixKrueger commented 6 years ago

Hi Jasmine,

Ah right, I can see that the genome preparation generate large indexes (as denoted by the bt2l file endings). In this case it should be fine that you see the message The Bowtie 2 index of the C->T converted genome seems to be faulty or non-existant ('BS_CT.1.bt2'), but a few lines further down you should have seen a message saying that it did find large indexes, is that right? So yea, if it completes and you don't run out of resources, the resulting files should be fine.

I am not exactly sure if I understand your procedure correctly, but why don't you use your 'fragmented_genome' (presumably a .fa file as well?) for both the Bowtie2 and Bismark mapping? Assuming you are in the folder containing the .fa file(s), the commands would be something like this:

bowtie2-build fragmented_genome.fa fragmented_genome_index and bismark_genome_preparation --verbose .

Where and why does the SAM to FastA step come into play here? Let me know if I can be of any additional help.

Jasmine0198 commented 6 years ago

Hi Felix, This is a relief to hear! Great, I’ll just continue to use this output. Yeeeah, I’ve only just really thought about how this step is redundant. We have the wgs data so that we can look for SNPs and distinguish incomplete bisulfite conversion int the wgbs reads, but it’s only just occurred to me that I don’t need an assembled version of the WGS data to do this...doh! Thanks again for your input, Jasmine

FelixKrueger / Bismark

Faulty converted genome and sam merger issues #172

Regarding the indexing part of your question:

Regarding the second part of your question