FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0

using PE+SE data together #249

Closed: ArnoudBoot86 closed this issue 5 years ago

ArnoudBoot86 commented 5 years ago

Hi Felix,

I have a question concerning my data; I'm sorry if it's a stupid question, but I'm new to WGBS data and trying out Bismark for the analysis.

I have WGBS data, PE, sequenced using the standard illumina protocol.

I started off with trimming using Trimmomatic, to remove low-quality reads and any adapters in there (FastQC shows a clear improvement in quality when comparing before and after trimming).

The problem is that after trimming I have both PE and SE data, so I went ahead and aligned the PE data and the SE data separately (2 different commands).
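For reference, a typical Trimmomatic PE call looks something like the following (file names and adapter/quality settings are purely illustrative); the unpaired output files are where the separate SE data comes from:

java -jar trimmomatic-0.39.jar PE -phred33 \
    raw_R1.fastq.gz raw_R2.fastq.gz \
    trimmed_R1_paired.fastq.gz trimmed_R1_unpaired.fastq.gz \
    trimmed_R2_paired.fastq.gz trimmed_R2_unpaired.fastq.gz \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:36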

My question is: can I now just merge these BAMs before deduplicating (or deduplicate all BAMs together, which would also lead to a single merged BAM as the endpoint) and then proceed with methylation extraction?

I'm asking this because the methylation_extractor itself detects PE or SE, but what if the input file is mixed? Would this work?

Thank you Arnoud

FelixKrueger commented 5 years ago

Hi Arnoud,

If you have run PE and SE mapping separately you need to 'follow it through' to the methylation extraction phase, i.e. mapping -> deduplication -> methylation extraction for the PE data, and then again for the SE data. Once you have arrived at this stage you can run bismark2bedGraph on all CpG* context files together (SE + PE), and thereby arrive at a single coverage file for all mappings. The command should look something like this:

bismark2bedGraph -o merged_outfile.cov --buffer 10G CpG*

As a comment, you mentioned that you used 2 different commands, but the SE reads themselves normally need to be aligned with 2 separate commands (one directional for R1, one with --pbat for R2), unless the alignments were non-directional to start with. Otherwise R2 should return a ~0% mapping efficiency.
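Putting the whole route together, a rough sketch could look like this (genome path and file names are illustrative, and the exact names Bismark gives the unmapped read files and resulting BAMs may differ slightly, so adjust them to what you actually see):

# 1) Paired-end alignment; --unmapped writes read pairs that fail to align to separate files
bismark --genome /path/to/GRCh38 -1 R1_trimmed.fastq.gz -2 R2_trimmed.fastq.gz --unmapped

# 2) Single-end alignment of the leftovers: unmapped R1 as directional, unmapped R2 with --pbat
bismark --genome /path/to/GRCh38 R1_trimmed.fastq.gz_unmapped_reads_1.fq.gz
bismark --genome /path/to/GRCh38 --pbat R2_trimmed.fastq.gz_unmapped_reads_2.fq.gz

# 3) Deduplicate and extract methylation for each BAM in its own mode (-p paired, -s single)
deduplicate_bismark -p R1_trimmed_bismark_bt2_pe.bam
deduplicate_bismark -s R1_trimmed.fastq.gz_unmapped_reads_1_bismark_bt2.bam
deduplicate_bismark -s R2_trimmed.fastq.gz_unmapped_reads_2_bismark_bt2.bam

bismark_methylation_extractor -p --gzip R1_trimmed_bismark_bt2_pe.deduplicated.bam
bismark_methylation_extractor -s --gzip R1_trimmed.fastq.gz_unmapped_reads_1_bismark_bt2.deduplicated.bam
bismark_methylation_extractor -s --gzip R2_trimmed.fastq.gz_unmapped_reads_2_bismark_bt2.deduplicated.bam

# 4) One coverage file from all CpG context files (PE + SE)
bismark2bedGraph -o merged_outfile.cov --buffer 10G CpG*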

ArnoudBoot86 commented 5 years ago

Hi Felix, thank you for the extremely quick reply :)

Thank you for the comment; I missed this; I just concatenated the SE reads together and aligned them in one step. I did notice a lower mapping efficiency for the SE reads, but I assumed that was due to there being less information for unique mapping. I will redo it for the R1 and R2 reads separately, thank you :D

ArnoudBoot86 commented 5 years ago

Hi Felix,

sorry to bother you again,

I am proceeding with the analysis and am seeing a bit of a weird drop in the M-bias plot: between the 133rd and 139th position in the read there is a quite strong drop in methylation for some reason.

Have you seen this before? I'm thinking I probably have to hard-trim away the last 10bp to remove this potential bias; would you agree?

trimmed_PEmerged_dedup.multiple_M-bias.pdf

FelixKrueger commented 5 years ago

Hi Arnoud,

This drop looks quite suspicious to me, quite possibly like some kind of artefact of the merging procedure.

So you have done PE mapping first, followed by paired-end deduplication and methylation extraction, followed by single-end (directional and --pbat) alignments, deduplication and methylation extraction of the unmapped files, and finally run bismark2bedGraph using the combined set of extracted cytosines?

What do the M-bias plots of the individual PE and SE files look like? Does one of the three have the same spike?

ArnoudBoot86 commented 5 years ago

Hi Felix,

This is actually the PE data only; I first wanted to check the individual parts before merging (the “merged” in the PE file names is because I merged several fastqs together during the deduplication of the PE data). Here are the other 2; as you can see, both SE reads also show the same dip, but weirdly enough, for SE1 there’s actually an increase at the same position for non-CpG sites.

So I’m thinking there's just something off at those positions in the reads. Unfortunately, due to the huge file size (the raw data is 145Gb per sample) I already deleted the intermediate files (I know, not very smart :P). I will rerun and see whether this problem is lane-specific. trimmed_SE1_dedup_M-bias.pdf trimmed_SE2_dedup_M-bias.pdf

FelixKrueger commented 5 years ago

Hi Arnoud,

This is interesting. My guess would be that there is a bias at the start of Read 2, and you end up seeing this at the 3' end of both the single-end R1 and the PE data, because your typical insert length is probably somewhat shorter than 150bp. To see your typical insert length you could simply import the PE BAM file into SeqMonk and click on Read Length Histogram to find out.
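If SeqMonk isn't at hand, a rough command-line alternative (assuming samtools is installed; the BAM file name is just an example) would be:

samtools stats R1_trimmed_bismark_bt2_pe.bam | grep "insert size"

which reports the average insert size and its standard deviation from the summary (SN) section.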

Did you use the Accel Swift kit to generate the data by any chance? For this kit I remember seeing a fairly pronounced bias at the start of R2, which is why we recommend trimming the first (and last) 10bp of all reads before mapping: https://github.com/FelixKrueger/Bismark/tree/master/Docs#ix-notes-about-different-library-types-and-commercial-kits.

A simple Trim Galore run should help you there:

trim_galore --paired --clip_r1 10 --clip_r2 10 --three_prime_clip_r1 10 --three_prime_clip_r2 10 R1.fastq.gz R2.fastq.gz

FelixKrueger commented 5 years ago

One thing I wanted to mention but forgot is that while you see these drops or spikes (for non-CG methylation) in the data at the end, the actual number of calls is only a small fraction of the calls elsewhere in the read. This is a result of quality and/or adapter trimming, and is expected. So in essence, the problem, at least at the 3' end, probably looks worse than it actually is. But if you know the source of the bias I would probably still recommend removing it before even starting the alignments just to be on the safe side. Cheers, Felix

ArnoudBoot86 commented 5 years ago

Hi Felix,

I did some checking;

As for your question: no, I didn't use the Accel Swift kit; it's standard WGBS data (Illumina).

I'm checking with the sequencing company. For some samples I got 140bp data (it looks like they chopped off the first 10 bases), while for other samples I got the full 150bp; a bit weird.

But I agree with you that chopping off the last 10bp of the reads is probably the best option.

PE_read1_mBias.pdf PE_read2_mBias.pdf

FelixKrueger commented 5 years ago

Hmm, alright, it was worth trying... Certainly for R2 I would also consider trimming from the 5' end to get rid of the biases there. Let me know if you find something.

ArnoudBoot86 commented 5 years ago

Hi Felix,

thanks again for all your help.

I reran the analysis with additional trimming to remove the artefact at the end of the reads, all the way up to the bedGraph. The first thing I checked was the coverage; I started with 145Gb of raw data, hoping to get at least 20x for decent quantification of the methylation (normally I get 105Gb of data with regular WGS, which gets me >30x coverage after trimming). Surprisingly, the coverage inside the bedGraph was way below 20x (see histogram below); the median coverage of all CpG sites was only 9, and the average is 9.7...

[histogram of per-CpG coverage from the bedGraph output]

To check whether the coverage was indeed that poor, I sorted and merged the BAM files and ran bamqc using Qualimap, and this paints a much more positive picture:

[coverage plot from Qualimap bamqc on the sorted and merged BAM]

Naturally I would understand that extremely CG rich regions would be covered less, but this is a tremendous difference.

Also, I checked whether there is a very strong difference in coverage between methylated and unmethylated sites (which could point to PCR amplification bias), but for both unmethylated and methylated sites (as per the bedGraph output), most CpG sites are just very lowly covered. (All in all, the distribution of the methylation looks like I would expect, fitting the ~48-49% methylation from the M-bias plot.)

[coverage distributions of methylated and unmethylated CpG sites]
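(For reference, this kind of check can be sketched directly from the Bismark coverage file, where columns 5 and 6 are the methylated/unmethylated counts; the 50% cut-off and the file name are arbitrary choices:)

zcat merged_outfile.cov.gz | awk '{d=$5+$6; if ($4>=50) {m+=d; mc++} else {u+=d; uc++}}
    END {printf "mean depth, sites >=50%% methylated: %.1fx (n=%d)\n", m/mc, mc;
         printf "mean depth, sites <50%% methylated:  %.1fx (n=%d)\n", u/uc, uc}'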

Do you know why the coverage is so extremely different? Is there some base or mapping quality filter inside the methylation_extractor that doesn't count some of the reads for example?

FelixKrueger commented 5 years ago

Hi Arnoud,

If I am not mistaken then the answer to that is probably quite simple: whole-genome sequencing coverage, or what you did with BamQC or Qualimap on the BAM files, takes the entire reads into account, whereas the methylation data in the coverage files is strand-specific. In other words, for any C on the top strand you can discard about half of all reads (as they are only informative for the bottom strand and its cytosines).

For a fairer comparison, I guess you would have to merge the calls of top and bottom strand cytosines in CpG context (you could do this using coverage2cytosine --merge_CpG) and then do your plot again. I am sure it would look much more like the coverage histogram from your merged.bam file.
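As a minimal sketch of that re-plot, assuming the merged calls end up in the usual coverage format (chromosome, start, end, methylation %, count methylated, count unmethylated; the file name is illustrative), a coverage histogram could be tabulated like this:

zcat consolidated.merged_CpG_evidence.cov.gz | awk '{hist[$5+$6]++} END {for (d in hist) print d, hist[d]}' | sort -n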

To answer your specific question: no, there is no additional secret filtering going on within the methylation extractor. There are a few cases where you won't get a methylation call at a C position in the genome, but those are all handled during the mapping step already.

While these cases may occur occasionally, they would certainly not have a drastic effect (if they are measurable at all).

Would that make sense?

ArnoudBoot86 commented 5 years ago

Oh yes, that makes sense, thank you.

I tried to run coverage2cytosine with the following command:

coverage2cytosine -o consolidatedCpGcalls --genome_folder ~/ref_genomes/hg38 --merge_CpG --gzip merged_outfile.cov.gz &

I get a continuous error (one for every line of the input it processes):

Use of uninitialized value $meth in join or string at /home/gmsarno/scr/bismark_v0.21.0/coverage2cytosine line 340, line 4504746.
Use of uninitialized value $nonmeth in join or string at /home/gmsarno/scr/bismark_v0.21.0/coverage2cytosine line 340, line 4504746.

The resulting file (consolidatedCpGcalls.CpG_report.txt.gz) has 0 reads for every CpG.

any tips?

FelixKrueger commented 5 years ago

I can see that your previous message was edited, but the answer would probably have been: yes, some people believe it makes sense to combine top and bottom strand calls for CpG dinucleotides as they should be methylated symmetrically (which is obviously not true for hemi-methylated positions where the copying mechanism is lagging behind or similar).

Regarding the error above, it seems to indicate that the coverage file you feed in as input is either empty, truncated or similar. Could you do a zcat merged_outfile.cov.gz | head or zcat merged_outfile.cov.gz | tail to see if the file looks OK? If it doesn't look right, would you mind sending the file, or part of the file, and the exact output of the entire coverage2cytosine process via email?

ArnoudBoot86 commented 5 years ago

Hi,

I removed the previous message because, after re-reading your message, I understood that the "merge_CpG" option merges the methylation calls from the top and bottom strands (and for my particular question hemi-methylation isn't a variable I am considering, due to the nature of the samples).

As for the input file, as far as I can tell it looks fine:

[screenshot of the first lines of the input file]

I then tried to run coverage2cytosine on the first 1000 lines of this file; this again gives a similar error (although now it's complaining about line 569 in the coverage2cytosine script).

I can't share the full output through email due to its size (it's 50Mb and growing; it seems to insist on listing every position in the reference genome, regardless of whether it was covered or not, and there also seem to be lots of non-CpG positions listed???), but the input and the first 1000 rows of output are attached: output.txt input.txt.gz

FelixKrueger commented 5 years ago

Hi Arnoud,

What you listed above is not a coverage file; it is in bedGraph format. The coverage file has two additional columns with the counts of methylated and unmethylated calls. The methylation extractor generates both a .cov.gz and a .bedGraph.gz file by default. Do you still have the real coverage file floating around?

track type=bedGraph
chr1    10468   10469   100
chr1    10470   10471   100
chr1    10483   10484   100
chr1    10488   10489   100
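For comparison, the corresponding coverage file entries would look something like this (1-based positions with start equal to end, and made-up counts purely for illustration):

chr1    10469   10469   100     3       0
chr1    10471   10471   100     5       0
chr1    10484   10484   100     2       0
chr1    10489   10489   100     4       0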

ArnoudBoot86 commented 5 years ago

oops :D

thanks, that works, thank you 👍

Any reason why the output also has lines for every position behind a CpG?

FelixKrueger commented 5 years ago

The coverage2cytosine command first produces a regular (single-C resolution) CpG output file; the --merge_CpG option will then use this file as input to generate an additional CpG-dinucleotide file (I believe in coverage format).
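Illustratively (positions and counts made up, and the merged-file layout here is from memory, so treat it as a sketch): two single-C report lines for the two strands of one CpG (columns: chromosome, position, strand, count methylated, count unmethylated, context, trinucleotide context), e.g.

chr1    10469   +       3       1       CG      CGT
chr1    10470   -       5       0       CG      CGA

would be combined by --merge_CpG into a single dinucleotide entry with the counts from both strands summed, roughly:

chr1    10469   10470   88.888889       8       1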

FelixKrueger commented 5 years ago

Closing this as there don't seem to be any new developments.