elsasserlab / minute

MINUTE-ChIP data analysis workflow
https://minute.readthedocs.io
MIT License
1 stars 0 forks source link

Trim/mask read2 sequencing poly-G artifacts #107

Closed cnluzon closed 2 years ago

cnluzon commented 3 years ago

We have recently realised that we get some artifactual peaks due to sequencing artifacts. Sometimes read2 contains a poly-G tail that is wrongly mapped to a specific genome region that has in fact a long G sequence. We know that the origin of this is the sequencing platform and would like to clean up those reads as well.

My guess is that these artifactual poly-G sequences could probably be distinguished from real sequences if they always appear at the end of read2 and base calling quality maybe drops. cutadapt can deal with these types of sequences as much as it did with the adaptor types of sequences, right?

In this case, we would rather have the G's trimmed or masked than the whole sequence thrown out.

marcelm commented 3 years ago

Cutadapt has the --nextseq-trim option for this. Sequencers using two-color chemistry cannot distinguish between G and "no nucleotide", so when they reach the end of the fragment, the read will contain a 3' end of high-quality G bases. Since the G bases are of high quality, the regular quality trimming algorithm cannot be used. Instead, for --nextseq-trim it is adjusted to essentially ignore the quality of G bases.

To remove sequences with poly-G tails, one can combine --nextseq-trim with a suitable --minimum-length. Using --discard-trimmed will not work because the quality-trimmed bases are not seen as adapters and are therefore not counted as "trimmed". Alternatively, one could provide a 3' adapter that just consists of a run of G nucleotides (-a GGGGGGGGGG) and then use --discard-trimmed.

Let me know if you would like me to work on this. If so, I would need a dataset in which this problem occurs.

simonelsasser commented 3 years ago

thanks, great. As I understand there would be two options: 1) remove entire read pair 2) mask second read, keep first read

I think 2) would be preferable because actually it's not just a negligible percentage of reads that have it, and read1 is 'good' since it's not a problem with the read per se. So if we mask (i.e. converting to N, as the --action=mask option?), then read1 would still map to correct place.

As far as I could see, if the cluster reversal fails, then all the bases of read2 are G, so one could set the minimum number quite high. An example where this happens is the dataset on uppmax, any of the input FASTQ will have such reads

proj/snic2020-6-3/yuying/minute_chip/2021_01_03_20170828BK_MC/final/fastq

marcelm commented 3 years ago

I’ve added the --nextseq-trim to one of the Cutadapt invocations in the pipeline, see #108.

Masking the second read isn’t possible with --action=mask because that option only applies to removed adapter sequences, not quality-trimmed sequences, but we should be able to get almost the same effect by just trimming away the poly-G tails using no other option than --nextseq-trim. For the relevant read pairs, the second read is then trimmed to a very short (or zero) length, and I would hope that the read mapper then essentially treats this as a single-end read.

I need to check this a bit more. I noticed that the minimal test dataset actually also has some poly-G artifacts, so I used that for testing, but none of the reads where R2 gets trimmed to a length of zero shows up in the output BAM file as mapped (many are removed because the R1 doesn’t pass some other quality check, and those that do show up are marked as unmapped).

I’m going to re-open this until I have a chance to investigate a bit more.

marcelm commented 2 years ago

I have now checked the above potential problem on a bigger dataset.

In the small test dataset, all read pairs for which R2 gets trimmed to a length of zero do not align to the reference. I was afraid that this was a systematic problem with Bowtie2, that is, that it would just ignore the entire pair if R2 has a length of zero. (It prints some warnings that could be interpreted that way.) However, that is not the case: Bowtie2 ignores only R2 and still attempts to map R1, which I can say now because it sucessfully mapped a couple of the R1 reads (when R2 was empty).

So Bowtie2 does everything correctly, it just happens to be the case that "empty R2 after ploy-G trimming" and "unmappable R1" are highly correlated.