help combining bismark pipeline with methylated UMIs

shawpa commented 3 months ago

I have a whole genome em-seq methylation library with dual indexes as well as methylated UMI's at the beginning of each R1 and R2. These are from Twist and were used to make EM-seq libraries. I am not sure how I need to modify my current bismark pipeline to use the --barcode option when I do the deduplication. I saw in a previous thread that you have a program called UMI-Grinder but I'm not sure where it fits (if at all) in the pipeline. Normally I do the following:

Trimming with trim galore
PE Alignment with bismark bowtie 2 option
deduplication
methylation calling

The UMIs are going to be the first 5 bases of each read1 and read2. The next 2 bases are supposed to be skipped and then my genomic sequencing should begin. I looked at the UMI-Grinder documention but the input seems to be a bam. Not sure how I get to a bam from a fastq without aligning and I can't align because the UMI's at the beginning of the read. Thank you for your help.

FelixKrueger commented 3 months ago

Hi there, apologies for the slow response.

UMIbam is indeed a tool for UMI-aware deduplication, kind of a slightly more capable version of deduplicate_bismark.

What you need here is a way of transferring the UMI-sequence from the beginning of both reads to the readIDs, and then also remove the 2 additional bp, Trim Galore already has dedicated methods of doing this for the the --clock or --implicon modes:

--implicon              This is a special mode of operation for paired-end data, such as required for the IMPLICON method, where a UMI sequence
                        is getting transferred from the start of Read 2 to the readID of both reads. Following this, Trim Galore will exit.

                        In it's current implementation, the UMI carrying reads come in the following format:

                        Read 1  5' FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF 3'
                        Read 2  3' UUUUUUUUFFFFFFFFFFFFFFFFFFFFFFFFFFFF 5'

                        Where UUUUUUUU is a random 8-mer unique molecular identifier (UMI) and FFFFFFF... is the actual fragment to be
                        sequenced. The UMI of Read 2 (R2) is written into the read ID of both reads and removed from the actual sequence.
                        Here is an example:

                        R1: \@HWI-D00436:407:CCAETANXX:1:1101:4105:1905 1:N:0: CGATGTTT
                            ATCTAGTTCAGTACGGTGTTTTCGAATTAGAAAAATATGTATAGAGGAAATAGATATAAAGGCGTATTCGTTATTG
                        R2: \@HWI-D00436:407:CCAETANXX:1:1101:4105:1905 3:N:0: CGATGTTT
                            CAATTTTGCAGTACAAAAATAATACCTCCTCTATTTATCCAAAATCACAAAAAACCACCCACTTAACTTTCCCTAA

                        After --implicon trimming:
                        R1: \@HWI-D00436:407:CCAETANXX:1:1101:4105:1905 1:N:0: CGATGTTT:CAATTTTG
                            ATCTAGTTCAGTACGGTGTTTTCGAATTAGAAAAATATGTATAGAGGAAATAGATATAAAGGCGTATTCGTTATTG
                        R2: \@HWI-D00436:407:CCAETANXX:1:1101:4105:1905 3:N:0: CGATGTTT:CAATTTTG
                                    CAGTACAAAAATAATACCTCCTCTATTTATCCAAAATCACAAAAAACCACCCACTTAACTTTCCCTAA

It should be easy enough to implement this additional trimming, you could for example use the code for --implicon and modify it slightly to meet your needs. If this is a method is getting used more widely it might make sense to add it as an additional mode of Specialised Trimming.

shawpa commented 3 months ago

It seems like the --clock is most similar to what I am doing because I have UMIs on both ends. I have no "fixed" bases so that part of the code could be removed. Hopefully I can find someone to implement this type of code change. Do the following steps sound right:

Remove adapter sequences in case any are in my reads.
Run modified --clock script to append UMI's to R1 and R2.
Hard clip at least 2 bases on the 3' and 5' ends.
Run trimgalore using default settings for length and quality trimming. Not sure if this step would be necessary.

Am I missing anything important on the trimming step?

For deduplication, I would use UMIbam instead of deduplicate_bismark in my pipeline.

FelixKrueger commented 3 months ago

I think you can start with Steps 2+3 (this can be done in one go), and then proceed with Step 4 (which will remove adapter and quality). If you add both barcodes from R1 and R2 as a string after a colon together, deduplicate_bismark should work. Something like this:

@somereadID:CGATG_CAATT

If you can't find someone to apply a local fix I can probably take a look for you.

shawpa commented 2 months ago

I just got my data and I want to confirm how the RID would need to be formatted so that the current deduplicate_bismark would work properly. An example of the R1 and R2 are below. I cut off some of the read for clarity.

@LH00512:33:22N7MVLT3:4:1101:1564:1080 1:N:0:GATAGCCA+CTGTGTTG
GNATAGTAATTTGGTATTTATATATTGTGTTTGTTAAATTTAGTGATGTATATAGGTTTAAATATTGTTTTTTATGTGTTTTAATTTTTTATAGTTTTTTAAT

@LH00512:33:22N7MVLT3:4:1101:1564:1080 2:N:0:GATAGCCA+CTGTGTTG
CNCTGTACTCCACATTTTAAAATATATAAAACCACAAAAACATCATCTTCTATATCTTATTCTTACTAACTATATAACAATAACATTCTTAAATAAACT

So the UMI for R1 in this case is GNATA and the UMI for R2 is CNCTG. Not a great example because of the N's. You had previously said it needed to be formatted as

@somereadID:CGATG_CAATT

So does this mean my RIDs should look like the following:

@LH00512:33:22N7MVLT3:4:1101:1564:1080 1:N:0:GATAGCCA+CTGTGTTG:GNATA_CNCTG
     GTAATTTGGTATTTATATATTGTGTTTGTTAAATTTAGTGATGTATATAGGTTTAAATATTGTTTTTTATGTGTTTTAATTTTTTATAGTTTTTTAAT

@LH00512:33:22N7MVLT3:4:1101:1564:1080 2:N:0:GATAGCCA+CTGTGTTG:GNATA_CNCTG
     TACTCCACATTTTAAAATATATAAAACCACAAAAACATCATCTTCTATATCTTATTCTTACTAACTATATAACAATAACATTCTTAAATAAACT

Or do I also need to remove the P7 and P5 barcodes from the RID?

FelixKrueger commented 2 months ago

Yes that's correct, it will take take the last element after the last :, so here GNATA_CNCTG and use this in addition to the chromosome, start, end and orientation of a read for the deduplication process.

zimzims commented 2 months ago

Hi Felix! I will be the one helping @shawpa with this task. Right now, the functionality we are looking to implement is to remove the first N bases of R1 and R2 and then appending those to each of the read ids as Annie showed above. We are also looking to be able to receive either a list or file of "valid" UMIs and if either R1 or R2 has a UMI that is not valid, throw away the pair. I've used cutadapt/trim_galore a bunch, but never added functionality. I would love to add the functionality and make it open source, what is the best way of going about this?

FelixKrueger commented 2 months ago

Hi Zach,

This is a very interesting idea, incidentally we have talked about Twist libraries and also libraries from Watchmaker Genomics that will apparently have a similar in-line barcoding strategy for both R1 and R2. As it turns out, it seem I may have a real interest in making this become reality myself :) My first instinct would probably be to have such a 5' inline barcode transfer function (for a variable number of nucleotides) set up as a stand-alone function (for both SE and PE), and potentially one that can be called as part of the trimming process. The latter would require a little more thinking though....

Adding functionality to validate UMIs against a UMI-allowed or whitelist seems like a rather particular, application-specific step and I wonder if this isn't best handled afterwards using a post-processing script of some sort?

I could probably have a go at UMI/inline-barcode transfer function as a stand-alone process (similar to e.g. --clock or --implicon). Should I have a go?

zimzims commented 2 months ago

Hey Felix, sorry for the late response. If you are up for it, then by all means give it a go! I agree about validating UMI as a post-processing script. We will handle that separately. If there is anything I can do to help, please let me know. Thank you for help

FelixKrueger / Bismark

help combining bismark pipeline with methylated UMIs #680