epi2me-labs / pychopper

cDNA read preprocessing
Other
61 stars 9 forks source link

Compatability w/ New PCR-cDNA Kit (111.24) #12

Closed NanoCoreUSA closed 1 year ago

NanoCoreUSA commented 1 year ago

Is pychopper compatible with the updated PCR-cDNA kit (SQK-PCB111.24)? Specifically, is the recommended way to handle the UMIs that are incorporated during the library prep? We have recently completed some runs using the new kit and I'm struggling with how to properly deal with collapsing the UMIs. Any advice would be appreciated.

rebeelouise commented 1 year ago

Did you ever get an answer for this/find a solution?

NanoCoreUSA commented 1 year ago

I made a post on the Nanopore forums a couple of months ago and received some replies... I haven't gotten around to testing them yet but there were a few solutions that seemed worthwhile. Here's the link to the discussion -

https://community.nanoporetech.com/posts/handling-umis-in-new-pcr-c

Let me know if you don't have trouble accessing the forum and I can just summarize the contents here.

Caffeinated-Code commented 1 year ago

Hi @NanoCoreUSA , We have been testing some sample runs with the new PCR kit - SQK-PCB111.24. Is this an alias of the PCS111 mentioned in the examples? Any inputs on using pychopper with this new lot with UMI extraction will be very helpful! Unfortunately, the link is not directed to the post.

NanoCoreUSA commented 1 year ago

Hi @Caffeinated-Code

So yes, supposedly pychopper can extract UMIs from the new kit with the 'PCS111' flag. There's some caveats though as outlined in that post (I think you have to be signed in to the Nanopore Community for the link to work...).

I'll attempt to summarize here -

The newest version of Pychopper can be used with the 111 kits that have UMIs by running it as:

pychopper -U -y -k PCS111

The -U option signals it should be detecting UMIs and the -y pushes this requisite extra info through to the aligner to subsequently attach it to the bam/sam file that it outputs. You will have to run minmiap2 with the -y option as well to get the UMI info to show up in the bam file.

My minimap2 command looks something like this:

minimap2 -y --eqx -k 12 -uf -ax splice genome.fa in.fastq > out.sam

Which will then produce a bam file with reads containing all this info below the sequence and Q:

The UMI should now be at the bottom right, after the RX:Z: tag.

The problem with this compared to short-read UMIs and their tools is that a typical short read dataset with a UMI present in the bam (which can then be used with umi-tools dedupe of UMICollpase) will have the umi attached to the QNAME field like this:

samtools view hisat_mapped_in.bam | head -1

NS500402:643:H5LCHBGXL:4:11612:2705:16953:ACAACG        163     DDB0191165|DDB_G0267380 1591    60      27M     =       1593    27      GATAAATTATATGGTAGAGGTACAACT     6EEEEE/A/EE<EAEEEEEE/EEEEEE     AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0

And most tools I've been testing (umi-tools and UMICollapse) look for it there, resulting in...issues. I wish I could provide a nice solution to this but my current fix is to just move the UMI from the RX:Z tag to the QNAME using a bash/awk script + samtools:

samtools view -h "${input_bam}" | \

awk 'BEGIN {FS=OFS="\t"}
    {
        # Append RX:Z tag value to QNAME
        if ($0 ~ "RX:Z:") {
            split($0, a, "RX:Z:")
            split(a[2], b, "\t")
            $1 = $1 ":" b[1]
        }
        print
    }' | \

samtools view -bS -o "${output_bam}" -

Maybe someone else knows better than I, but I couldn't figure out a way to get UMICollapse to find the UMI from the RX:Z: tag location.

Other notable problems I've come across and my hacked workarounds:

1) Not all UMI lengths are the same.

If you count up the read lengths of the UMIs they vary by a nucleotide or so in length. After moving the UMI to the read name as I showed above you can create a CSV of the read name, UMI, and its length with another short awk script:

#!/usr/bin/mawk -f 

BEGIN {
    FS = "\t"    OFS = ","
    print "Read Name", "UMI", "UMI Length"
}

{
    # Extract UMI from read name
    split($1, a, ":")
    umi = a[length(a)]
    # Count length of UMI
    umi_length = length(umi)
    # Print 
   print $1, umi, umi_length
}

samtools view ${bamfile} | mawk -f count_umi_length_in_bam.awk | head -5

Which shows the varying lengths in the last field:

This will throw an error with umi_tools dedup:

AssertionError: not all umis are the same length(!): 28 - 29

If you want to use umi_tools dedup my hacked workaround is to break up the bam file based on umi length, then collapse each bam file individually, then merge them back together.

#!/bin/bash
# Input BAM with UMI appended to read name

input_bam=input.bam
# Separator of UMI

separator=":"
# Get all UMI lengths

umi_lengths=$(samtools view "${input_bam}" | awk -F $separator '{print length($NF)}' | sort -u)
# Loop over each umi length

for umi_length in $umi_lengths; do

    # Output BAM file for current UMI length

    output_bam="output_${umi_length}.bam"

    # Filter reads with current length and save to new BAM

    samtools view -h "${input_bam}" | awk -F $separator -v umi_length=$umi_length '{if (length($NF) == umi_length || substr($1, 1, 1) == "@") print}' | samtools view -bS - > "${output_bam}"

    # Index output BAM file    samtools index "${output_bam}"

    # Run umi_tools

    name=$(basename ${output_bam%.*})

    umi_tools dedup --umi-separator=: -I ${output_bam} --method=directional -S ${name}_clps.bam > ${name}_dedup.log

done
Caffeinated-Code commented 1 year ago

Hi @NanoCoreUSA , Thankyou very much for getting back with this work around! I tried the UMI extraction and inheritance with minimap2 in the tag. I am seeing a wider range of UMI lengths, 26-31bp and more than 50% of my reads have UMI set to none. These are from samples sequenced on the MinIon machine. Could it be possible that there is a high error rate accounting for indels in the UMIs?

NanoCoreUSA commented 1 year ago

@Caffeinated-Code Yes that post on the community website mentioned your observation as well... the solution for reads without a UMI was just to filter out the 'None' reads before further proceeding (they were only discard around 20% at most though, not 50%... if 50% of reads don't have UMI there might be something else going on).

As for the lengths, are you wanting to use UMI-tools? If so there's a script that can extract the lengths as well to get around the requirement that all the UMIs are the same length,

nrhorner commented 1 year ago

Hi @NanoCoreUSA

Do you still need help with this or has @NanoCoreUSA solved this for you?

nrhorner commented 1 year ago

No response, so closing this ticket