Closed NanoCoreUSA closed 1 year ago
Did you ever get an answer for this/find a solution?
I made a post on the Nanopore forums a couple of months ago and received some replies... I haven't gotten around to testing them yet but there were a few solutions that seemed worthwhile. Here's the link to the discussion -
https://community.nanoporetech.com/posts/handling-umis-in-new-pcr-c
Let me know if you don't have trouble accessing the forum and I can just summarize the contents here.
Hi @NanoCoreUSA , We have been testing some sample runs with the new PCR kit - SQK-PCB111.24. Is this an alias of the PCS111 mentioned in the examples? Any inputs on using pychopper with this new lot with UMI extraction will be very helpful! Unfortunately, the link is not directed to the post.
Hi @Caffeinated-Code
So yes, supposedly pychopper can extract UMIs from the new kit with the 'PCS111' flag. There's some caveats though as outlined in that post (I think you have to be signed in to the Nanopore Community for the link to work...).
I'll attempt to summarize here -
The newest version of Pychopper can be used with the 111 kits that have UMIs by running it as:
pychopper -U -y -k PCS111
The -U option signals it should be detecting UMIs and the -y pushes this requisite extra info through to the aligner to subsequently attach it to the bam/sam file that it outputs. You will have to run minmiap2 with the -y option as well to get the UMI info to show up in the bam file.
My minimap2 command looks something like this:
minimap2 -y --eqx -k 12 -uf -ax splice genome.fa in.fastq > out.sam
Which will then produce a bam file with reads containing all this info below the sequence and Q:
The UMI should now be at the bottom right, after the RX:Z: tag.
The problem with this compared to short-read UMIs and their tools is that a typical short read dataset with a UMI present in the bam (which can then be used with umi-tools dedupe of UMICollpase) will have the umi attached to the QNAME field like this:
samtools view hisat_mapped_in.bam | head -1
NS500402:643:H5LCHBGXL:4:11612:2705:16953:ACAACG 163 DDB0191165|DDB_G0267380 1591 60 27M = 1593 27 GATAAATTATATGGTAGAGGTACAACT 6EEEEE/A/EE<EAEEEEEE/EEEEEE AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0
And most tools I've been testing (umi-tools and UMICollapse) look for it there, resulting in...issues. I wish I could provide a nice solution to this but my current fix is to just move the UMI from the RX:Z tag to the QNAME using a bash/awk script + samtools:
samtools view -h "${input_bam}" | \
awk 'BEGIN {FS=OFS="\t"}
{
# Append RX:Z tag value to QNAME
if ($0 ~ "RX:Z:") {
split($0, a, "RX:Z:")
split(a[2], b, "\t")
$1 = $1 ":" b[1]
}
print
}' | \
samtools view -bS -o "${output_bam}" -
Maybe someone else knows better than I, but I couldn't figure out a way to get UMICollapse to find the UMI from the RX:Z: tag location.
Other notable problems I've come across and my hacked workarounds:
1) Not all UMI lengths are the same.
If you count up the read lengths of the UMIs they vary by a nucleotide or so in length. After moving the UMI to the read name as I showed above you can create a CSV of the read name, UMI, and its length with another short awk script:
#!/usr/bin/mawk -f
BEGIN {
FS = "\t" OFS = ","
print "Read Name", "UMI", "UMI Length"
}
{
# Extract UMI from read name
split($1, a, ":")
umi = a[length(a)]
# Count length of UMI
umi_length = length(umi)
# Print
print $1, umi, umi_length
}
samtools view ${bamfile} | mawk -f count_umi_length_in_bam.awk | head -5
Which shows the varying lengths in the last field:
This will throw an error with umi_tools dedup:
AssertionError: not all umis are the same length(!): 28 - 29
If you want to use umi_tools dedup my hacked workaround is to break up the bam file based on umi length, then collapse each bam file individually, then merge them back together.
#!/bin/bash
# Input BAM with UMI appended to read name
input_bam=input.bam
# Separator of UMI
separator=":"
# Get all UMI lengths
umi_lengths=$(samtools view "${input_bam}" | awk -F $separator '{print length($NF)}' | sort -u)
# Loop over each umi length
for umi_length in $umi_lengths; do
# Output BAM file for current UMI length
output_bam="output_${umi_length}.bam"
# Filter reads with current length and save to new BAM
samtools view -h "${input_bam}" | awk -F $separator -v umi_length=$umi_length '{if (length($NF) == umi_length || substr($1, 1, 1) == "@") print}' | samtools view -bS - > "${output_bam}"
# Index output BAM file samtools index "${output_bam}"
# Run umi_tools
name=$(basename ${output_bam%.*})
umi_tools dedup --umi-separator=: -I ${output_bam} --method=directional -S ${name}_clps.bam > ${name}_dedup.log
done
Hi @NanoCoreUSA , Thankyou very much for getting back with this work around! I tried the UMI extraction and inheritance with minimap2 in the tag. I am seeing a wider range of UMI lengths, 26-31bp and more than 50% of my reads have UMI set to none. These are from samples sequenced on the MinIon machine. Could it be possible that there is a high error rate accounting for indels in the UMIs?
@Caffeinated-Code Yes that post on the community website mentioned your observation as well... the solution for reads without a UMI was just to filter out the 'None' reads before further proceeding (they were only discard around 20% at most though, not 50%... if 50% of reads don't have UMI there might be something else going on).
As for the lengths, are you wanting to use UMI-tools? If so there's a script that can extract the lengths as well to get around the requirement that all the UMIs are the same length,
Hi @NanoCoreUSA
Do you still need help with this or has @NanoCoreUSA solved this for you?
No response, so closing this ticket
Is pychopper compatible with the updated PCR-cDNA kit (SQK-PCB111.24)? Specifically, is the recommended way to handle the UMIs that are incorporated during the library prep? We have recently completed some runs using the new kit and I'm struggling with how to properly deal with collapsing the UMIs. Any advice would be appreciated.