Open gringer opened 3 years ago
Just in case it helps, I've written a script to splice out cell barcode linker sequences and shift them to before the polyA.
In the process of doing this, it also does a 2-distance hamming correction of cell barcode and linker regions. All operations assume there are no INDELs:
https://gitlab.com/gringer/bioinfscripts/-/blob/master/synthSquish.pl
[usual disclaimers apply: I cannot guarantee that this works; use at your own risk]
This script could be used as a stop-gap measure to pre-process Rhapsody reads for use with Alevin via the undocumented custom length settings [--end 5, --barcodeLength 27, --umiLength 8]
After discovering the alternative geometry format, I see that unmodified Rhapsody reads should have the following settings:
--umi-geometry '1[53-60]' --bc-geometry '1[1-9,22-30,44-52]' --read-geometry '2[1-end]'
There's a bit of a challenge regarding error correction for the cell barcodes, in that they should be corrected in batches of 9 nucleotides (into 96 clusters of the most commonly-seen sequences).
Rhapsody has introduced a new, shorter cell barcode specification to work with 51bp R1, which looks like this:
5' PFX - CLS1 - L1 - CLS2 - L2 - CLS3 - UMI - poly(T)
9 4 9 4 9 8 8
[1-9] [14-22] [27-35][36-43]
The linker sequences are as follows:
L1: GTGA
L2: GACA
In other words,
--umi-geometry '1[36-43]' --bc-geometry '1[1-9,14-22,27-35]' --read-geometry '2[1-end]'
However... [update]
In order to remove the need for Lambda spike-ins on Illumina runs, Rhapsody has included a 0-3bp cell barcode prefix, where either nothing, or A/GT/TCA
are added to the front of some cell barcodes. Here are the full descriptions:
# Long sequence:
# 0123456789012345678901234567890123456789012345678901234567890123456789012345
# [--BC1--][----L1----][--BC2--][----L02----][--BC3--][-UMI1-][TTTTTTTTTTTTTT]
# L1 = ACTGGCCTGCGA; L2 = GGTAGCGGTGACA
# Short sequence:
# 012345678901234567890123456789012345678901234567890
# [--BC1--][L1][--BC2--][L2][--BC3--][-UMI1-][TTTTTT]
# L1 = GTGA; L2 = GACA
# Note: short sequence can also be prepended with A/GT/TCA to improve Illumina base
# call distributions, i.e.
# [--BC1--][L1][--BC2--][L2][--BC3--][-UMI1-][TTTTTT]
# A[--BC1--][L1][--BC2--][L2][--BC3--][-UMI1-][TTTTT]
# GT[--BC1--][L1][--BC2--][L2][--BC3--][-UMI1-][TTTT]
# TCA[--BC1--][L1][--BC2--][L2][--BC3--][-UMI1-][TTT]
This means that the regions defined in the geometry specification above can appear up to 3bp away from their expected region. I've updated my barcode squishing script (here) to account for this. The script identifies the cell barcode regions, corrects cell barcode sequences according to the Rhapsody Bioinformatics manual, and then shifts the linker sequences to after the UMI region, i.e.:
# 012345678901234567890123456789012345678901234567890...
# [--BC1--][--BC2--][--BC3--][-UMI1-][L1][L2][TTTTTT...]
[The prefix sequence is discarded]
After using this script to pre-process R1, with both the old and new cell barcode format (both use 9x9x9 cell barcodes), the following geometry can be used for salmon alevin
:
--umi-geometry '1[28-35]' --bc-geometry '1[1-27]' --read-geometry '2[1-end]'
I've attached files containing the 96 barcodes from each region from my most recent Rhapsody single cell sequencing run (with 51bp R1 reads). These were collected by processing reads 2M-12M from R1 of one of our files, and choosing the most abundant sequences:
zgrep '^[ACGT]\+$' demultiplexed/HGJNNDMXY_S1_L001_R1_001.fastq.gz | tail -n +2000000 | head -n 10000000 | \
perl -pe 's/^(.{7,12})(GTGA).*$/$1-$2/' | sort | uniq -c | sort -rn | awk '{if($1 > 10000){print $2,$1}}' | \
perl -pe 's/^.*?(.{9})-/$1-/' > cell_barcode_counts_BC1.txt
zgrep '^[ACGT]\+$' demultiplexed/HGJNNDMXY_S1_L001_R1_001.fastq.gz | tail -n +2000000 | head -n 10000000 | \
perl -pe 's/^(A|GT|TCA)//;s/.{9}(.{4})(.{9})(.{4}).*$/$1-$2-$3/' | colrm 20 55 | sort | uniq -c | \
awk '{if($1 > 20000){print $2, $1}}' > cell_barcode_counts_BC2.txt
zgrep '^[ACGT]\+$' demultiplexed/HGJNNDMXY_S1_L001_R1_001.fastq.gz | tail -n +2000000 | head -n 10000000 | \
perl -pe 's/^(.{7,12})GTGA(.{7,12})GACA(.{9}).*/GACA-$3/' | sort | uniq -c | sort -rn | \
awk '{if($1 > 10000){print $2, $1}}' > cell_barcode_counts_BC3.txt
for i in $(seq 1 3);
do perl -pe 's/^.*?([ACGT]{9}).*$/$1/' cell_barcode_counts_BC${i}.txt | \
sort > cell_barcodes_BC${i}.txt;
done
cell_barcodes_BC1.txt cell_barcodes_BC2.txt cell_barcodes_BC3.txt
It looks like #734 would allow this barcode method to be specified directly:
Long original read geometry - 1{b[9]f[ACTGGCCTGCGA]b[9]f[GGTAGCGGTGACA]b[9]u[8]}2{r}
Short "enhanced" read geometry - 1{x[0-3]b[9]f[GTGA]b[9]f[GACA]b[9]u[8]}2{r}
There has been a small amount of discussion about the BD Rhapsody barcode / sequence format (e.g. see here), but it would be great if the option could be integrated into the code.
BD has produced a Single Cell Genomics Bioinformatics Handbook which has the following information about the R1 read structure on pg 14:
In other words...
R2 reads are transcript-only, and are expected to match a transcript's forward strand, with matches starting within the first five nucleotides (and not match PhiX174).
To Reproduce
Using Salmon v1.4.0, installed via Debian / apt:
Expected behavior
Desktop
Additional context
This is my first run with BD Rhapsody data (and our own single cell data, for that matter). We're currently using SevenBridges for generating gene count tables, but I don't like the black-box nature of that service. I'm much more comfortable when I know what's going on under the hood, and can tweak things when I notice oddness.