COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
772 stars 162 forks source link

Feature request (Alevin): add rhapsody barcode mode #628

Open gringer opened 3 years ago

gringer commented 3 years ago

There has been a small amount of discussion about the BD Rhapsody barcode / sequence format (e.g. see here), but it would be great if the option could be integrated into the code.

BD has produced a Single Cell Genomics Bioinformatics Handbook which has the following information about the R1 read structure on pg 14:

5' CLS1 - L1 - CLS2 - L2 - CLS3 - UMI - poly(T)
      9   12      9   13      9     8        18
  [1-9]     [22-30]      [44-52][53-60]

Cell Label Information of the cell label is captured by bases in three sections (CLS1, CLS2, CLS3) along each R1 read. Two common sequences (L1, L2) separate the three CLSs, and the presence of L1 and L2 relates to the way the capture oligonucleotide probes on the beads are constructed. By design, each CLS has one of 96 predefined sequences, which has a Hamming distance of at least four bases and an edit distance of at least two bases apart. A cell label is defined by the unique combination of predefined sequences in the three CLSs. Thus, the maximum possible number of cell labels is 96^3 (884,736). A cell label is represented by an index between 1–96^3.

Reads are first checked for perfect matches in all three pre-designed CLS sequences at the expected locations, CLS1: position 1–9, CLS2: position 22–30, and CLS3: position 44–52. Reads with perfect matches are kept.

In other words...

UMI By design, the UMI is a string of eight randomers immediately downstream of CLS3. If the CLSs have perfect matches or base substitutions, the UMI sequence is at position 53–60. For reads with insertions or deletions within the CLSs, the UMI sequence is eight bases immediately following the end of the identified CLS3.

R2 reads are transcript-only, and are expected to match a transcript's forward strand, with matches starting within the first five nucleotides (and not match PhiX174).

To Reproduce

Using Salmon v1.4.0, installed via Debian / apt:

$ salmon alevin -l ISR -1  INDEXLIBNOVASEQ_M001_R1.fastq.gz -2 INDEXLIBNOVASEQ_M001_R2.fastq.gz --rhapsody  -i /mnt/ufds/salmon/gencode_M23/salmon_1.4.0_decoy_M23 -p 10 -o alevin_output --tgMap /mnt/ufds/salmon/gencode_M23/txp2gene_gencode.vM23.tsv
Exception : [unrecognised option '--rhapsody']. Exiting.

Expected behavior

Logs will be written to alevin_output/logs
[2021-02-10 12:57:59.773] [jointLog] [info] setting maxHashResizeThreads to 10
[2021-02-10 12:57:59.773] [jointLog] [info] Fragment incompatibility prior below threshold.  Incompatible fragments will be ignored.
[2021-02-10 12:57:59.773] [jointLog] [info] The --mimicBT2, --mimicStrictBT2 and --hardFilter flags imply mapping validation (--validateMappings). Enabling mapping validation.
[2021-02-10 12:57:59.774] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2021-02-10 12:57:59.774] [jointLog] [info] The use of range-factorized equivalence classes does not make sense in conjunction with --hardFilter.  Disabling range-factorized equivalence classes. 
[2021-02-10 12:57:59.774] [jointLog] [info] Setting consensusSlack to selective-alignment default of 0.35.
[2021-02-10 12:57:59.774] [jointLog] [info] Using default value of 0.87 for minScoreFraction in Alevin
Using default value of 0.6 for consensusSlack in Alevin
...
[etc.]

Desktop

Additional context

This is my first run with BD Rhapsody data (and our own single cell data, for that matter). We're currently using SevenBridges for generating gene count tables, but I don't like the black-box nature of that service. I'm much more comfortable when I know what's going on under the hood, and can tweak things when I notice oddness.

gringer commented 3 years ago

Just in case it helps, I've written a script to splice out cell barcode linker sequences and shift them to before the polyA.

In the process of doing this, it also does a 2-distance hamming correction of cell barcode and linker regions. All operations assume there are no INDELs:

https://gitlab.com/gringer/bioinfscripts/-/blob/master/synthSquish.pl

[usual disclaimers apply: I cannot guarantee that this works; use at your own risk]

This script could be used as a stop-gap measure to pre-process Rhapsody reads for use with Alevin via the undocumented custom length settings [--end 5, --barcodeLength 27, --umiLength 8]

gringer commented 3 years ago

After discovering the alternative geometry format, I see that unmodified Rhapsody reads should have the following settings:

--umi-geometry '1[53-60]' --bc-geometry '1[1-9,22-30,44-52]' --read-geometry '2[1-end]'

There's a bit of a challenge regarding error correction for the cell barcodes, in that they should be corrected in batches of 9 nucleotides (into 96 clusters of the most commonly-seen sequences).

gringer commented 1 year ago

Rhapsody has introduced a new, shorter cell barcode specification to work with 51bp R1, which looks like this:

5' PFX - CLS1 - L1 - CLS2 - L2 - CLS3 - UMI - poly(T)
           9   4       9   4       9     8        8
        [1-9]     [14-22]      [27-35][36-43]

The linker sequences are as follows:

L1: GTGA L2: GACA

In other words,

--umi-geometry '1[36-43]' --bc-geometry '1[1-9,14-22,27-35]' --read-geometry '2[1-end]'

However... [update]

In order to remove the need for Lambda spike-ins on Illumina runs, Rhapsody has included a 0-3bp cell barcode prefix, where either nothing, or A/GT/TCA are added to the front of some cell barcodes. Here are the full descriptions:

  # Long sequence:
  # 0123456789012345678901234567890123456789012345678901234567890123456789012345
  # [--BC1--][----L1----][--BC2--][----L02----][--BC3--][-UMI1-][TTTTTTTTTTTTTT]
  # L1 = ACTGGCCTGCGA; L2 = GGTAGCGGTGACA
  # Short sequence:
  # 012345678901234567890123456789012345678901234567890
  # [--BC1--][L1][--BC2--][L2][--BC3--][-UMI1-][TTTTTT]
  # L1 = GTGA; L2 = GACA
  # Note: short sequence can also be prepended with A/GT/TCA to improve Illumina base
  # call distributions, i.e.
  # [--BC1--][L1][--BC2--][L2][--BC3--][-UMI1-][TTTTTT]
  # A[--BC1--][L1][--BC2--][L2][--BC3--][-UMI1-][TTTTT]
  # GT[--BC1--][L1][--BC2--][L2][--BC3--][-UMI1-][TTTT]
  # TCA[--BC1--][L1][--BC2--][L2][--BC3--][-UMI1-][TTT]

This means that the regions defined in the geometry specification above can appear up to 3bp away from their expected region. I've updated my barcode squishing script (here) to account for this. The script identifies the cell barcode regions, corrects cell barcode sequences according to the Rhapsody Bioinformatics manual, and then shifts the linker sequences to after the UMI region, i.e.:

# 012345678901234567890123456789012345678901234567890...
# [--BC1--][--BC2--][--BC3--][-UMI1-][L1][L2][TTTTTT...]

[The prefix sequence is discarded]

After using this script to pre-process R1, with both the old and new cell barcode format (both use 9x9x9 cell barcodes), the following geometry can be used for salmon alevin:

--umi-geometry '1[28-35]' --bc-geometry '1[1-27]' --read-geometry '2[1-end]'

I've attached files containing the 96 barcodes from each region from my most recent Rhapsody single cell sequencing run (with 51bp R1 reads). These were collected by processing reads 2M-12M from R1 of one of our files, and choosing the most abundant sequences:

zgrep '^[ACGT]\+$' demultiplexed/HGJNNDMXY_S1_L001_R1_001.fastq.gz | tail -n +2000000 | head -n 10000000 | \
  perl -pe 's/^(.{7,12})(GTGA).*$/$1-$2/' | sort | uniq -c | sort -rn | awk '{if($1 > 10000){print $2,$1}}' | \
  perl -pe 's/^.*?(.{9})-/$1-/' > cell_barcode_counts_BC1.txt
zgrep '^[ACGT]\+$' demultiplexed/HGJNNDMXY_S1_L001_R1_001.fastq.gz | tail -n +2000000 | head -n 10000000 | \
  perl -pe 's/^(A|GT|TCA)//;s/.{9}(.{4})(.{9})(.{4}).*$/$1-$2-$3/' | colrm 20 55 | sort | uniq -c | \
  awk '{if($1 > 20000){print $2, $1}}' > cell_barcode_counts_BC2.txt
zgrep '^[ACGT]\+$' demultiplexed/HGJNNDMXY_S1_L001_R1_001.fastq.gz | tail -n +2000000 | head -n 10000000 | \
perl -pe 's/^(.{7,12})GTGA(.{7,12})GACA(.{9}).*/GACA-$3/' | sort | uniq -c | sort -rn | \
  awk '{if($1 > 10000){print $2, $1}}' > cell_barcode_counts_BC3.txt
for i in $(seq 1 3);
  do perl -pe 's/^.*?([ACGT]{9}).*$/$1/' cell_barcode_counts_BC${i}.txt | \
    sort > cell_barcodes_BC${i}.txt;
done

cell_barcodes_BC1.txt cell_barcodes_BC2.txt cell_barcodes_BC3.txt

gringer commented 1 year ago

It looks like #734 would allow this barcode method to be specified directly:

Long original read geometry - 1{b[9]f[ACTGGCCTGCGA]b[9]f[GGTAGCGGTGACA]b[9]u[8]}2{r} Short "enhanced" read geometry - 1{x[0-3]b[9]f[GTGA]b[9]f[GACA]b[9]u[8]}2{r}