broadinstitute / longbow

Annotation and segmentation of MAS-seq data
https://broadinstitute.github.io/longbow/
BSD 3-Clause "New" or "Revised" License
20 stars 4 forks source link

missing umi and/or cell tag from alligned reads after longbow processing #127

Open khandaud15 opened 2 years ago

khandaud15 commented 2 years ago

Hi,

we just received an MAS-Iso seq long read data for single cell 3'p for test which I am trying to process as per the Longbow pipeline , the bam file from the longbow output is giving an error when I am trying to group the detected UMIs and mark the molecules with the same UMIs using UMI-Tools , it's throwing an error """ Read skipped, missing umi and/or cell tag: 61628044"""

here are the steps which I did 1- annotation, filtering, segmentation , extract 'longbow annotate -m mas15v2 -t 30 hifi_reads.bam | longbow filter | longbow segment | longbow extract -o filter_passed.bam'

2 , primary allignment of extracted bam with the hg38 genome using minimap2 'samtools fastq filtered_passed.bam | minimap2 -ayYL --MD -x splice:hq Homo_sapiens.GRCh38.dna.primary_assembly.fa - | samtools sort > align.bam && samtools index align.bam

3 - baseline transcript annotations via stringtie 2 to create new transcriptome annotations specific to our test data 'stringtie allign.bam -Lv -G gencode.v38.annotation2.gtf -o annotations.gtf -A gene_abund.out

4 - Based on the resulting transcript annotations we created a new transcript reference FASTA file using gffread 'gffread -w transcriptome.fa -g Homo_sapiens.GRCh38.dna.primary_assembly.fa annotations.gtf'

5 - aligned the extracted reads to the novel transcriptome reference using minimap2 v2.17-r941 ''samtools fastq filtered_passed.bam | minimap2 -ayYL --MD -x splice:hq transcriptome.fa - | samtools sort > align_2.bam && samtools index align_2.bam

6 - group the detected UMIs and mark the molecules with the same UMI using UMI-Tools

'umi_tools group --stdin= align_2.bam --buffer-whole-contig --no-sort-output --per-gene --gene-tag XG --extract-umi-method tag --umi-tag ZU --group-out=out.tsv --output-bam

here it is throwing an error with ''missing umi and/or cell tag from alligned reads''

could you please tell, why it's missing UMI from the longbow output files,

Thanks '

kvg commented 2 years ago

Hi Daud, Thanks for this report and the conversation we had over email on this issue. I'm replicating the answer from my email so that we can keep track of our progress.

I’ve looked into this error now and I understand what’s going on.  Essentially, there are two issues:

  1. As you mentioned, this is Single Cell 3’ data.  However, the mas15v2 model (admittedly poorly named) is meant for 5’ data.
  2. The samtools fastq command is not propagating BAM tags along to the aligned BAM file.  Thus, when umi_tools is run, it does not see the BAM tag where the UMI is meant to be stored and dies a rather inelegant death.

  Thankfully, these errors are relatively easy to fix.  I’ve included an example script with some changes to yours at the bottom of this message.  There are two major changes to your workflow.  First, I run a relatively new Longbow command called “peek” that automatically detects the appropriate model to apply.  The output on your test data looks as follows:

$ longbow peek -f -o annonate/model.txt daud.bam
[INFO 2022-03-15 15:27:46     peek] Invoked via: longbow peek -f -o annonate/model.txt daud.bam
[WARNING 2022-03-15 15:27:46 bam_utils] Output file exists: annonate/model.txt.  Overwriting.
[INFO 2022-03-15 15:27:46     peek] Running with 15 worker subprocess(es)
[INFO 2022-03-15 15:27:48     peek] Detecting best model in 100 reads
Progress: 101 read [00:45,  2.24 read/s]
[INFO 2022-03-15 15:28:33     peek] Histogram of most likely model counts:
[INFO 2022-03-15 15:28:33     peek]            mas15threeP █████████████████████████████████████████████████ 97 (96.0%)
[INFO 2022-03-15 15:28:33     peek]            mas10threeP ██ 4 (4.0%)
[INFO 2022-03-15 15:28:33     peek]                mas15v2 ▏ 0 (0.0%)
[INFO 2022-03-15 15:28:33     peek]   mas15BulkWithIndices ▏ 0 (0.0%)
[INFO 2022-03-15 15:28:33     peek]                mas10v2 ▏ 0 (0.0%)
[INFO 2022-03-15 15:28:33     peek]            slide-seqV2 ▏ 0 (0.0%)
[INFO 2022-03-15 15:28:33     peek]       mas15teloprimev2 ▏ 0 (0.0%)
[INFO 2022-03-15 15:28:33     peek]            scRNA_10x5p ▏ 0 (0.0%)
[INFO 2022-03-15 15:28:33     peek] Overall most likely model: mas15threeP (seen in 97 reads, 96.0%)
[INFO 2022-03-15 15:28:33     peek] Done. Elapsed time: 47.69s. Overall processing rate: 2.10 reads/s.

  You can see what effect that has on the annotation in the images I’ve attached for one read, processed with the mas15v2 and mas15threeP models.  In the first image, poly(T) sections are not properly annotated (given that the mas15v2 model looks for 5’ data instead).  In the second image, the poly(T) sections (and other landmarks) are annotated much more appropriately.

  The second change was to add the “-T XC,ZU” argument to samtools fastq.  That tells it to encode those particular BAM tags into the comment line of each fastq record.  Minimap2 with the -y argument then copies those tags back into the resulting BAM file (this argument to minimap2 is not documented well - a reference to it can be found here: https://lh3.github.io/minimap2/minimap2.html).

  Now the umi_tools group command should succeed:

$ umi_tools group --stdin=allignedBam/aligned.bam --buffer-whole-contig --no-sort-output --extract-umi-method tag --umi-tag ZU --group-out=out.tsv --output-bam --stdout=allignedBam/aligned.grouped.bam
# UMI-tools version: 1.1.2
# output generated by group --stdin=allignedBam/aligned.bam --buffer-whole-contig --no-sort-output --extract-umi-method tag --umi-tag ZU --group-out=out.tsv --output-bam --stdout=allignedBam/aligned.grouped.bam
# job started at Tue Mar 15 17:58:10 2022 on wm9a8-166 -- 44f80168-5d81-4b5a-99f5-7afc1a31178b
...
2022-03-15 17:58:10,167 INFO command: group --stdin=allignedBam/aligned.bam --buffer-whole-contig --no-sort-output --extract-umi-method tag --
umi-tag ZU --group-out=out.tsv --output-bam --stdout=allignedBam/aligned.grouped.bam
2022-03-15 17:58:11,161 INFO Written out 10000 reads
2022-03-15 17:58:11,274 INFO Reads: Input Reads: 11203
2022-03-15 17:58:11,274 INFO Number of reads out: 11203, Number of groups: 11200
2022-03-15 17:58:11,274 INFO Total number of positions deduplicated: 8143
2022-03-15 17:58:11,274 INFO Mean number of unique UMIs per position: 1.38
2022-03-15 17:58:11,274 INFO Max. number of unique UMIs per position: 19

  Hopefully that should fix the issue you were seeing.  A few things to clean up on our end: we’ll update our documentation to reflect our new “peek” command and carrying over the CBC/UMI tag to BAM files for use with downstream tools.

Finally, here's the modified script:

#!/bin/bash

set -euxo pipefail

INPUT_BAM="daud.bam"
THREADS=8
ANNOTATE_OUTDIR="annonate"
ALIGNMENT_OUTDIR="allignedBam"

mkdir -p $ANNOTATE_OUTDIR
mkdir -p $ALIGNMENT_OUTDIR

longbow version

# The new Longbow peek command allows the user to detect the array structure automatically.
longbow peek -f -o $ANNOTATE_OUTDIR/model.txt $INPUT_BAM

# These are debugging tools, useful for seeing how the HMM applied the model to the data.
longbow inspect --seg-score -r "m64152e_220303_222034/14/ccs" -m mas15v2 -o images/mas15v2 daud.bam
longbow inspect --seg-score -r "m64152e_220303_222034/14/ccs" -m mas15threeP -o images/mas15threeP daud.bam

longbow annotate -m $(cat $ANNOTATE_OUTDIR/model.txt) -t $THREADS $INPUT_BAM | \
  longbow filter | \
  longbow segment | \
  longbow extract -f -o $ANNOTATE_OUTDIR/extracted.bam

# Check that we have the ZU and XC tags in the extracted BAM file.
samtools view -c $ANNOTATE_OUTDIR/extracted.bam
samtools view $ANNOTATE_OUTDIR/extracted.bam | sed 's/\t/\n/g' | grep -c '^ZU'
samtools view $ANNOTATE_OUTDIR/extracted.bam | sed 's/\t/\n/g' | grep -c '^XC'

# Here, we need to tell samtools fastq to propogate the cell barcode and UMI tags to the aligned file.
samtools fastq -T XC,ZU $ANNOTATE_OUTDIR/extracted.bam | \
  minimap2 -ayYL --MD -x splice:hq GCA_000001405.15_GRCh38_no_alt_analysis_set.fa - | \
  samtools sort > $ALIGNMENT_OUTDIR/aligned.bam

samtools index $ALIGNMENT_OUTDIR/aligned.bam

# Recheck that we have the ZU and XC tags in the aligned BAM file.
samtools view -c $ALIGNMENT_OUTDIR/aligned.bam
samtools view $ALIGNMENT_OUTDIR/aligned.bam | sed 's/\t/\n/g' | grep -c '^ZU'
samtools view $ALIGNMENT_OUTDIR/aligned.bam | sed 's/\t/\n/g' | grep -c '^XC'

# Now with the proper array model and propagated tags, the umi_tools command should succeed.
# Note that because I'm aligning to the genome and skipping some of the gene annotation steps,
# and because I'm working with a tiny subset of the data, I'm omitting the arguments where
# we tell umi_tools to look at the gene name tag in the BAM file.
umi_tools group --stdin=$ALIGNMENT_OUTDIR/aligned.bam \
                --buffer-whole-contig \
                --no-sort-output \
                --extract-umi-method tag \
                --umi-tag ZU \
                --group-out=out.tsv \
                --output-bam \
                --stdout=$ALIGNMENT_OUTDIR/aligned.grouped.bam

Thanks, -Kiran

m64152e_220303_222034_14_ccs.mas15v2.pdf m64152e_220303_222034_14_ccs.mas15threeP.pdf