bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
990 stars 355 forks source link

How to run bcbio targeted sequencing alignment and variant calling when having a 3rd UMI fastq file #3333

Closed WimSpee closed 3 years ago

WimSpee commented 4 years ago

Version info

Additional context I have some targeted sequencing data where after de-multiplexing I have 3 fastq files per sample.

  1. Sample_976.1.fastq.gz
  2. Sample_976.2.fastq.gz
  3. Sample_976.index_1.fastq.gz

In the third fastq file there is a 12 bp UMI for each read pair in the other fastq files.
Is it possible to align and variant call these using bcbio? How should the fastq sample table or final YAML file look?

I could not find this information in documentation, but maybe I did not look in the correct place.

Thank you.

WimSpee commented 4 years ago

I used fgbio AnnotateBamWithUmis to add the UMI tags to the BAM files that bwa-mem created on the R1 and R2 fastq files. http://fulcrumgenomics.github.io/fgbio/tools/latest/AnnotateBamWithUmis.html

Then marked duplicates using Picard. See the script below.

Not sure if this is also possible in bcbio. If so how. If not, if there are alternatives ways to do the same in bcbio. If not, if this is maybe something to add to bcbio.

After this script the BAMs can go in to a standard variant calling pipeline.

# Get a list of R1 fastq file and then a list of sample names
find ./ -name "*.1.fastq.gz" > R1.fastq.list
cut -d "/" -f 2 R1.fastq.list  | cut -d "." -f 1 > R1.fastq.list.samples

# Loop over the sample names to create the sam files
while read sample; do bwa mem my_ref.fa ${sample}.1.fastq.gz ${sample}.2.fastq.gz -t 20 -R "@RG\tID:${sample}\tSM:${sample}" > ${sample}.sam   ; done < R1.fastq.list.samples

# Loop over the sample names to create the bam files
while read sample; do samtools view -b  ${sample}.sam > ${sample}.bam ; done < R1.fastq.list.samples

# Loop over the sample names to create the sorted bam files
while read sample;  do samtools sort  ${sample}.sam -o  ${sample}.sorted.bam ; done < R1.fastq.list.samples

# Loop over the sample names to create the bam files that have the UMI tags for each read
while read sample;  do fgbio  AnnotateBamWithUmis -i ${sample}.sorted.bam -f ${sample}.index_1.fastq.gz-o ${sample}.sorted.umi.bam METRICS_FILE=${sample}.sorted.umi.metrics.txt BARCODE_TAG=RX   ; done < R1.fastq.list.samples

# Loop over the sample names to create the bam files that have duplicate reads marked 
while read sample;  do picard  MarkDuplicates  INPUT=${sample}.sorted.umi.bam OUTPUT=${sample}.sorted.umi.dedup.bam METRICS_FILE=${sample}.sorted.umi.dedup.metrics.txt BARCODE_TAG=RX   ; done < R1.fastq.list.samples

# Loop over the sample names to create the indexes for the deduplicated bam files
while read sample;  do samtools index  ${sample}.sorted.umi.dedup.bam ; done < R1.fastq.list.samples
naumenko-sa commented 4 years ago

Hi @WimSpee !

Yes, we do the variant calling with UMIs, to incorporate them you use: https://github.com/bcbio/bcbio-nextgen/blob/master/scripts/bcbio_fastq_umi_prep.py This converts R1, R2, R3 > R1, R2 with UMIs in the fastq header.

It is described in the somatic user story: https://bcbio-nextgen.readthedocs.io/en/latest/contents/somatic_variants.html#umis

S

WimSpee commented 4 years ago

Hi @naumenko-sa . How do all downstream tools know what to do with the UMIs in the readname?

BWA doesn't do anything with the readname, just put it as is in the BAM file.? And Picard MarkDuplicates requires he UMI to be in the RX tag? So somehow the UMI goes from being part of the readname to being in the RX tag in a BAM files?

The user story that you give is for Somatic. Do UMIs in bcbio also work for whole genome sequencing, targeted sequencing and RNAseq?

Thank you.

naumenko-sa commented 4 years ago

Hi @WimSpee!

You turn on UMI processing with umi_type: fastq_name as in here https://bcbio-nextgen.readthedocs.io/en/latest/contents/internals.html#somatic-tumor-only-variant-calling-pipeline-with-umis-step-by-step then you may see from the steps:

It all should work in the variant2 pipeline, i.e. both somatic and germline, WGS, WES, and panels.

In RNA-seq (single cell) UMIs have the whole different meaning - they are used to calculate cells/transcripts for different SC protocols: https://bcbio-nextgen.readthedocs.io/en/latest/contents/single_cell.html#parameters.

Not sure, if UMI collapsing works for variant calling in bulk RNA-seq. If you clarify that in a quick test run, that would be highly appreciated.

Sergey

naumenko-sa commented 3 years ago

closing. Please re-open if there are more UMI-related questions!