broadinstitute / picard

A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
https://broadinstitute.github.io/picard/
MIT License
984 stars 369 forks source link

IlluminaBasecallsToSam skipping barcodes #1774

Open nchernia opened 2 years ago

nchernia commented 2 years ago

Bug Report

Affected tool(s)

IlluminaBasecallsToSam (possibly ExtractIlluminaBarcodes)

Affected version(s)

Version:2.26.3

Description

I am using ExtractIlluminaBarcodes and IlluminaBasecallsToSam as a first demultiplexing step to separate reads by library barcode. I am comparing to doing two passes via bcl2fastq and then directly examining the index reads to demultiplex by library.

The bam files produced by IlluminaBasecallsToSam are missing barcodes that exist in bcl2fastq. These are exact matches with high quality basecall strings, e.g. FFFFFFFF. I have tested this on two different bcl files so far and in both cases there's approximately 0.1% of reads missing.

Since ExtractIlluminaBarcodes is called before IlluminaBasecallsToSam and reports the number of matches per library (which is equal to the number of reads in the corresponding library bam file), this might be an error in ExtractIlluminaBarcodes instead of IlluminaBasecallsToSam.

Steps to reproduce

ExtractIlluminaBarcodes BASECALLS_DIR="Data/Intensities/BaseCalls" TMP_DIR=. OUTPUT_DIR=. \ BARCODE_FILE=barcode_params.tsv METRICS_FILE=barcode_metrics1.tsv \ READ_STRUCTURE=50T14S10M29S10M27S9M8B50T \ 
LANE=1 NUM_PROCESSORS=4 COMPRESSION_LEVEL=1 GZIP=true

barcode_params.tsv

barcode_sequence_1      barcode_name    library_name
GCGATCTA        P1.01   P1.01
ATAGAGAG        P1.02   P1.02
AGAGGATA        P1.03   P1.03
TCTACTCT        P1.04   P1.04
CTCCTTAC        P1.05   P1.05
TATGCAGT        P1.06   P1.06
TACTCCTT        P1.07   P1.07
AGGCTTAG        P1.08   P1.08
GATTTCCA        P1.09   P1.09
ATCATGTT        P1.10   P1.10
TTTCATCA        P1.11   P1.11
AGTCCGAC        P1.12   P1.12
GCTAGAAA        P1.13   P1.13
CTTGGTTA        P1.14   P1.14
CGATACAC        P1.15   P1.15
TTGATGGA        P1.16   P1.16
IlluminaBasecallsToSam BASECALLS_DIR="Data/Intensities/BaseCalls" BARCODES_DIR=. TMP_DIR=. \
LIBRARY_PARAMS=library_params1.tsv READ_STRUCTURE=50T14S10M29S10M27S9M8B50T  \
LANE=1 RUN_BARCODE="A01253:322:HNGMYDSX2" \
SEQUENCING_CENTER="BI" NUM_PROCESSORS=4 \
IGNORE_UNEXPECTED_BARCODES=true 

library_params1.tsv

OUTPUT  SAMPLE_ALIAS    LIBRARY_NAME    BARCODE_1
P1.01.bam       P1.01   P1.01   GCGATCTA
P1.02.bam       P1.02   P1.02   ATAGAGAG
P1.03.bam       P1.03   P1.03   AGAGGATA
P1.04.bam       P1.04   P1.04   TCTACTCT
P1.05.bam       P1.05   P1.05   CTCCTTAC
P1.06.bam       P1.06   P1.06   TATGCAGT
P1.07.bam       P1.07   P1.07   TACTCCTT
P1.08.bam       P1.08   P1.08   AGGCTTAG
P1.09.bam       P1.09   P1.09   GATTTCCA
P1.10.bam       P1.10   P1.10   ATCATGTT
P1.11.bam       P1.11   P1.11   TTTCATCA
P1.12.bam       P1.12   P1.12   AGTCCGAC
P1.13.bam       P1.13   P1.13   GCTAGAAA
P1.14.bam       P1.14   P1.14   CTTGGTTA
P1.15.bam       P1.15   P1.15   CGATACAC
P1.16.bam       P1.16   P1.16   TTGATGGA

Expected behavior

All reads with matching library barcode are in appropriate bam file

Actual behavior

0.1% of reads with exact match barcodes are missing.

nchernia commented 2 years ago

An update: I have tried running with a simpler read structure (50T99S8B50T) and this did not change the results. I have also tried deleting the barcodes files and running just IlluminaBasecallsToSam with MATCH_BARCODES_INLINE=true and this also did not change the results.

gbggrant commented 2 years ago

Hi @nchernia have you tried running this on a newer version of Picard (the latest release is 2.26.10)? We fixed a number of errors in EIB and IBCTo[Sam|Fastq] last year (although this does not sound like one of those).

Also, is it possible that the missing barcodes are those that do not exactly match the input barcodes list (have more than one error)? That would be indicated in the metrics file generated by EIB. If possible, can you upload that file too?

Thanks.

George

nchernia commented 2 years ago

Hi George,

Thanks for getting back to me.

I've rerun with the latest jar and the results are the same.

Here is the head of the metrics file.

## htsjdk.samtools.metrics.StringHeader
# IlluminaBasecallsToSam RUN_BARCODE=A01253:322:HNGMYDSX2 SEQUENCING_CENTER=BI LIBRARY_PARAMS=/mnt/disks/big_data/211110_SL-NVV_0322_AHNGMYDSX2/library_params1.tsv NUM_PROCESSORS=0
 APPLY_EAMSS_FILTER=false IGNORE_UNEXPECTED_BARCODES=true MATCH_BARCODES_INLINE=true LANE=[1] READ_STRUCTURE=50T14S10M28S10M28S9M8B50T BASECALLS_DIR=Data/Intensities/BaseCalls METRICS_FILE=metrics1.tsv TMP_DIR=[.] MAX_RECORDS_IN_RAM=5000000    PLATFORM=ILLUMINA INCLUDE_BC_IN_RG_TAG=false ADAPTERS_TO_CHECK=[INDEXED, DUAL_INDEXED, NEXTERA_V2, FLUIDIGM] MAX_READS_IN_RAM_PER_TILE=-1 INCLUDE_NON_PF_READS=true MOLECULAR_INDEX_TAG=RX MOLECULAR_INDEX_BASE_QUALITY_TAG=QX BARCODE_POPULATION_STRATEGY=ORPHANS_ONLY INCLUDE_BARCODE_QUALITY=false SO RT=true DISTANCE_MODE=HAMMING MAX_MISMATCHES=1 MIN_MISMATCH_DELTA=1 MAX_NO_CALLS=2 MINIMUM_BASE_QUALITY=0 MINIMUM_QUALITY=2 COMPRESS_OUTPUTS=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
## htsjdk.samtools.metrics.StringHeader
# Started on: Wed Feb 02 21:51:42 UTC 2022
## METRICS CLASS        picard.illumina.BarcodeMetric
BARCODE BARCODE_WITHOUT_DELIMITER       BARCODE_NAME    LIBRARY_NAME    READS   PF_READS      PERFECT_MATCHES PF_PERFECT_MATCHES      ONE_MISMATCH_MATCHES    PF_ONE_MISMATCH_MATCHES     PCT_MATCHES     RATIO_THIS_BARCODE_TO_BEST_BARCODE_PCT  PF_PCT_MATCHES PF_RATIO_THIS_BARCODE_TO_BEST_BARCODE_PCT       PF_NORMALIZED_MATCHES
GCGATCTA        GCGATCTA                        250571769       250571769       245862068       245862068       4709701 4709701 0.102852        1       0.102852        1       1.64564
ATAGAGAG        ATAGAGAG                        92371331        92371331        86261451        86261451        6109880 6109880 0.037916        0.368642        0.037916        0.368642    0.606652
AGAGGATA        AGAGGATA                        104698897       104698897       100114893       100114893       4584004 4584004 0.042976        0.41784 0.042976        0.41784 0.687614

This run is with IBtoS directly, but when I run EIB and IBtoS the counts are the same, except that the metrics file produced via the flag METRICS_FILE (and INLINE_BARCODES) doesn't contain the library name; moreover as another issue reported, the NNNN string is reported as all 0s when running IBtoS directly, and so the percentages are different.

Looking at the metrics file when running EIB and IBtoS, I can sum the numbers in the READS column. This total number, which includes reads that matched one of the passed in barcodes and reads that didn't, sums to the total number of reads in the Undetermined_I2 fastq file produced by bcl2fastq. The number of reads reported in the READS column also matches the number of reads in the BAM file.

It seems as if Picard is reporting these as unmatched when they appear as though they should be perfect matches and assigned to a library. I am wondering if there's some other filter that's not obvious from the options that Picard is doing?

Here is an example of a read from the above - you can see it matches the first barcode perfectly. It is not in the resulting bam file.

~>zcat Data/Intensities/BaseCalls/Undetermined_S0_L001_I2_001.fastq.gz | grep -m 1 -A3 A01253:322:HNGMYDSX2:1:1102:19859:22106 
@A01253:322:HNGMYDSX2:1:1102:19859:22106 2:N:0:0
GCGATCTA
+
FFFFFFFF
nchernia commented 2 years ago

Hello,

I've confirmed this bug on a third BCL folder, on a different experiment (this time ChiP-seq, with a simple read structure 76T8B). I am at the Broad and happy to share results with you so you can continue to debug. It's consistently a small but persistent error, affecting a small percentage of reads. For something like ChiP-seq, maybe losing a small number of reads isn't important, but for single cell experiments we don't want to lose any reads, and this bug would force us to use bcl2fastq (which we'd prefer not to do, since we'd then have to rewrite the Picard functionality of mismatch tolerance and writing to bams).

nchernia commented 2 years ago

Please do let me know if you'd like me to share an example BCL together with the other files. They are stored on a Google bucket. My Broad username is neva if you'd like to email instead.