Open nchernia opened 2 years ago
An update: I have tried running with a simpler read structure (50T99S8B50T) and this did not change the results. I have also tried deleting the barcodes files and running just IlluminaBasecallsToSam with MATCH_BARCODES_INLINE=true and this also did not change the results.
Hi @nchernia have you tried running this on a newer version of Picard (the latest release is 2.26.10)? We fixed a number of errors in EIB and IBCTo[Sam|Fastq] last year (although this does not sound like one of those).
Also, is it possible that the missing barcodes are those that do not exactly match the input barcodes list (have more than one error)? That would be indicated in the metrics file generated by EIB. If possible, can you upload that file too?
Thanks.
George
Hi George,
Thanks for getting back to me.
I've rerun with the latest jar and the results are the same.
Here is the head of the metrics file.
## htsjdk.samtools.metrics.StringHeader
# IlluminaBasecallsToSam RUN_BARCODE=A01253:322:HNGMYDSX2 SEQUENCING_CENTER=BI LIBRARY_PARAMS=/mnt/disks/big_data/211110_SL-NVV_0322_AHNGMYDSX2/library_params1.tsv NUM_PROCESSORS=0
APPLY_EAMSS_FILTER=false IGNORE_UNEXPECTED_BARCODES=true MATCH_BARCODES_INLINE=true LANE=[1] READ_STRUCTURE=50T14S10M28S10M28S9M8B50T BASECALLS_DIR=Data/Intensities/BaseCalls METRICS_FILE=metrics1.tsv TMP_DIR=[.] MAX_RECORDS_IN_RAM=5000000 PLATFORM=ILLUMINA INCLUDE_BC_IN_RG_TAG=false ADAPTERS_TO_CHECK=[INDEXED, DUAL_INDEXED, NEXTERA_V2, FLUIDIGM] MAX_READS_IN_RAM_PER_TILE=-1 INCLUDE_NON_PF_READS=true MOLECULAR_INDEX_TAG=RX MOLECULAR_INDEX_BASE_QUALITY_TAG=QX BARCODE_POPULATION_STRATEGY=ORPHANS_ONLY INCLUDE_BARCODE_QUALITY=false SO RT=true DISTANCE_MODE=HAMMING MAX_MISMATCHES=1 MIN_MISMATCH_DELTA=1 MAX_NO_CALLS=2 MINIMUM_BASE_QUALITY=0 MINIMUM_QUALITY=2 COMPRESS_OUTPUTS=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
## htsjdk.samtools.metrics.StringHeader
# Started on: Wed Feb 02 21:51:42 UTC 2022
## METRICS CLASS picard.illumina.BarcodeMetric
BARCODE BARCODE_WITHOUT_DELIMITER BARCODE_NAME LIBRARY_NAME READS PF_READS PERFECT_MATCHES PF_PERFECT_MATCHES ONE_MISMATCH_MATCHES PF_ONE_MISMATCH_MATCHES PCT_MATCHES RATIO_THIS_BARCODE_TO_BEST_BARCODE_PCT PF_PCT_MATCHES PF_RATIO_THIS_BARCODE_TO_BEST_BARCODE_PCT PF_NORMALIZED_MATCHES
GCGATCTA GCGATCTA 250571769 250571769 245862068 245862068 4709701 4709701 0.102852 1 0.102852 1 1.64564
ATAGAGAG ATAGAGAG 92371331 92371331 86261451 86261451 6109880 6109880 0.037916 0.368642 0.037916 0.368642 0.606652
AGAGGATA AGAGGATA 104698897 104698897 100114893 100114893 4584004 4584004 0.042976 0.41784 0.042976 0.41784 0.687614
This run is with IBtoS directly, but when I run EIB and IBtoS the counts are the same, except that the metrics file produced via the flag METRICS_FILE (and INLINE_BARCODES) doesn't contain the library name; moreover as another issue reported, the NNNN string is reported as all 0s when running IBtoS directly, and so the percentages are different.
Looking at the metrics file when running EIB and IBtoS, I can sum the numbers in the READS column. This total number, which includes reads that matched one of the passed in barcodes and reads that didn't, sums to the total number of reads in the Undetermined_I2 fastq file produced by bcl2fastq. The number of reads reported in the READS column also matches the number of reads in the BAM file.
It seems as if Picard is reporting these as unmatched when they appear as though they should be perfect matches and assigned to a library. I am wondering if there's some other filter that's not obvious from the options that Picard is doing?
Here is an example of a read from the above - you can see it matches the first barcode perfectly. It is not in the resulting bam file.
~>zcat Data/Intensities/BaseCalls/Undetermined_S0_L001_I2_001.fastq.gz | grep -m 1 -A3 A01253:322:HNGMYDSX2:1:1102:19859:22106
@A01253:322:HNGMYDSX2:1:1102:19859:22106 2:N:0:0
GCGATCTA
+
FFFFFFFF
Hello,
I've confirmed this bug on a third BCL folder, on a different experiment (this time ChiP-seq, with a simple read structure 76T8B). I am at the Broad and happy to share results with you so you can continue to debug. It's consistently a small but persistent error, affecting a small percentage of reads. For something like ChiP-seq, maybe losing a small number of reads isn't important, but for single cell experiments we don't want to lose any reads, and this bug would force us to use bcl2fastq (which we'd prefer not to do, since we'd then have to rewrite the Picard functionality of mismatch tolerance and writing to bams).
Please do let me know if you'd like me to share an example BCL together with the other files. They are stored on a Google bucket. My Broad username is neva if you'd like to email instead.
Bug Report
Affected tool(s)
IlluminaBasecallsToSam (possibly ExtractIlluminaBarcodes)
Affected version(s)
Version:2.26.3
Description
I am using ExtractIlluminaBarcodes and IlluminaBasecallsToSam as a first demultiplexing step to separate reads by library barcode. I am comparing to doing two passes via bcl2fastq and then directly examining the index reads to demultiplex by library.
The bam files produced by IlluminaBasecallsToSam are missing barcodes that exist in bcl2fastq. These are exact matches with high quality basecall strings, e.g. FFFFFFFF. I have tested this on two different bcl files so far and in both cases there's approximately 0.1% of reads missing.
Since ExtractIlluminaBarcodes is called before IlluminaBasecallsToSam and reports the number of matches per library (which is equal to the number of reads in the corresponding library bam file), this might be an error in ExtractIlluminaBarcodes instead of IlluminaBasecallsToSam.
Steps to reproduce
barcode_params.tsv
library_params1.tsv
Expected behavior
All reads with matching library barcode are in appropriate bam file
Actual behavior
0.1% of reads with exact match barcodes are missing.