PacificBiosciences / pbbioconda

PacBio Secondary Analysis Tools on Bioconda. Contains list of PacBio packages available via conda.
BSD 3-Clause Clear License

Lima failing to detect CCS data after Skera de-concatenation & Bam2Fastq conversion #681

Closed (roylejw closed this issue 1 month ago)

roylejw commented 3 months ago

Operating system Amazon Linux 2

Package name Lima v2.9, Skera v1.2

Conda environment

# packages in environment at /home/ec2-user/miniconda3/envs/pbtk:
#
# Name                    Version                   Build  Channel
lima                      2.9.0                h9ee0642_1    bioconda
pbskera                   1.2.0                hdfd78af_0    bioconda
pbtk                      3.1.0                h9ee0642_0    bioconda

Describe the bug Lima fails to identify CCS reads after we use pbskera to de-concatenate into S-reads (both through SMRT Link and via the command line) and then convert the result with bam2fastq. This was not an issue prior to Kinnex datasets: our workflow on native FASTQ-converted reads did not cause this error. I found only one other report of this warning, from 2021, and it doesn't appear to be relevant.

Error message 20240430 04:09:25.288 | WARN | Attention! You are trying to demultiplex non CCS data. CLR demultiplexing is only supported with BAM/XML input! Will proceed to demultiplex each sequence individually, not grouped by ZMW!

To Reproduce

  1. De-concat raw reads with skera
  2. bam2fastq -u -o reads skera.bam
  3. lima --hifi-preset ASYMMETRIC --biosample-csv barcode-sample-16S.csv --split-named --output-missing-pairs input.fastq kinnex16S.fasta demux.fastq

Expected behavior Our workflow is the same for pre-Kinnex and post-Kinnex data, and this error only occurs with Kinnex datasets. It is perhaps due to a header change. The new FASTQ header contains an extra field compared to the old datasets:

Old: @m84073_240328_065715_s1/133239718/ccs
New: @m84073_240426_082659_s4/250483516/ccs/16_1598
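To make the header difference concrete, here is a small sketch that counts the `/`-separated fields in a read name. The helper name `count_name_fields` is made up for illustration, and the meaning of the fourth field is not stated in this thread, so the sketch only shows that the field is present and claims nothing about how lima interprets it.

```shell
# Hypothetical helper: count '/'-separated fields in a FASTQ read name.
# Strips the leading '@' first so it does not affect the first field.
count_name_fields() {
  printf '%s\n' "${1#@}" | awk -F'/' '{ print NF }'
}

# The two headers quoted in the report:
count_name_fields '@m84073_240328_065715_s1/133239718/ccs'          # prints 3
count_name_fields '@m84073_240426_082659_s4/250483516/ccs/16_1598'  # prints 4
```

The same check can be run across a whole file to see whether the names are uniform, for example with `awk 'NR % 4 == 1 { n = split($1, f, "/"); c[n]++ } END { for (k in c) print k, c[k] }' reads.fastq`, assuming a plain four-line-per-record FASTQ.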

armintoepfer commented 1 month ago

Please use BAM throughout your data processing pipeline. We rely on BAM tags to annotate reads.
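For anyone landing here with the same warning, a rough sketch of what "BAM throughout" could look like with the commands from the report: run lima on the skera BAM directly, and convert to FASTQ only after demultiplexing. The function name `demux_bam_then_fastq` is made up, the lima options are copied from the report rather than recommended, and the `prefix.*.bam` glob assumes that `--split-named` writes one BAM per barcode pair next to the given output name. Treat this as an untested outline, not an official workflow.

```shell
# Hypothetical wrapper: demultiplex BAM first, convert to FASTQ last.
# Arguments: <skera.bam> <barcodes.fasta> <biosample.csv> <output prefix>
demux_bam_then_fastq() {
  local in_bam="$1" barcodes="$2" sample_csv="$3" out_prefix="$4"

  # Same lima options as in the report, but BAM in and BAM out,
  # so the BAM tags are still present when lima reads the input.
  lima --hifi-preset ASYMMETRIC \
       --biosample-csv "$sample_csv" \
       --split-named --output-missing-pairs \
       "$in_bam" "$barcodes" "${out_prefix}.bam" || return 1

  # Assumption: per-barcode outputs match <prefix>.*.bam.
  local bam
  for bam in "${out_prefix}".*.bam; do
    [ -e "$bam" ] || continue              # glob matched nothing
    bam2fastq -u -o "${bam%.bam}" "$bam" || return 1
  done
}

# Usage with the file names from the report (not run here):
# demux_bam_then_fastq skera.bam kinnex16S.fasta barcode-sample-16S.csv demux
```

Doing the FASTQ conversion as the final step keeps the downstream tools that need FASTQ unchanged while avoiding the non-CCS warning path, which the warning text says is triggered for non-BAM/XML input.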