Arcadia-Science / seqqc

A Nextflow pipeline to identify quality control issues with new sequencing data.
MIT License

Expectations for how long-reads will be QCed and implemented with this workflow #11

Closed elizabethmcd closed 1 year ago

elizabethmcd commented 1 year ago

This was first brought up in this PR: https://github.com/Arcadia-Science/seqqc/pull/8#discussion_r1013540987

As is, the workflow takes FASTQ files and checks for contamination, adapters, etc. It is expected to work with both short and long reads. For Illumina short-read data, the direct deliverable from sequencing cores is (almost always) the raw FASTQs, but that sometimes isn't the case with long-read data:

  1. For PacBio, the immediate deliverable sometimes isn't FASTQ. I've seen BAM in the past, and sometimes a FASTA of the consensus sequence, depending on what the sequencing core does.
  2. For Nanopore, the last time I did a run for metagenomes, basecalling left you with 100s of tiny fastq/fast5 files, and only after removing adapters with a tool like porechop did you have a single FASTQ.
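
To make the Nanopore case concrete, here is a minimal Python sketch of collapsing many small basecaller chunks into one FASTQ before QC. It is not part of seqqc: the function names `open_maybe_gzip` and `merge_fastq_chunks` are hypothetical, and it only does the merge step, so adapter removal with a tool like porechop would still be a separate step.

```python
import gzip
from pathlib import Path


def open_maybe_gzip(path, mode):
    """Open a plain or gzip-compressed file in binary mode, based on its suffix."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, mode)
    return open(path, mode)


def merge_fastq_chunks(chunk_paths, out_path):
    """Concatenate FASTQ chunks (sorted by path) into one gzip-compressed FASTQ.

    Returns the number of chunk files merged. Assumes every chunk holds
    complete FASTQ records; no record-level validation is done here.
    """
    chunks = sorted(Path(p) for p in chunk_paths)
    with gzip.open(out_path, "wb") as out:
        for chunk in chunks:
            last_byte = b"\n"
            with open_maybe_gzip(chunk, "rb") as handle:
                # Copy in 1 MiB blocks so large chunks are never fully in memory.
                for block in iter(lambda: handle.read(1 << 20), b""):
                    out.write(block)
                    last_byte = block[-1:]
            # Guard against a chunk that lacks a trailing newline, which would
            # otherwise fuse its last record with the next chunk's first header.
            if last_byte != b"\n":
                out.write(b"\n")
    return len(chunks)
```

Something like `merge_fastq_chunks(Path("fastq_pass").glob("*.fastq.gz"), "merged.fastq.gz")` would then yield a single file to hand to the workflow; the `fastq_pass` directory name is only an example.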

For this workflow, the questions are:

  1. If the immediate deliverable for long reads in a project isn't FASTQ, do we expect users to do something outside of the workflow before doing QC?
  2. If not, what upstream processes and checks do we add so long reads can go through basically the same checks as short reads?
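
As one possible shape for the kind of upstream check question 2 asks about, the Python sketch below guesses whether a deliverable is FASTQ, FASTA, or BAM from its first bytes, so non-FASTQ input could be flagged or converted before QC. This is an illustration, not something seqqc does today, and `sniff_read_format` is a hypothetical name. It relies on BAM being gzip-compatible (BGZF) with the decompressed magic bytes `BAM\x01`, and on FASTQ and FASTA records starting with `@` and `>` respectively.

```python
import gzip
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of any gzip/BGZF file
BAM_MAGIC = b"BAM\x01"     # first four bytes of a decompressed BAM file


def sniff_read_format(path):
    """Guess 'bam', 'fastq', 'fasta', or 'unknown' from a file's first bytes.

    Limitation: an uncompressed SAM file with a header also starts with '@',
    so this sketch would report it as 'fastq'.
    """
    path = Path(path)
    with open(path, "rb") as raw:
        compressed = raw.read(2) == GZIP_MAGIC
    opener = gzip.open if compressed else open
    with opener(path, "rb") as handle:
        head = handle.read(4)
    if head == BAM_MAGIC:
        return "bam"
    if head.startswith(b"@"):
        return "fastq"
    if head.startswith(b">"):
        return "fasta"
    return "unknown"
```

A wrapper could run this on each input and either pass FASTQ straight through or stop with a message asking for conversion first.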

taylorreiter commented 1 year ago

I'm not sure about the upstream processes. seqqc is set up to work on FASTQs only, but it has run successfully on Nanopore and PacBio data.