ConesaLab / SQANTI3

Tool for the Quality Control of Long-Read Defined Transcriptomes
GNU General Public License v3.0

[FEATURE] support gzip'd input fastq files #312

Status: Closed (nick-youngblut closed this issue 2 months ago)

nick-youngblut commented 4 months ago

Is there an existing issue for this?

Have you loaded the SQANTI3.env conda environment?

Problem description

Using a gzip'd fastq input file for sqanti3_qc.py throws the following error:

Rscript (R) version 4.3.3 (2024-02-29)
Cleaning up isoform IDs...
Traceback (most recent call last):
  File "/home/nickyoungblut/dev/bfx/SQANTI3-5.2.1/./sqanti3_qc.py", line 2525, in <module>
    main()
  File "/home/nickyoungblut/dev/bfx/SQANTI3-5.2.1/./sqanti3_qc.py", line 2445, in main
    args.isoforms = rename_isoform_seqids(args.isoforms, args.force_id_ignore)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nickyoungblut/dev/bfx/SQANTI3-5.2.1/./sqanti3_qc.py", line 2131, in rename_isoform_seqids
    if h.readline().startswith('@'): type = 'fastq'
       ^^^^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
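
For context, the byte 0x8b at position 1 is consistent with the gzip magic number (0x1f 0x8b), so the traceback suggests the compressed file is being decoded as plain UTF-8 text. A minimal sketch of how that could be checked before opening the file; `is_gzipped` is a hypothetical helper name, not part of SQANTI3:

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip stream (RFC 1952)

def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC

# Demonstration on a throwaway file (the file name is illustrative only)
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "reads.fastq.gz")
    with gzip.open(gz_path, "wt") as out:
        out.write("@read1\nACGT\n+\nIIII\n")
    print(is_gzipped(gz_path))
```

Checking the content rather than the file extension would also cover compressed files that lack a `.gz` suffix, at the cost of one extra open.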

Code sample

Instead of:

with open(input_fasta) as h:
   if h.readline().startswith('@'): type = 'fastq'

...just update to:

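import gzip

# gzip.open defaults to binary mode ('rb'), so request text mode explicitly;
# otherwise readline() returns bytes and startswith('@') raises a TypeError
open_func = gzip.open if input_fasta.endswith('.gz') else open
with open_func(input_fasta, 'rt') as h:
    if h.readline().startswith('@'): type = 'fastq'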
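
A slightly fuller, self-contained sketch of the same idea, for illustration only. The helper names `open_maybe_gzip` and `detect_seq_format` are hypothetical and not taken from sqanti3_qc.py, and treating a leading `>` as FASTA is an assumption about how the surrounding code classifies input. Both branches open the file in text mode so that `readline()` returns `str`:

```python
import gzip

def open_maybe_gzip(path):
    """Open a plain or gzip-compressed file in text mode (hypothetical helper)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")  # 'rt' so that lines come back as str
    return open(path, "r")

def detect_seq_format(path):
    """Guess 'fastq' or 'fasta' from the first character of the first line."""
    with open_maybe_gzip(path) as handle:
        first_line = handle.readline()
    if first_line.startswith("@"):
        return "fastq"
    if first_line.startswith(">"):  # assumption: '>' marks a FASTA header
        return "fasta"
    raise ValueError(f"Unrecognized sequence format in {path!r}")
```

With helpers like these, a call such as `detect_seq_format("reads.fastq.gz")` would go through the same code path as an uncompressed file.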

Error

No response

Anything else?

Given the size of read files, it would be quite helpful to allow for gzip'd input. I'm quite surprised that gzip'd input is not supported.