BrooksLabUCSC / flair

Full-Length Alternative Isoform analysis of RNA

Feature Request: `Flair collapse` does not support gzip compressed reference FASTAs #374

Open maxgmarin opened 1 day ago

maxgmarin commented 1 day ago

Hello all,

I have found that the `flair collapse` step fails with `UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte` when a gzipped FASTA is provided as the reference.

The error goes away when I provide the same reference as an uncompressed FASTA file.

This isn't a major issue, but it would be nice if future versions of the tool accepted both compressed and uncompressed FASTA files.
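
For illustration, here is a minimal sketch of what transparent support could look like. It is not code from Flair: `open_maybe_gzip` is a hypothetical helper, and the sketch assumes the reference is read line by line as text. It relies on the fact that every gzip stream starts with the bytes `0x1f 0x8b`, which is consistent with the `0x8b` at position 1 in the error above.

```python
import gzip

# Every gzip stream begins with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed file for text reading.

    Hypothetical helper, not part of Flair. Detection is based on the
    file's first two bytes rather than on its extension.
    """
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt")
    return open(path, "r")
```

A loop such as `for line in open_maybe_gzip(genome_path):` would then behave the same for `ref.fa` and `ref.fa.gz`; `genome_path` is a placeholder name here.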

Below are some notes:

How did you install Flair?

  1. bioconda (e.g. conda create -n flair -c conda-forge -c bioconda flair)

What happened?


```
Writing temporary files to /tmp/tmp2cgr1c6x/
Making transcript fasta using annotated gtf and genome sequence
Traceback (most recent call last):
  File "/homes6/marin/miniforge/envs/flair_wismk/lib/python3.10/site-packages/flair/bed_to_sequence.py", line 253, in <module>
    for line in open(args.genome):
  File "/homes6/marin/miniforge/envs/flair_wismk/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
Traceback (most recent call last):
  File "/homes6/marin/miniforge/envs/flair_wismk/bin/flair", line 10, in <module>
    sys.exit(main())
  File "/homes6/marin/miniforge/envs/flair_wismk/lib/python3.10/site-packages/flair/flair.py", line 1035, in main
    status = collapse()
  File "/homes6/marin/miniforge/envs/flair_wismk/lib/python3.10/site-packages/flair/flair.py", line 526, in collapse
    subprocess.check_call([sys.executable, path+'bed_to_sequence.py', args.o+'annotated_transcripts.bed', args.g, args.annotation_reliant])
  File "/homes6/marin/miniforge/envs/flair_wismk/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
```

diekhans commented 1 day ago

This frustrates me as well. I have code I can contribute to read compressed files. It would be good to allow this for all input.

It uses a separate process to decompress, so it doesn't require Python's compression modules to be compiled in, and it might be faster because decompression runs in parallel with the reader.
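
As a rough sketch of the separate-process idea (not the code offered above, which is not shown in this thread), the reader could stream lines from an external `gzip -dc`. The function name `iter_gzip_lines` is hypothetical, and the sketch assumes a `gzip` executable is available on `PATH`.

```python
import subprocess


def iter_gzip_lines(path):
    """Yield text lines from a gzip file via an external `gzip -dc` process.

    Hypothetical helper: decompression happens in a child process, so it
    overlaps with whatever the caller does with each line.
    """
    cmd = ["gzip", "-dc", path]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield line
    # Leaving the `with` block waits for the child, so returncode is set.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
```

The trade-off is a dependency on an external binary in exchange for not needing Python's `gzip`/`zlib` support.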

Supporting FASTA files indexed with `samtools faidx` would also be good, as random access to individual sequences would allow better parallelization of flair.
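
To make the indexing point concrete, here is a small pure-Python sketch of random access through a `.fai` index, whose first five tab-separated columns are name, length, byte offset, bases per line, and bytes per line. The helpers `load_fai` and `fetch` are hypothetical, and the sketch assumes an uncompressed FASTA; a library such as pysam would normally be used instead.

```python
def load_fai(fai_path):
    """Parse a .fai index into {name: (length, offset, linebases, linewidth)}."""
    index = {}
    with open(fai_path) as handle:
        for line in handle:
            name, length, offset, linebases, linewidth = line.rstrip("\n").split("\t")[:5]
            index[name] = (int(length), int(offset), int(linebases), int(linewidth))
    return index


def fetch(fasta_path, index, name, start, end):
    """Return bases [start, end) of `name`, using 0-based half-open coordinates."""
    length, offset, linebases, linewidth = index[name]
    end = min(end, length)
    if start < 0 or start >= end:
        return ""

    def byte_pos(base):
        # Whole lines before `base`, plus its column within the line.
        return offset + (base // linebases) * linewidth + base % linebases

    first, last = byte_pos(start), byte_pos(end)
    with open(fasta_path, "rb") as handle:
        handle.seek(first)
        raw = handle.read(last - first)
    # Drop the line terminators that fall inside the requested span.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")
```

Because each call seeks directly to the requested region, separate workers can read different chromosomes or intervals without scanning the whole file.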