galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

Import of fastq files from ENA get assigned `fastqsanger` instead of `fastqsanger.gz` #6900

Open lparsons opened 5 years ago

lparsons commented 5 years ago

I believe this is due to the addition of the fastqsanger.gz filetype. The ENA is "assigning" a filetype of fastqsanger (which used to work) and Galaxy is accepting that, and even shows a correct "peek" of the fastq file in the history. However, tools (e.g. FastQC) run against the file will fail, complaining about the format (e.g. that a line does not start with the @ character).

This isn't really a Galaxy bug per se, but it is an issue with the Galaxy experience for users.
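To illustrate the mismatch described above, here is a minimal sketch of how a gzip-compressed payload could be detected regardless of the declared filetype. It is not Galaxy's actual upload code: `looks_gzipped` and `correct_declared_ext` are hypothetical helper names, and the only fact relied on is that gzip streams begin with the magic bytes `0x1f 0x8b`.

```python
import gzip

# Every gzip stream starts with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(data: bytes) -> bool:
    """Return True if the payload starts with the gzip magic bytes."""
    return data[:2] == GZIP_MAGIC


def correct_declared_ext(data: bytes, declared_ext: str) -> str:
    """Hypothetical helper: append '.gz' when the payload is compressed
    but the declared extension says it is not."""
    if looks_gzipped(data) and not declared_ext.endswith(".gz"):
        return declared_ext + ".gz"
    return declared_ext


# A single made-up fastq record, plain and compressed.
record = b"@read1\nACGT\n+\nIIII\n"
compressed = gzip.compress(record)

plain_ext = correct_declared_ext(record, "fastqsanger")
gz_ext = correct_declared_ext(compressed, "fastqsanger")
```

A tool that reads `compressed` as plain text sees `0x1f` as the first byte rather than `@`, which is consistent with the format complaint reported above.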

mvdbeek commented 5 years ago

I think we should probably disregard filetypes sent by external parties at this point. Seems we'd be better off relying on our sniffers.

lparsons commented 5 years ago

I'd be in favor of an additional flag to force override of sniffers. That way servers that aren't updated (ENA) would get the new behavior, but things that want to be specific still could. The main issue I see is that fastq sniffers generally assign type "fastq" and not "fastqsanger", rendering the files useless without a completely pointless Fastq Groomer run. Unless that behavior has changed?

martenson commented 5 years ago

We have been sniffing fastqsanger since https://github.com/galaxyproject/galaxy/pull/4237 (which obviously does not cover all cases).

mvdbeek commented 5 years ago

Yes, we sniff fastqsanger if the quality values are compatible with sanger encoding. We will also soon have a colorspace sniffer (but that data isn't much used anymore). Everything else will be flat fastq, as the Illumina and Solexa variants are not easy to discriminate.
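The idea of checking whether quality values are compatible with Sanger encoding can be sketched roughly as follows. This is an assumption-laden illustration, not the sniffer from the pull request above: `guess_fastq_variant` is a hypothetical name, records are assumed to be exactly four lines each, and the rule used is the common heuristic that a quality character below ASCII 59 can only occur in Phred+33 (Sanger) data, because the Solexa and Illumina +64 encodings start at 59 and 64 respectively.

```python
def iter_quality_lines(text: str):
    """Yield the 4th line of each record, assuming strict 4-line fastq."""
    lines = text.splitlines()
    for i in range(3, len(lines), 4):
        yield lines[i]


def guess_fastq_variant(text: str) -> str:
    """Hypothetical heuristic: return 'fastqsanger' only when the quality
    characters rule out the +64 encodings, otherwise plain 'fastq'."""
    codes = [ord(ch) for qual in iter_quality_lines(text) for ch in qual]
    if not codes:
        return "fastq"
    # Outside the printable range used by any fastq variant: stay generic.
    if min(codes) < 33 or max(codes) > 126:
        return "fastq"
    # Characters below ';' (ASCII 59) cannot be Solexa or Illumina +64.
    if min(codes) < 59:
        return "fastqsanger"
    # High-only qualities are ambiguous, so fall back to the generic type.
    return "fastq"


sanger_like = "@r1\nACGT\n+\n!5?I\n"
ambiguous = "@r1\nACGT\n+\nhhhh\n"

sanger_guess = guess_fastq_variant(sanger_like)
ambiguous_guess = guess_fastq_variant(ambiguous)
```

Reads whose qualities all sit in the high range remain ambiguous under this rule, which matches the point that the Illumina and Solexa variants are hard to discriminate.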