galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

Enhancement: Have the fastq datatype sniffers check EOF and warn the user if truncated #6775

Open jennaj opened 5 years ago

jennaj commented 5 years ago

Partial fastq uploads trip up many users. They only discover the problem when tools using the truncated dataset error out. The errors often report that something is wrong with the formatting, but in ways that vary between tools and are not always in stderr. This is difficult for end users to interpret -- we should be able to trap this case programmatically, earlier, at Upload.

Fastq format is standardized: only the last read record would need to be checked, to ensure that all four lines exist and that the sequence and quality score lengths are the same (or, in certain cases, that the sequence is exactly one base longer).

This would catch the majority of reported problems due to truncated fastq uploads. Internal formatting problems are a separate, distinct issue -- so I'm not suggesting we do full validation at Upload (for a few reasons, plus we already have a ticket for that if/when interested in tackling it).

Undecided if this should result in a green dataset or a red dataset ... probably red is best, as the data is not in a usable format and green datasets seem to be rarely expected to be problematic by users. But either way, a biologist-friendly message should be reported in stderr by the Upload tool, so they can see it in the expanded dataset (and if red, when clicking on the bug icon).

Example of a tool that reports useful info when a fastq dataset has a formatting problem (any, internal or EOF truncated): Fastq Groomer. But that tool is rarely used now... and takes time/resources/quota space to run when an earlier, direct EOF check is often enough.

Thoughts?

mvdbeek commented 5 years ago

We don't have warnings in Galaxy, we can only error, which I guess would be OK.

This would catch the majority of reported problems due to truncated fastq uploads.

Do you have numbers on how often this happens? Why does it happen? What happens for non-fastq uploads? Will people not be equally blocked for all other uploaded data? I don't think this has ever happened to me in my 7 years of using Galaxy (of course it can happen; I just want to know if maybe something is systematically wrong at another point in the chain).

mvdbeek commented 5 years ago

(or in certain cases, the sequence is exactly one base longer).

what case would that be ?

jennaj commented 5 years ago

@mvdbeek I don't know how often it actually happens overall, but it is part of reported issues about 20x a month at the .org server, in histories I actually look at in detail, and that jumps up at the very start of semesters, when we tend to have more students using Galaxy for the first time. It is a usage stumble with local-file browser uploads (common) or FTP uploads (less common, but it happens -- data in a mid-transfer state appears in the FTP upload list and can be imported into a dataset while still being transferred). We've talked about not having files show up until a transfer is complete, but that's a bit more complicated to check for: we don't know the actual full data size on the server side; only the client tracks transfer status as complete/partial (whether using command-line FTP or an FTP client). So it usually involves newer users, but even experienced users sometimes run into it (and often apologize for overlooking a transfer status or EOF QA check they would never skip when moving data around on the command line).

Fastq data is the most common datatype that presents this -- probably because it tends to be the largest file content we don't error-trap at upload (truncated BAM transfers are caught already, and few people upload SAM data).

Second most common is data from the UCSC Table Browser (which does report that the data is truncated, in text inside the dataset at the very end). These are usually GTF, MAF, or sequence data. Any extracted data > ~100k lines will have UCSC's standard warning inside the file, at the end, noting that this was not a complete transfer (truncated result). This has been reported/seen much less often in the last 2-3 years, probably because the UCSC GTFs are not used as much as before (they have scientific content issues -- specifically, gene_id and transcript_id are the same value, effectively making any analysis "by gene" actually "by transcript"). UCSC warns about this on the Table Browser web page, where they state to get any large data another way (e.g. the Downloads area or an SQL query) -- it just isn't as easy to use the table data files (many need to be transformed with UCSC's utilities first, often with more re-formatting/joins needed, to create a standardized format with the right content before uploading to Galaxy). People miss the warning, or maybe don't realize their data query will hit the limit, or are just new. ORG has many, many new and student users.

Re: why some fastq seq/qual lengths won't match: an older fastq format has an adaptor base at the start of the sequence that isn't always padded with a corresponding "#" quality score. It's very rare now, since tools don't work with that data directly anymore, data providers usually pad it (probably in part to help with their own data validation processes), and people don't sequence that way on their own anymore. Still, the data putatively can be used in Galaxy -- it just takes some manipulation first. So I don't think we should flag those as an error. The extra adaptor base would always be a G or T in the sequence, if we wanted to build that into the checking rule.
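If that legacy case were built into the rule, the length comparison might look like this (hypothetical helper; the "sequence may be one base longer, and that base is G or T" detail is taken from the comment above):

```python
def lengths_ok(seq, qual):
    """Relaxed seq/qual length rule, sketched for the legacy adaptor-base case.

    Accepts equal lengths, or a sequence exactly one base longer than the
    quality string when the leading (adaptor) base is G or T.
    """
    if len(seq) == len(qual):
        return True
    return len(seq) == len(qual) + 1 and seq[:1] in ("G", "T")
```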

Re: checking all uploaded data: we could check the last few lines of nearly any dataset to see if it appears to be truncated or has a warning from the source ("one of these lines is not like the others" -- joke, from the Sesame Street song :) ), but that's a larger ask and I didn't want to complicate this too much or hold it back for fastq (building rules for other data explodes the scope). For example, fasta would be complicated to check programmatically for a few reasons, and I don't think we should do it, or at least not right now (context rules could be applied ... and wouldn't work well in some cases anyway, so it would be better as a warning and not a failure). The exception would be data from the UCSC Table Browser with their warning present at the end of the file -- that would be trivial to check for.

So, the punchline: most any data in a structured format could be checked to see if it appears to be truncated. But I'd suggest starting with fastq and then deciding if others are worth checking.

bernt-matthias commented 5 years ago

I guess it would be better to have some hash-based upload (and download) verification mechanism. This would also cover all datatypes and would not need to make assumptions about the file content.

As a quick "fix", uploading/downloading compressed files might also help. If the file was truncated, decompressing usually does not work. (But this won't help for compressed datatypes, like fastq.gz, which are not extracted on upload.)

jennaj commented 5 years ago

Please consider for 19.05