Open hrhotz opened 4 years ago
STARsolo 2.7.5b (the current default on usegalaxy.eu) is broken (or needs a different kind of transcript annotation file?). The same inputs work with a previous tool version (e.g. 2.7.2b1).
Ack! Looks like I need to add more tests...
Some useful links:
The GFF3 sniffer is defined as 9 tab-separated fields, and its test dataset reflects that.
The GTF sniffer is defined as 9 tab-separated fields (with the first 8 matching the GFF format), with the last Group field containing semicolon-separated pairs, and its test dataset has that.
So from what I can see, it's the 9th column which is the deciding factor.
If we look at Homo_sapiens.GRCh37.75.gtf
given by Zenodo, we see:
chr1 pseudogene gene 11869 14412 . + . gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene";
chr1 processed_transcript transcript 11869 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana";
chr1 processed_transcript exon 11869 12227 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon_id "ENSE00002234944";
chr1 processed_transcript exon 12613 12721 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "2"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon_id "ENSE00003582793";
chr1 processed_transcript exon 13221 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "3"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon_id "ENSE00002312635";
chr1 transcribed_unprocessed_pseudogene transcript 11872 14412 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000515242"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-201"; transcript_source "ensembl";
chr1 transcribed_unprocessed_pseudogene exon 11872 12227 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000515242"; exon_number "1"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-201"; transcript_source "ensembl"; exon_id "ENSE00002234632";
chr1 transcribed_unprocessed_pseudogene exon 12613 12721 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000515242"; exon_number "2"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-201"; transcript_source "ensembl"; exon_id "ENSE00003608237";
chr1 transcribed_unprocessed_pseudogene exon 13225 14412 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000515242"; exon_number "3"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-201"; transcript_source "ensembl"; exon_id "ENSE00002306041";
chr1 transcribed_unprocessed_pseudogene transcript 11874 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000518655"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-202"; transcript_source "ensembl";
So the 9th column should make this a GTF file and not a GFF. Hmm, the sniffer actually might be broken...
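A minimal sketch of the 9th-column distinction described above, assuming GTF attributes look like `key "value";` and GFF3 attributes look like `key=value`. The helper name `guess_attribute_style` and both regular expressions are hypothetical; this is not Galaxy's sniffer code, and it deliberately ignores edge cases such as unquoted numeric values or semicolons inside quoted values.

```python
import re

# Hypothetical helper, not Galaxy's sniffer: it classifies the 9th column only.
GTF_PAIR = re.compile(r'^\s*\S+\s+"[^"]*"\s*$')   # key "value"
GFF3_PAIR = re.compile(r'^\s*[^=\s]+=\S.*$')      # key=value


def guess_attribute_style(line):
    """Return 'gtf', 'gff3', or None for a single tab-separated record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return None
    # Naive split: a semicolon inside a quoted value would break this.
    pairs = [p for p in fields[8].split(";") if p.strip()]
    if not pairs:
        return None
    if all(GTF_PAIR.match(p) for p in pairs):
        return "gtf"
    if all(GFF3_PAIR.match(p) for p in pairs):
        return "gff3"
    return None


gtf_line = "\t".join([
    "chr1", "pseudogene", "gene", "11869", "14412", ".", "+", ".",
    'gene_id "ENSG00000223972"; gene_name "DDX11L1";',
])
gff3_line = "\t".join([
    "chr1", "havana", "gene", "11869", "14412", ".", "+", ".",
    "ID=gene:ENSG00000223972;Name=DDX11L1",
])
print(guess_attribute_style(gtf_line))   # gtf
print(guess_attribute_style(gff3_line))  # gff3
```

The GFF3 line is made-up sample data for illustration. A real sniffer would look at more than one line before deciding.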
Unfortunately, it is a mess
to write a correct sniffer is near to impossible, unless you want to include some testing on the second columns (i.e. do we have the term "transcript" present) - but that is tricky as well
the easiest is probably to allow gff, gff3 and gtf as input ?
to write a correct sniffer is near to impossible, unless you want to include some testing on the second columns (i.e. do we have the term "transcript" present) - but that is tricky as well
this is something we could definitely do -- I've seen the 9th column tested for keywords such as "gene_id" and "transcript_id"
# Check attributes for gene_id, transcript_id
attributes = parse_gff_attributes(hdr[8])
if len(attributes) >= 2:
    if 'gene_id' not in attributes:
        return False
    if 'transcript_id' not in attributes:
        return False
so it wouldn't be too outlandish to test the second column
@mtekman Future reference / note to future self
For testing all uploads:
./run_tests.sh -integration test/integration -- -k "upload_datatype"
For testing very specific uploads:
galaxy/test/integration/test_datatype_upload.py
change the following function to:
def test_upload_datatype_auto(instance, test_data, temp_file):
    nam = test_data.datatype.__class__.__name__
    if nam == "Anndata":  # or whatever class you want to test only
        upload_datatype_helper(instance, test_data, temp_file)
OR (much much nicer way!)
replace the TEST_CASES = collect_test_data(registry)
line with:
TEST_CASES = collect_test_data(registry)
TEST_CASES = {k: v for (k, v) in TEST_CASES.items()
              if k.split('.')[-1].lower() in ["gtf", "gff"]}
where k is the full filename of the test dataset
to write a correct sniffer is near to impossible, unless you want to include some testing on the second columns (i.e. do we have the term "transcript" present) - but that is tricky as well
this is something we could definitely do -- I've seen the 9th column tested for keywords such as "gene_id" and "transcript_id"
# Check attributes for gene_id, transcript_id
attributes = parse_gff_attributes(hdr[8])
if len(attributes) >= 2:
    if 'gene_id' not in attributes:
        return False
    if 'transcript_id' not in attributes:
        return False
so it wouldn't be too outlandish to test the second column
The problem is in the code you pasted above: transcript_id is not present in the "gene" lines, so the check for transcript_id fails. For context, this code has been there for 11 years since commit 19a956aedbc8b5e9229ac8b0b8f378e487c547ee, probably following https://genome.ucsc.edu/FAQ/FAQformat.html#format4 (which mandates both gene_id and transcript_id for every non-comment line).
I will open a PR to just test for gene_id.
@nsoranzo oh I just saw your comment now
I have tried something similar with this PR: https://github.com/galaxyproject/galaxy/pull/10588
The logic I use is to only reject a GTF file if it lacks both 'gene_id' and 'transcript_id'
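To make the difference between the checks discussed in this thread concrete, here is a rough sketch comparing three policies on a "gene" line and an "exon" line from the Ensembl file quoted earlier. The function names and the simplified `parse_attributes` are hypothetical stand-ins, not Galaxy's `parse_gff_attributes` or the code in the PR.

```python
def parse_attributes(column):
    """Parse a GTF attribute column into a dict (simplified, hypothetical)."""
    attributes = {}
    for pair in column.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        key, _, value = pair.partition(" ")
        attributes[key] = value.strip().strip('"')
    return attributes


def strict_check(attributes):
    # Both keys required, as in the check quoted earlier in the thread.
    return "gene_id" in attributes and "transcript_id" in attributes


def gene_id_only_check(attributes):
    # Require gene_id only.
    return "gene_id" in attributes


def either_key_check(attributes):
    # Reject only when both keys are missing.
    return "gene_id" in attributes or "transcript_id" in attributes


gene_attrs = parse_attributes('gene_id "ENSG00000223972"; gene_name "DDX11L1";')
exon_attrs = parse_attributes(
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1";'
)
for name, attrs in [("gene", gene_attrs), ("exon", exon_attrs)]:
    print(name, strict_check(attrs), gene_id_only_check(attrs), either_key_check(attrs))
# gene False True True
# exon True True True
```

Only the strict policy rejects the Ensembl "gene" line; the two relaxed policies differ only for lines that carry transcript_id without gene_id.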
If we don't require both gene_id and transcript_id for GTF validation (sniffing) ... and put this out as an example in training material for valid GTF format, that is a problem. Mappers may work OK with a poorly formatted GTF, but most downstream tools are very sensitive about content. More tools accept (and require) GTFs than GFF3s (or some hybrid that tends to default to GFF sniffing).
An out-of-specification reference annotation is one of the most common reasons, or actually the most common reason, for odd results or outright tool errors. Mismatched genome inputs come in second. Even if sniffing is modified, it won't fix the root problems (and there have been discussions about that -- including: recognize common format issues, split up the header vs content at Upload, assign the best-guess datatype to content post-fix and save/hide the header, etc). Both should also be a wrapper-level check imho, before bothering with tool execution and an eventual odd failure, and should gracefully report a meaningful error message directly in stderr within the error dataset(s) if it is not a common format issue. Point to FAQs or help that we integrate (it is always better if users are not redirected away from the core application).
Any header line will also cause validation/sniffing to fail, and will cause tools to fail if the user changes the datatype (to get the tool to accept the input) but the content is still problematic. The warning about mismatched datatypes vs data content is not noticed, or rather, likely not understood. Yet many common data providers add header lines to GTF data for provenance/versioning purposes. The file ends up as GFF. The user changes it to GTF. Mapping is usually OK, downstream tools fail, and the user asks for help and has to start completely over after "fixing" the ref anno.
I know this isn't easy .. but I agree it is more than a papercut. Tool wrappers that consume GFF/GTF/GFF3 need to catch or fix common out-of-specification cases, with a biologist-understandable error message if a case is caught but not auto-fixable.
Is there a reason why tool wrappers cannot strip/ignore header lines in GTFs before submitting the data to tools during execution? If they all did that, we could relax the sniffing. And save users extra steps (remove beginning of file; Select not matching regex ">#" or ">$"; etc).
Same for GFF3 -- extra comment lines at the start are the biggest issue. Blank lines and interspaced comments lines are second. Couldn't both be handled by tool wrappers at runtime?
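As a rough illustration of the runtime clean-up suggested above, the sketch below drops leading headers, interspersed comment lines, and blank lines before the records would be handed to a tool. The function name `strip_comments_and_blanks` and the sample data are assumptions for illustration; this is not an existing Galaxy wrapper feature. Note that it drops every line starting with `#`, including GFF3 directives such as `##gff-version 3`, which some tools may rely on.

```python
import io


def strip_comments_and_blanks(lines):
    """Yield only record lines, skipping blank lines and '#' comment lines."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        yield line


# Made-up sample: one header line, one blank line, one interspersed comment.
raw = io.StringIO(
    "#!genome-build GRCh37\n"
    "\n"
    'chr1\tpseudogene\tgene\t11869\t14412\t.\t+\t.\tgene_id "ENSG00000223972";\n'
    "# an interspersed comment\n"
    'chr1\tprocessed_transcript\texon\t11869\t12227\t.\t+\t.\tgene_id "ENSG00000223972";\n'
)
cleaned = list(strip_comments_and_blanks(raw))
print(len(cleaned))  # 2
```

Because the function is a generator, it can filter a large annotation file line by line without loading it into memory.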
I think it is important that we make Galaxy much easier to use with regard to ref annotation, however that is achieved. Big content problems -- yes, those should error, but not with some odd python or other error ("job exceeds memory allocation" or "exceeds runtime" are very common). For small format issues that are very common and known by all as gotchas, we should seriously consider what can be done about making those work better in Galaxy, for both practical reasons and to make end-users happier.
@jennaj So you're suggesting we relax the sniffers to be more flexible with the annotation files (which might have header issues), but to also notify the user when the annotation file is not to spec (to save them from downstream issues) -- yep sounds good to me.
I'm not sure stripping headers before tool execution will work though, as I think STAR requires certain header lines to function properly (https://github.com/alexdobin/STAR/issues/988) and I imagine other mappers will too.
For a paper cut, I think we could implement the user notification for out-of-spec files. Beyond that I'm not certain
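One possible shape for such a notification, sketched under assumptions: a function that scans a GTF and collects human-readable messages about header lines, blank lines, wrong field counts, and missing gene_id or transcript_id attributes. The name `report_gtf_issues`, the message wording, and the `max_messages` cap are all hypothetical; nothing like this is claimed to exist in Galaxy.

```python
def report_gtf_issues(lines, max_messages=10):
    """Collect readable warnings for common out-of-spec GTF content."""
    messages = []
    for number, line in enumerate(lines, start=1):
        if len(messages) >= max_messages:
            break  # keep the report short for the user
        text = line.rstrip("\n")
        if not text.strip():
            messages.append(f"line {number}: blank line")
            continue
        if text.startswith("#"):
            messages.append(f"line {number}: comment or header line")
            continue
        fields = text.split("\t")
        if len(fields) != 9:
            messages.append(
                f"line {number}: expected 9 tab-separated fields, found {len(fields)}"
            )
            continue
        # Simple substring test on the attribute column, not a full parser.
        for key in ("gene_id", "transcript_id"):
            if f'{key} "' not in fields[8]:
                messages.append(f"line {number}: attribute '{key}' is missing")
    return messages


sample = [
    "#!genome-build GRCh37\n",
    'chr1\tpseudogene\tgene\t11869\t14412\t.\t+\t.\tgene_id "ENSG00000223972";\n',
    "chr1\tbroken\n",
]
for message in report_gtf_issues(sample):
    print(message)
# line 1: comment or header line
# line 2: attribute 'transcript_id' is missing
# line 3: expected 9 tab-separated fields, found 2
```

Returning messages instead of raising lets the caller decide whether an issue is a warning shown to the user or a hard failure.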
With https://github.com/galaxyproject/galaxy/pull/10588 (already in 20.09) we now require only gene_id in the attributes column, which allows GTF files from Ensembl to be sniffed.
If this is problematic for some tools, we can create a different "Ensembl GTF" datatype.
Update: See https://github.com/galaxyproject/galaxy/issues/10875
This change impacted a few built-in functions, several tools, and leads to a larger discussion about input validation in general.
looks like there are two independent issues here: (ping @jennaj @mtekman @blankenberg )