STARsolo 2.7.5b (Tutorial Feedback #989 Oct 29th 2020)

hrhotz commented 4 years ago

looks like there are two independent issues here: (ping @jennaj @mtekman @blankenberg )

Yes, Homo_sapiens.GRCh37.75.gtf is not correctly auto detected as 'gtf' when uploaded from zenodo. I have seen this before (on other occasions), and it needs a fix in the sniffer (probably more than just a 'paper cut'?)
STARsolo 2.7.5b (the current default on usegalaxy.eu) is broken (or need a different kind of transcripts annotation file?). The same inputs work with a previous tool version (e.g.: 2.7.2b1 )

mtekman commented 4 years ago

STARsolo 2.7.5b (the current default on usegalaxy.eu) is broken (or need a different kind of transcripts annotation file?). The same inputs work with a previous tool version (e.g.: 2.7.2b1 )

Ack! Looks like I need to add more tests...

mtekman commented 4 years ago

Some useful links:

The GFF3 sniffer is defined as 9 tab-seperated fields, and its test dataset reflects that.
The GTF sniffer is defined as 9 tab-seperated fields (with the first 8 matching the GFF format), with the last Group field containing semicolon pairs, and its test dataset has that.

So from what I can see, it's the 9th column which is the deciding factor.

If we look at Homo_sapiens.GRCh37.75.gtf given by Zenodo, we see:

chr1    pseudogene  gene    11869   14412   .   +   .   gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene";
chr1    processed_transcript    transcript  11869   14409   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana";
chr1    processed_transcript    exon    11869   12227   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon_id "ENSE00002234944";
chr1    processed_transcript    exon    12613   12721   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "2"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon_id "ENSE00003582793";
chr1    processed_transcript    exon    13221   14409   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "3"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon_id "ENSE00002312635";
chr1    transcribed_unprocessed_pseudogene  transcript  11872   14412   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000515242"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-201"; transcript_source "ensembl";
chr1    transcribed_unprocessed_pseudogene  exon    11872   12227   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000515242"; exon_number "1"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-201"; transcript_source "ensembl"; exon_id "ENSE00002234632";
chr1    transcribed_unprocessed_pseudogene  exon    12613   12721   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000515242"; exon_number "2"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-201"; transcript_source "ensembl"; exon_id "ENSE00003608237";
chr1    transcribed_unprocessed_pseudogene  exon    13225   14412   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000515242"; exon_number "3"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-201"; transcript_source "ensembl"; exon_id "ENSE00002306041";
chr1    transcribed_unprocessed_pseudogene  transcript  11874   14409   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000518655"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-202"; transcript_source "ensembl";

So the 9th column, should make this a GTF file and not a GFF. Hmm, the sniffer actually might be broken...

hrhotz commented 4 years ago

Unfortunately, it is a mess

GTF is GFF2 (https://www.ensembl.org/info/website/upload/gff.html)
GFF3 can also contain semicolon separated pairs (http://www.ensembl.org/info/website/upload/gff3.html) see also: http://gmod.org/wiki/GFF3

to write a correct sniffer is near to impossible, unless you want to include some testing on the second columns (i.e. do we have the term "transcript" present) - but that is tricky as well

the easiest is probably to allow gff, gff3 and gtf as input ?

mtekman commented 4 years ago

to write a correct sniffer is near to impossible, unless you want to include some testing on the second columns (i.e. do we have the term "transcript" present) - but that is tricky as well

this is something we could definitely do -- I've seen the 9th column tested for keywords such as "gene_id" and "transcript_id"

                # Check attributes for gene_id, transcript_id
                attributes = parse_gff_attributes(hdr[8])
                if len(attributes) >= 2:
                    if 'gene_id' not in attributes:
                        return False
                    if 'transcript_id' not in attributes:
                        return False

so it wouldn't be too outlandish to test the second column

mtekman commented 4 years ago

@mtekman Future reference / note to future self

For testing all uploads:

 ./run_tests.sh -integration test/integration -- -k "upload_datatype"

For testing very specific uploads:

in galaxy/test/integration/test_datatype_upload.py change the following function to:

def test_upload_datatype_auto(instance, test_data, temp_file):
    nam = test_data.datatype.__class__.__name__
    if nam == "Anndata":   # or whatever class you want to test only
        upload_datatype_helper(instance, test_data, temp_file)

OR (much much nicer way!)

replace the TEST_CASES = collect_test_data(registry) line with:

 TEST_CASES = collect_test_data(registry)
 TEST_CASES = { k:v for (k,v) in TEST_CASES.items()
               if k.split('.')[-1].lower() in ["gtf","gff"] }

where k is the full filename of the test dataset

nsoranzo commented 4 years ago

to write a correct sniffer is near to impossible, unless you want to include some testing on the second columns (i.e. do we have the term "transcript" present) - but that is tricky as well

this is something we could definitely do -- I've seen the 9th column tested for keywords such as "gene_id" and "transcript_id"
                # Check attributes for gene_id, transcript_id
                attributes = parse_gff_attributes(hdr[8])
                if len(attributes) >= 2:
                    if 'gene_id' not in attributes:
                        return False
                    if 'transcript_id' not in attributes:
                        return False
so it wouldn't be too outlandish to test the second column

The problem is in the code you pasted above, transcript_id is not present in the "gene" lines, so the check for transcript_id fails. For context, this code has been there for 11 years since commit 19a956aedbc8b5e9229ac8b0b8f378e487c547ee , probably following https://genome.ucsc.edu/FAQ/FAQformat.html#format4 (which mandates both gene_id and transcript_id for every non-comment line).

I will open a PR to just test for gene_id .

mtekman commented 4 years ago

@nsoranzo oh I just saw your comment now

I have tried something similar with this PR: https://github.com/galaxyproject/galaxy/pull/10588

The logic I use is to only reject a GTF file if it lacks both 'gene_id' and 'transcript_id'

jennaj commented 3 years ago

If we don't require both gene_id and transcript_id for GTF validation (sniffing) ... and put this out as an example in training material for valid GTF format, that is a problem. Mappers may work Ok with a poorly formatted GTF but most downstream tools are very sensitive about content. More tools accept (and require) GTFs than GFF3s (or some hybrid that tend to default to GFF sniffing).

An out-of-specification reference annotation is one of the most common reason, or actually the most common reason, for odd results or outright tool errors. Mismatched genome inputs come in second. Even if sniffing is modified it won't fix the root problems (and there have been discussions about that -- including: recognize common format issues, split up the header vs content at Upload, assign the best-guess datatype to content post-fix and save/hide the header, etc). Both should also be a wrapper-level check imho, before bothering with tool execution and eventual odd-failure, and should gracefully report a meaningful error message directly in stderr within the error dataset(s) if not a common format issue. Point to FAQs or help we integrate (is always better if users are not redirected away from the core application).

Any header line will also cause validation/sniffing to fail. And cause tools to fail if the user changes the datatype (to get the tool to accept the input) but the content is still problematic. The warning about mismatched datatypes vs data content is not noticed, or rather, likely not understood. Yet many common data providers add in header lines into GFT data for provenance/versioning purposes. Ends up as GFF. User changes to GTF. Mapping usually Ok, downstream tools fail, ask for help, have to start completely over from the start after "fixing" the ref anno.

I know this isn't easy .. but I agree it is more than a papercut. Tool wrappers that consume GFF/GTF/GFF3 need to catch or fix common out-of-specification cases, with a biologist understandable error message if caught-not-auto-fixable.

Is there a reason why tool wrappers cannot strip/ignore header lines in GTFs before submitting the data to tools during execution? If they all did that, we could relax the sniffing. And save users extra steps (remove beginning of file; Select not matching regex ">#" or ">$"; etc).

Same for GFF3 -- extra comment lines at the start are the biggest issue. Blank lines and interspaced comments lines are second. Couldn't both be handled by tool wrappers at runtime?

I think is important that we make Galaxy much easier to use with regard to ref annotation. However achieved. Big content problems --yes, should error, but not with some odd python or other error ("job exceeds memory allocation" or "exceeds runtime" very common). For small format issues that are very common and known by all as gotchas, we should seriously consider what can be done about making those work better in Galaxy, for both practical reasons and to make end-users happier.

mtekman commented 3 years ago

@jennaj So you're suggesting we relax the sniffers to be more flexible with the annotation files (which might have header issues), but to also notify the user when the annotation file is not to spec (to save them from downstream issues) -- yep sounds good to me.

I'm not sure stripping headers before tool execution will work though, as I think STAR requires certain header lines to function properly (https://github.com/alexdobin/STAR/issues/988) and I imagine other mappers will too.

For a paper cut, I think we could implement the user notification for out-of-spec files. Beyond that I'm not certain

nsoranzo commented 3 years ago

With https://github.com/galaxyproject/galaxy/pull/10588 (already in 20.09) we now require only gene_id in the attributes column, which allows GTF files from Ensembl to be sniffed. If this is problematic for some tools, we can create a different "Ensembl GTF" datatype.

jennaj commented 3 years ago

Update: See https://github.com/galaxyproject/galaxy/issues/10875

This change impacted a few built-in functions, several tools, and leads to a larger discussion about input validation in general.

galaxyproject / training-material

STARsolo 2.7.5b (Tutorial Feedback #989 Oct 29th 2020) #2103