Protein alignments never utilized in pipeline b/c exonerate output is incompatible with `exonerate2hints.pl`

ComparativeGenomicsToolkit / Comparative-Annotation-Toolkit

Apache License 2.0


Protein alignments never utilized in pipeline b/c exonerate output is incompatible with `exonerate2hints.pl` #265

Open nspope opened 3 years ago

nspope commented 3 years ago

Hi, thanks for developing this nifty tool.

https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/blob/fc1623da5df1309d2e2f0b9bb0363aaab84708f4/cat/hints_db.py#L309-L310

The exonerate call linked above outputs a GFF whose coordinates are relative to the query sequence, that is, the input proteins (with strand, etc. set to '.'). This GFF has no exon features and can't be parsed by `exonerate2hints.pl`, which silently outputs an empty file. As a result, the protein alignments never make it into Augustus, and nothing tells the user that this step failed.

A usable GFF is generated by replacing `--showquerygff` with `--showtargetgff` in the exonerate call linked above.
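As a rough illustration of the change, here is a minimal sketch of an exonerate argument list with the GFF side switched to the target. The function name `build_exonerate_cmd` and its parameters are hypothetical, and the other flags are only an assumption about what a protein2genome call typically looks like; this is not copied from `hints_db.py`.

```python
def build_exonerate_cmd(query_fasta, target_fasta, gff_side="target"):
    """Build an exonerate protein2genome command as an argument list.

    Hypothetical helper for illustration only; the real invocation in
    cat/hints_db.py may use different flags.
    """
    if gff_side not in ("query", "target"):
        raise ValueError("gff_side must be 'query' or 'target'")
    return [
        "exonerate",
        "--model", "protein2genome",
        "--showvulgar", "no",
        "--showalignment", "no",
        # "target" yields GFF in genome coordinates (the proposed fix);
        # "query" yields GFF in protein coordinates (the current behaviour).
        f"--show{gff_side}gff", "yes",
        "--query", query_fasta,
        "--target", target_fasta,
    ]


if __name__ == "__main__":
    print(" ".join(build_exonerate_cmd("proteins.fa", "genome.fa")))
```

The list can be handed to `subprocess.run` as-is, with the GFF captured from stdout.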

Nate

nspope commented 3 years ago

Just to follow up: even with the fix above, hints will be ignored unless the CGP extrinsic config file is modified appropriately (e.g. it has XNT fields for each [GENERAL] table and species names after the corresponding [GROUP]). That doesn't appear to be the case currently (though I haven't looked carefully).

Just a PSA for people using the current commit; protein alignments are costly to generate and this information isn't being used anywhere that I can see.

marcopessoa commented 3 months ago

Is this true for the current commit? Our exonerate alignments take days, and it would be good to know whether they are used or not.

nspope commented 3 months ago

The exonerate invocation linked above looks the same in the most recent commit. Whether this is still an issue, I have no idea. The best way to tell would be to check the output of `exonerate2hints.pl` (assuming that's still being used in the pipeline).
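One way to run that check, sketched in Python: count the feature types in a GFF-like file and treat a file with no feature lines as unusable. The helper names and the file paths are hypothetical, and where CAT writes its intermediate files is not something this sketch knows; point it at whatever exonerate and `exonerate2hints.pl` produced in your run.

```python
from collections import Counter


def count_gff_features(path):
    """Count values in the feature column (column 3) of a GFF-like file.

    Comment lines, blank lines, and lines without 9 tab-separated fields
    (such as exonerate's non-GFF log lines) are skipped.
    """
    counts = Counter()
    with open(path) as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 9:
                counts[fields[2]] += 1
    return counts


def hints_look_usable(path):
    """Return True if the file contains at least one feature line."""
    return sum(count_gff_features(path).values()) > 0
```

Running `count_gff_features` on the raw exonerate GFF shows whether any `exon` features are present, and `hints_look_usable` on the hints file flags the silent empty-output case described earlier in the thread.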