Use external hints file

Adamtaranto commented 6 years ago

I'm using BRAKER2 to generate Augustus hints from RNA and protein alignments. I'd prefer to use these hints, as the protein-to-genome alignments found by GenomeThreader (BRAKER2) are usually better than those from BLAT (CAT).

Can I switch off CAT's hint calling steps and import custom hints for each genome instead?

If I process IsoSeq alignments externally, will CAT still ensure that any IsoSeq-supported transcripts are preserved in the final annotation?

ifiddes commented 6 years ago

You can import custom hints. It has been on my to-do list to make this explicit, but right now it is possible by placing a hints file for each genome in $work_dir/hints_database, named $genome.extrinsic_hints.gff. These files will be seen by the hints pipeline and compiled into the sqlite database that AUGUSTUS is expecting.
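A minimal sketch of placing a custom hints file where the hints pipeline looks for it. Only the directory name and the `$genome.extrinsic_hints.gff` naming come from the description above; the helper name `install_hints` and its arguments are hypothetical and not part of CAT.

```python
import shutil
from pathlib import Path


def install_hints(work_dir, genome, custom_hints):
    """Copy a custom hints GFF to $work_dir/hints_database/$genome.extrinsic_hints.gff.

    Hypothetical helper for illustration; CAT does not ship this function.
    """
    hints_dir = Path(work_dir) / "hints_database"
    hints_dir.mkdir(parents=True, exist_ok=True)
    target = hints_dir / f"{genome}.extrinsic_hints.gff"
    shutil.copyfile(custom_hints, target)
    return target
```

The genome name passed in should match the genome name CAT uses, since the file name is what ties the hints to a genome.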

The hints need to be in the right format (the AUGUSTUS format). This means that 'exon' hints are broken up into exonpart (ep) hints with the keys src=W;mult=$N, where $N is the number of times the hint is seen (mult is not required, but it is used if present). 'Intron' hints need to be labeled 'intron' with the keys src=E;mult=$N. In total, CAT accepts the following hint types: M (manual; these get added from transMap), E (exon/intron), W (wiggle), RM (repeat mask), PB (PacBio). See the default augustusCGP config file for more details: https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/blob/master/augustus_cfgs/cgp_extrinsic_template.cfg. That file has three different parameterizations for weights, depending on whether the genome in question has annotation hints, intron-only hints, or full RNA-seq (E and W) hints.
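As an illustration of the intron case, here is a rough sketch that collapses repeated introns into one hint line each with src=E;mult=$N. It assumes the usual nine-column, tab-separated GFF layout for AUGUSTUS hints; the function name, the `source` column value, and the input tuple shape are made up for the example.

```python
from collections import Counter


def collapse_intron_hints(introns, source="ext"):
    """Collapse repeated introns into hint lines carrying src=E;mult=N.

    `introns` is an iterable of (seqname, start, end, strand) tuples with
    1-based inclusive coordinates, one tuple per supporting alignment.
    The `source` column value is an arbitrary placeholder.
    """
    counts = Counter(introns)
    lines = []
    for (seqname, start, end, strand), n in sorted(counts.items()):
        attrs = f"src=E;mult={n}"
        fields = [seqname, source, "intron", str(start), str(end),
                  ".", strand, ".", attrs]
        lines.append("\t".join(fields))
    return lines
```

Each returned string is one line of the hints GFF; writing them out joined by newlines gives a file that can be checked against the hint types listed in the config.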

What do you mean by "process IsoSeq alignments externally"? Do you mean running them through something like ICE to produce an annotation GFF? Those can be supplied as an annotation and will be used by AugustusCGP, but they will not necessarily make it into the final annotation set, depending on what AUGUSTUS does with the data. In that mode CAT won't know they are derived from IsoSeq; it will treat them as an annotation set.

In the standard augPB mode, you provide mapped FLNCs and the BAM is used to generate hints that are combined with any other hints into that same hints file as above, and then are used to run AUGUSTUS in single-genome mode with the option to hint for alternative isoforms turned on. Those transcripts are then incorporated if they are supported and different from transMap derived transcripts.

So in short, the answer is probably no. It is also on my to-do list to allow external annotation sets to be added directly to the consensus process (Issue #92).

Let me know if you have any other questions!

Adamtaranto commented 6 years ago

Hi Ian,

Thanks for that. I have a few follow-up questions.

1) If I include a custom $genome.extrinsic_hints.gff file will CAT merge those hints with ones that it generates from other inputs (protein, RNA-seq bam, etc.), or skip those steps entirely and only use the hint file provided?

2) It seems that CAT does not use hints from the IsoSeq bamfile for AugustusCGP. Is that correct?

3) Will regions of the non-reference genomes that do not align to the reference, but have their own hints, be annotated by CAT?

My primary objective is to make sure all IsoSeq supported transcripts are represented in the final annotation as variants of appropriate parent genes so that they are available for isoform switching analyses. I'd be keen to see #92 added to CAT.

Adam

ifiddes commented 6 years ago

> If I include a custom $genome.extrinsic_hints.gff file will CAT merge those hints with ones that it generates from other inputs (protein, RNA-seq bam, etc.), or skip those steps entirely and only use the hint file provided?

Unfortunately no. At this point it will skip those steps entirely. As a workaround, you could let the hints step finish, kill the pipeline, concatenate your file with the CAT-generated file, then let it proceed again. (If the step finishes in its entirety, you would want to delete the sqlite database; the pipeline will detect the presence of the GFF files and the absence of the database and rebuild it.)
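The concatenate-and-rebuild step could look something like the sketch below. The database file name is not given in this thread, so its path is passed in as an optional argument rather than guessed; the function name and argument names are hypothetical.

```python
from pathlib import Path


def merge_hints_and_reset_db(cat_hints, custom_hints, database=None):
    """Append custom hints to the CAT-generated GFF, then optionally delete
    the hints database so the pipeline rebuilds it from the GFF files.

    Returns the number of hint lines appended.
    """
    cat_hints = Path(cat_hints)

    # Make sure the existing file ends with a newline before appending.
    needs_newline = False
    if cat_hints.stat().st_size > 0:
        with cat_hints.open("rb") as fh:
            fh.seek(-1, 2)
            needs_newline = fh.read(1) != b"\n"

    appended = 0
    with cat_hints.open("a") as out, open(custom_hints) as extra:
        if needs_newline:
            out.write("\n")
        for line in extra:
            # Skip blank lines and GFF comment lines.
            if not line.strip() or line.startswith("#"):
                continue
            out.write(line if line.endswith("\n") else line + "\n")
            appended += 1

    if database is not None:
        # Caller supplies the database path; requires Python 3.8+.
        Path(database).unlink(missing_ok=True)
    return appended
```

Run this only after the pipeline has been stopped, so that nothing is reading the GFF or the database while they are modified.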

> It seems that CAT does not use hints from the IsoSeq bamfile for AugustusCGP. Is that correct?

No, it should be using them. AugustusCGP talks directly to the hints database, which includes the IsoSeq hints.

> Will regions of the non-reference genomes that do not align to the reference, but have their own hints, be annotated by CAT?

Only in the AugustusPB mode. Another feature I have thought about adding is a generalized single-genome AUGUSTUS similar to what MAKER2 does. AugustusCGP relies on alignments being converted to MAF format, which precludes regions that did not align to the reference from being considered.

> My primary objective is to make sure all IsoSeq supported transcripts are represented in the final annotation as variants of appropriate parent genes so that they are available for isoform switching analyses. I'd be keen to see #92 added to CAT.

This makes sense. I think a first step to accomplishing this is just seeing what AugustusPB produces given the hints. Anything AugustusPB produces that is actually supported by the IsoSeq data will be included in the final set as long as it doesn't duplicate what transMap produced, and will be assigned a parent where possible.

ComparativeGenomicsToolkit / Comparative-Annotation-Toolkit

Use external hints file #96