Gaius-Augustus / TSEBRA

TSEBRA: Transcript Selector for BRAKER
46 stars 5 forks source link

Complementing curated genes with TSEBRA + other concerns #13

Closed nicmoya closed 2 years ago

nicmoya commented 2 years ago

Hi Lars,

My team has recently curated over 20k genes for a novel Caenorhabditis genome we have assembled. I noticed that we underperform in sensitivity relative to TSEBRA annotations (generated by integrating BRAKER1, BRAKER2). I was wondering if there was a way to integrate external features from GFF/GTF that was not generated using the BRAKER platform into TSEBRA (we can perhaps match the BRAKER/AUGUSTUS gtf format, but we cannot generate hints as these are manual curations).

Additionally, I wanted to ask if the AUGUSTUS scripts for conversion to GFF (gtf2gff.pl) and translation of proteins (getAnnoFasta.pl) were appropriate / can handle the TSEBRA output.

Lastly, how would I go about incorporating GeneMarkS-T predictions from long-read RNAseq? I ran the GeneMarkS-T script on my HQ PacBio RNA transcripts, but I am unsure how to utilize the output GFF from GeneMark with TSEBRA (particularly because of the odd format).

Best, Nicolas Moya

nicmoya commented 2 years ago

I tried using the solution in #9 to convert the TSEBRA GTF to GFF using gtf2gff.pl but it seems that the output GFF remains identical to the input TSEBRA GTF. No errors, simply no changes to the file. What could potentially be the issue here?

LarsGab commented 2 years ago

Hi Nicolas,

TSEBRA can use any GTF/GFF file as an input gene set as long as it has the 'transcript_id' attribute in its last column, it does not have to be from BRAKER. However, it needs extrinsic evidence in form of intron positions to compare all candidate transcripts for each locus and to filter the transcripts. What gene sets/external features do you want to combine with TSEBRA? Do you want to combine them (select a subset of all input transcripts) or do you want to merge them (keep all transcripts from the input gene sets)?

Regarding gtf2gff.pl: The output of TSEBRA is already in the standard format of the gtf2gff.pl output, that's why nothing has changed. If you need your gene set in gff3 format, you can use the option --gff3 of gtf2gff.pl to reformat your TSEBRA output to gff3. The script getAnnoFasta.pl should work with a TSEBRA output. Previously, I always used the script gtf2aa.pl to extract the protein sequences from a TSBERA gene set.

Regarding GeneMarkS-T: You can use this script to generate a GTF file from your GeneMarkS-T and your long-read transcripts, which can be used as an input gene set for TSEBRA. If you want to integrate long-reads with TSEBRA, you might want to take a look at our preliminary protocol for long-read integration using BRAKER and TSEBRA. We introduced this protocol as part of a talk we presented at PAG this year (you can find the slides here).

Best, Lars

minhasbushra commented 2 years ago

@LarsGab Related to getAnnoFasta.pl, I am trying to get it to work with tsebra output, but it doesn't seem to work. It runs without an error and there is no output. By any chance can you share the command.

Thanks for your help. Bushra

nicmoya commented 2 years ago

Hi Lars, thanks for the prompt reply.

Thank you for sharing info on the long-read branches of BRAKER and TSEBRA. I did a quick read-through, and I'm slightly confused about the purpose of the long-read branch for BRAKER. If the long-read gene models are generated using GeneMarkS-T, and then integrated with BRAKER1 and BRAKER2 predictions using the long-read branch of TSEBRA, could I just simply use the official BRAKER release to make the BRAKER1 and BRAKER2 gene prediction sets and associated hints files?

I was able to successfully use gtf2gff.pl and gtf2aa.pl. Thanks for your suggestions.

Regarding the gene sets I want to integrate with TSEBRA, we have manually curated a set of genes by comparing gene models from BRAKER1 against StringTie+TransDecoder long-read based gene models. We loaded BRAKER1 and StringTie+TransDecoder models along with short- and long-read RNA-seq alignments into Apollo, and manually inspected each locus for inconsistencies between the predicted models and the underlying RNA-seq evidence. We made manual edits on each gene and generated a single non-redundant annotation, but the curation process was community based and some error may have been introduced (we see loss of some orthologs after undergoing curation). We thought we could potentially complement our manual curations using TSEBRA. We wanted to leverage BRAKER1 and BRAKER2 models against our manual curations using short- and long-read RNA-seq intron hints to identify the best gene models from each set. This may not be possible, as the extrinsic evidence used to curate our genes comes from two separate sources (long and short-read RNA-seq). What do you think?

LarsGab commented 2 years ago

@Bushra You can use following command: getAnnoFasta.pl --seqfile genome.fa tsebra.gtf However, I noticed that it only gives the coding sequences as output and not the protein sequences (I don't know why it doesn't translate them). But you could use gtf2aa.pl, if you want the protein sequences: gtf2aa.pl genome.fa tsebra.gtf protein_output.fa

@Nicolas The long-read branch is only there for developing reasons. At the moment it only differs compared to the main branch in some additional scripts and not in the actual BRAKER code. You can use your normal BRAKER version/branch to run BRAKER1 and BRAKER2.

To summarize: If I understand it correctly you have a manually curated gene set that is very high specificity but it isn't sensitive enough. Additionally, you got RNA-Seq (long and short reads) and you could probably get a protein database from related species (e.g. the clade of your target species from OrthoDB). In my opinion, you could try using TSEBRA to combine your manually curated gene set with: a BRAKER1 run using the short reads, a BRAKER2 run using the protein database, and your long-read gene set from GeneMarkS-T. You could do this as described in the long-read protocol, with the only difference being that you add the manually curated gene set to the input of TSEBRA. However, TSEBRA might decide to exclude some of the transcripts from your manual gene set, which would not be desirable since they have very high specificity. For testing purposes, I implemented an option for TSEBRA such that TSEBRA can enforce all transcripts from an input gene. If you want to guarantee that all transcripts of your manual gene set are included in the TSEBRA result, I can give you access to a version of TSEBRA with this option.

nicmoya commented 2 years ago

@LarsGab yes, that's right! that's exactly the datasets we are working with. And yes, if possible, I would like access to the version of TSEBRA that can force a keep-all option for the manually curated gene set.

I would also like to test the current long-read branch while integrating my curated gene set. Although TSEBRA may remove some manual curations, it may point to some genes where our curated model is less faithful to the RNA-seq data. For this approach, would I simply run:

tsebra.py -g b1.augustus.hints.gtf,b2.augustus.hints.gtf,curated.genes.gtf -e b1.hintsfile.gff,b2.hintsfile.gff -l ong_read_protocol/gmst.global.gtf -c long_reads.cfg -o tsebra.gtf

Lastly, I recently tested the long_read protocol using our own data (excluding the manual curations). First I performed a default TSEBRA run to integrate BRAKER1 and BRAKER2 models. Then I performed a long-read protocol run to integrate BRAKER1, BRAKER2 and GeneMarkS-T models. The long-read protocol yielded a set of protein sequences with slightly lower BUSCO (nematoda_odb10) completeness (97.6%) relative to the default BRAKER1+2 run (99%). Although not a direct measure of sensitivity, in my experience reductions in BUSCO score are always paired with reductions in sensitivity. I could further investigate this result (I can run the same pipeline in C. elegans where I can use the reference annotation as a truth set), but it might just be that the long_read.cfg might be tailored to the organism you were working with when developing the long-read branch. Could you perhaps describe what informed your decisions on parameter changes when updating the config file for the long-read protocol?

LarsGab commented 2 years ago

Hi, I will upload a TSEBRA version with the keep-all option by the end of this week.

Your command line looks correct and it should work.

You might be correct that the configuration of the long-read version of TSEBRA isn't fitted for all species as the amount of long-read data available during development was very limited. If you want to adjust the configuration, I would suggest that you try different values for intron_support, e_1, e_4, e_5, e_6.
The _support values in the config file specify the minimum fraction that has to be supported by extrinsic evidence. If a transcript has lower evidence support in start/stop-codon and intron, it will be filtered out. For the current long read configuration, this means that all transcripts must have either all introns or their stop supported. I would suggest decreasing intronsupport if you want to change anything here. This can be especially helpful if you think that the sensitivity at the gene level is not high enough. The e parameter are thresholds that are used to allow some difference between the different scores of two transcripts at the same locus. In short, the thresholds correspond to scores as follows: e_1: relative fraction of supported introns, e_2: relative fraction of supported stop-codons, e_3 relative fraction of supported start-codons, e_4: absolute intron support, e_5: absolute stop-codon support, e_6 absolute start-codon support. If you want to go more in-depth, you can take a look at our paper. I would try to increase e_1, e_4, e_5, e_6, especially if you want to keep more alternative isoforms per gene.

nicmoya commented 2 years ago

Thank you Lars! For all the advise and for working on the new option for TSEBRA. We very much appreciate your feedback and responsiveness. I will play with these parameters when doing new test runs.

Best, Nic

LarsGab commented 2 years ago

I've added the --keep_gtf option to the 'main' branch (normal TSEBRA version without long-reads) and to the 'long_reads' branch. This option takes a list of GTF files as input in the same way as --gtf, with the difference that all transcripts of these gene sets will be included in the output. In your case, if you want to keep all transcripts from the manual gene set you can run: tsebra.py -g b1.augustus.hints.gtf,b2.augustus.hints.gtf -k curated.genes.gtf -e b1.hintsfile.gff,b2.hintsfile.gff -c default.cfg -o tsebra.gtf ... or with long reads on the 'long_reads' branch: tsebra.py -g b1.augustus.hints.gtf,b2.augustus.hints.gtf -k curated.genes.gtf -e b1.hintsfile.gff,b2.hintsfile.gff -l ong_read_protocol/gmst.global.gtf -c long_reads.cfg -o tsebra.gtf Best, Lars

nicmoya commented 2 years ago

Thank you Lars! I will start testing this new feature of TSEBRA soon. I'll let you know if I experience any issues.

Best, Nic

LarsGab commented 2 years ago

I close this issue since all questions have been answered or it has been inactive for a long time.

bijendrabio commented 1 year ago

I've added the --keep_gtf option to the 'main' branch (normal TSEBRA version without long-reads) and to the 'long_reads' branch. This option takes a list of GTF files as input in the same way as --gtf, with the difference that all transcripts of these gene sets will be included in the output. In your case, if you want to keep all transcripts from the manual gene set you can run: tsebra.py -g b1.augustus.hints.gtf,b2.augustus.hints.gtf -k curated.genes.gtf -e b1.hintsfile.gff,b2.hintsfile.gff -c default.cfg -o tsebra.gtf ... or with long reads on the 'long_reads' branch: tsebra.py -g b1.augustus.hints.gtf,b2.augustus.hints.gtf -k curated.genes.gtf -e b1.hintsfile.gff,b2.hintsfile.gff -l ong_read_protocol/gmst.global.gtf -c long_reads.cfg -o tsebra.gtf Best, Lars

@LarsGab does -k takes transdecoder predicted ORFs gtf?