Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

IsoSeq transcriptome data #722

Closed ardy20 closed 7 months ago

ardy20 commented 9 months ago

Hi, can we use IsoSeq transcriptome data for the annotation? If yes, how do we add the data to the script? Regards

KatharinaHoff commented 9 months ago

There are two options:

  1. https://github.com/Gaius-Augustus/BRAKER/blob/master/docs/long_reads/long_read_protocol.md This protocol is a bit older, but it lets you easily use both short and long reads. I recommend reading and understanding it even if you end up choosing the next option instead of applying it.
  2. Run BRAKER with short reads and, separately, run BRAKER with long reads, then merge the resulting gene sets with TSEBRA (see 1.).

In theory, it is rather simple to apply BRAKER3 to long reads in combination with an OrthoDB partition. In practice, we have neither cleanly evaluated this, nor implemented it.

I have a development docker container that currently allows you to input a BAM file with splice-aligned long reads (only long reads!) instead of a BAM file with splice-aligned short reads. Do not input a FASTQ file, do not input SRA IDs; it really only accepts BAM input. It also needs the OrthoDB partition FASTA file as input.

singularity build braker_lr.sif docker://katharinahoff/playground:devel

singularity exec braker_lr.sif braker.pl --genome=genome.fa --prot_seq=orthodb_partition.fa --bam=longreads.bam 
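Since the container only accepts BAM input, a cheap pre-flight check can catch a FASTQ file or an SRA ID passed by mistake. The following is a minimal sketch, not part of BRAKER: the function name `is_probably_bam` is hypothetical, and the check only looks at the file extension and the gzip magic bytes that every BGZF-compressed BAM file starts with. It does not prove that the reads are splice-aligned or that the file is intact.

```shell
# Hypothetical helper, not part of BRAKER: cheap sanity check on the --bam input.
is_probably_bam() {
    # Must be an existing, non-empty regular file.
    [ -f "$1" ] && [ -s "$1" ] || return 1
    # Must carry a .bam extension (rejects .fastq, .fq.gz, bare SRA IDs).
    case "$1" in
        *.bam) ;;
        *) return 1 ;;
    esac
    # BAM is BGZF-compressed, so the file starts with the gzip magic bytes 1f 8b.
    [ "$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')" = "1f8b" ]
}
```

One way to use it is `is_probably_bam longreads.bam || echo "not a BAM file" >&2` before calling braker.pl. If samtools is installed, `samtools quickcheck longreads.bam` is a stricter test of the same file.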

As I said: this is a development/playground container, not a finished BRAKER release. It works; two people tested it independently. What we can cautiously say already is that you need a lot of very high-quality PacBio IsoSeq reads to get good accuracy when running BRAKER this way. If you have low-coverage libraries, or older data with a higher error rate, I advise against using the long-read data in this way at all.

[In addition, run the standard BRAKER3 with short reads + OrthoDB partition, then merge the two BRAKER gene sets with TSEBRA.]
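The merge step could be sketched roughly as below. This is an assumption-laden illustration, not a tested recipe: the wrapper `merge_gene_sets` and the `TSEBRA` variable are hypothetical, and the `-g`, `-e`, `-c` and `-o` options follow the usage shown in the TSEBRA README as I understand it. Check `tsebra.py --help` and the long read protocol linked above for the configuration file that suits long-read evidence.

```shell
# Hypothetical wrapper: merge two BRAKER gene sets with TSEBRA.
# Arguments: <short-read gtf> <long-read gtf> <short-read hints> <long-read hints> <cfg> <output gtf>
# TSEBRA may point at a specific tsebra.py; it defaults to the one on PATH.
merge_gene_sets() {
    if [ "$#" -ne 6 ]; then
        echo "usage: merge_gene_sets SR.gtf LR.gtf SR_hints.gff LR_hints.gff CFG OUT.gtf" >&2
        return 2
    fi
    # Gene sets and hint files are passed as comma-separated lists.
    "${TSEBRA:-tsebra.py}" -g "$1,$2" -e "$3,$4" -c "$5" -o "$6"
}
```

A call might look like `merge_gene_sets braker_sr/braker.gtf braker_lr/braker.gtf braker_sr/hintsfile.gff braker_lr/hintsfile.gff default.cfg merged.gtf`, where the directory names and the configuration file are placeholders for your own runs.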