Very few genes in braker.gff3, weird format

NatJWalker-Hale commented 1 year ago

Hi guys,

I am curious about the output from BRAKER3. I am running using the latest singularity container, with the following command (taken from the top of braker.log):

/opt/BRAKER/scripts/braker.pl --species=myspecies --gff3 --threads 32 --genome=path/to/myspecies/genome.fa.masked --prot_seq=/path/to/model/proteins.faa --bam=aligned_reads_Aligned.out.bam --workingdir=.

(RNA-seq reads were aligned with STAR).

The run appears to finish successfully. The dir Augustus contains a GTF and GFF3 with expected format (i.e., 3rd column contains gene, mRNA, start_codon, CDS, exon, etc.), containing ~38,000 genes.

However, the files braker.gtf, braker.gff3, etc., have a different format:

ptg000001l  AUGUSTUS    CDS 18104   19279   1   +   0   ID=g1.t1.CDS1;Parent=g1.t1;
ptg000001l  AUGUSTUS    stop_codon  19277   19279   .   +   0   ID=g1.t1.stop1;Parent=g1.t1;
ptg000001l  AUGUSTUS    stop_codon  53542   53544   .   -   0   ID=g9.t1.stop1;Parent=g9.t1;
ptg000001l  AUGUSTUS    CDS 53542   53766   1   -   0   ID=g9.t1.CDS1;Parent=g9.t1;
ptg000001l  AUGUSTUS    start_codon 53764   53766   .   -   0   ID=g9.t1.start1;Parent=g9.t1;
ptg000001l  AUGUSTUS    stop_codon  64896   64898   .   -   0   ID=g11.t1.stop1;Parent=g11.t1;
ptg000001l  AUGUSTUS    CDS 64896   66872   1   -   0   ID=g11.t1.CDS1;Parent=g11.t1;
ptg000001l  AUGUSTUS    start_codon 66870   66872   .   -   0   ID=g11.t1.start1;Parent=g11.t1;
ptg000001l  AUGUSTUS    start_codon 165896  165898  .   +   0   ID=g16.t1.start1;Parent=g16.t1;

3rd column contains only CDS, start_codon, stop_codon elements. Also, this file only lists ~14k genes.

Do you have any idea what might be happening here? There is not a huge amount of RNA-seq data, so I wonder if when filtering for supported predictions many get removed? But also the output gff3 seems totally malformed, so is there some error in writing the final output?

Please let me know if you have any leads. For now I will just use the predictions from Augustus as is.

Thanks,

Nat

NatJWalker-Hale commented 1 year ago

Oh - in support of the low support filtering idea (but also perhaps an output labeling error), braker.unsupported.gtf is ordinarily formatted:

ptg000001l  AUGUSTUS    gene    18104   19279   .   +   .   g1
ptg000001l  AUGUSTUS    transcript  18104   19279   1   +   .   g1.t1
ptg000001l  AUGUSTUS    start_codon 18104   18106   .   +   0   transcript_id "g1.t1"; gene_id "g1";
ptg000001l  AUGUSTUS    CDS 18104   19279   1   +   0   transcript_id "g1.t1"; gene_id "g1";
ptg000001l  AUGUSTUS    exon    18104   19279   .   +   .   transcript_id "g1.t1"; gene_id "g1";
ptg000001l  AUGUSTUS    stop_codon  19277   19279   .   +   0   transcript_id "g1.t1"; gene_id "g1";
ptg000001l  AUGUSTUS    gene    29522   29743   .   +   .   g2
ptg000001l  AUGUSTUS    transcript  29522   29743   1   +   .   g2.t1
ptg000001l  AUGUSTUS    start_codon 29522   29524   .   +   0   transcript_id "g2.t1"; gene_id "g2";
ptg000001l  AUGUSTUS    CDS 29522   29743   1   +   0   transcript_id "g2.t1"; gene_id "g2";

and contains

grep "gene   "  braker.unsupported.gtf | wc -l
38274

~38k genes. So either many many things are being filtered, or the TSEBRA merging of Augustus and GeneMark-ETP predictions is broken somehow.

Thanks again,

Nat

NatJWalker-Hale commented 1 year ago

Ah, seems like this is very likely an issue on my end - there are no prothint evidence files and errors for GeneMark-ETP indicate protein-based annotation failed. I will try to fix that and rerun.

JohnUrban commented 1 year ago

@NatJWalker-Hale I believe I am diagnosing the same problem as you right now. Are there any stdout/stderr/log files you found reporting errors? What files clued you in on the protein-based annotation fail?

NatJWalker-Hale commented 1 year ago

@JohnUrban et al. - in case it is informative,errors/GeneMark-ETP.stderr contains the following:

FASTA index file /home/tguser/NWH/Maustralis/HiFi/hifiasm/51mer/annotation/20230305_redo/braker3/GeneMark-ETP/data/genome.softmasked.fasta.fai created.
Use of uninitialized value $ph1 in addition (+) at /opt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph0 in addition (+) at /opt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph2 in addition (+) at /opt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph0 in division (/) at /opt/ETP/bin/gmes/parse_set.pl line 208.
Illegal division by zero at /opt/ETP/bin/gmes/parse_set.pl line 208.
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GC.mat: No such file or directory
cat: GC.mat: No such file or directory
cat: GC.mat: No such file or directory
Illegal division by zero at /opt/ETP/bin/train_super.pl line 184.
06-Mar-23 06:41:49 - INFO: Finding masking penalty maximizing the number of correctly predicted reliable exons in range from 0 to 0.2 with step 0.04
06-Mar-23 06:41:49 - INFO: Running prediction with masking penalty = 0
error: Program exited due to an error in command: /opt/ETP/bin/gmes/gmes_petap.pl --seq /home/tguser/NWH/Maustralis/HiFi/hifiasm/51mer/annotation/20230305_redo/braker3/GeneMark-ETP/proteins.fa/penalty/contigsatdnhyou.fasta --soft_mask 1000 --max_mask 40000  --predict_with /home/tguser/NWH/Maustralis/HiFi/hifiasm/51mer/annotation/20230305_redo/braker3/GeneMark-ETP/proteins.fa/model/output.mod --cores 32 --mask_penalty 0
error, file not found: option --f1 prothint/prothint.gff
grep: prothint/evidence.gff: No such file or directory
grep: prothint/evidence.gff: No such file or directory
Traceback (most recent call last):
  File "/opt/ETP/bin/printRnaAlternatives.py", line 353, in <module>
    main()
  File "/opt/ETP/bin/printRnaAlternatives.py", line 289, in main
    candidates = loadIntrons(args.genemark)
  File "/opt/ETP/bin/printRnaAlternatives.py", line 193, in loadIntrons
    for row in csv.reader(open(inputFile), delimiter='\t'):
FileNotFoundError: [Errno 2] No such file or directory: 'pred_m/genemark.gtf'
error, file not found: option --f1 prothint/prothint.gff
grep: prothint/evidence.gff: No such file or directory
grep: prothint/evidence.gff: No such file or directory
Died at /opt/ETP/bin/format_back.pl line 14.
Died at /opt/ETP/bin/format_back.pl line 14.
Use of uninitialized value $ph1 in addition (+) at /opt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph0 in addition (+) at /opt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph2 in addition (+) at /opt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph0 in division (/) at /opt/ETP/bin/gmes/parse_set.pl line 208.
Illegal division by zero at /opt/ETP/bin/gmes/parse_set.pl line 208.
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GC.mat: No such file or directory
cat: GC.mat: No such file or directory
cat: GC.mat: No such file or directory
Illegal division by zero at /opt/ETP/bin/train_super.pl line 184.
error, file not found: option --f1 prothint/prothint.gff
grep: prothint/evidence.gff: No such file or directory
grep: prothint/evidence.gff: No such file or directory
Died at /opt/ETP/bin/format_back.pl line 14.
Died at /opt/ETP/bin/format_back.pl line 14.

JohnUrban commented 1 year ago

What proteins are you using?

I saw similar error messages when I was using ODBv10 where many protein sequences ended with a "."
One solution was to remove the "." at the end of those sequences.
Another solution was to switch to ODBv11.

NatJWalker-Hale commented 1 year ago

Interesting - I have a combined DB of ODBv10 Embryophyta, Arabidopsis thaliana, and Beta vulgaris. I was receiving warnings for illegal characters so I stripped "*". I might try the switch to v11, thanks for the heads up.

JohnUrban commented 1 year ago

Pre-processed ODBv11 files - link courtesy of @tomasbruna from my debug thread: https://github.com/Gaius-Augustus/BRAKER/issues/577

https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/

JohnUrban commented 1 year ago

Having said all that, I am now looking at those log files, and have similar messages -- and I used ODBv11 for this run.

errors/GeneMark.stderr

FASTA index file /central/groups/carnegie_poc/jurban/data/coral/combined-nanopore/annotation/canu_primary/04-braker3/braker3/GeneMark-ETP/data/genome.softmasked.fasta.fai created.
Use of uninitialized value $ph1 in addition (+) at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph0 in addition (+) at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph2 in addition (+) at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph0 in division (/) at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 208.
Illegal division by zero at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 208.
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GC.mat: No such file or directory
cat: GC.mat: No such file or directory
cat: GC.mat: No such file or directory
Illegal division by zero at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/train_super.pl line 184.
02-Mar-23 21:46:35 - INFO: Finding masking penalty maximizing the number of correctly predicted reliable exons in range from 0 to 0.2 with step 0.04
02-Mar-23 21:46:35 - INFO: Running prediction with masking penalty = 0
error: Program exited due to an error in command: /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/gmes_petap.pl --seq /central/groups/carnegie_poc/jurban/data/coral/combined-nanopore/annotation/canu_primary/04-braker3/braker3/GeneMark-ETP/proteins.fa/penalty/contigsdstx1yvu.fasta --soft_mask 1000 --max_mask 40000  --predict_with /central/groups/carnegie_poc/jurban/data/coral/combined-nanopore/annotation/canu_primary/04-braker3/braker3/GeneMark-ETP/proteins.fa/model/output.mod --cores 16 --mask_penalty 0
error, file not found: option --f1 prothint/prothint.gff
grep: prothint/evidence.gff: No such file or directory
grep: prothint/evidence.gff: No such file or directory
Traceback (most recent call last):
  File "/central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/printRnaAlternatives.py", line 353, in <module>
    main()
  File "/central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/printRnaAlternatives.py", line 289, in main
    candidates = loadIntrons(args.genemark)
  File "/central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/printRnaAlternatives.py", line 193, in loadIntrons
    for row in csv.reader(open(inputFile), delimiter='\t'):
FileNotFoundError: [Errno 2] No such file or directory: 'pred_m/genemark.gtf'
error, file not found: option --f1 prothint/prothint.gff
grep: prothint/evidence.gff: No such file or directory
grep: prothint/evidence.gff: No such file or directory
Died at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/format_back.pl line 14.
Died at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/format_back.pl line 14.
Use of uninitialized value $ph1 in addition (+) at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph0 in addition (+) at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph2 in addition (+) at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph0 in division (/) at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 208.
Illegal division by zero at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 208.
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GC.mat: No such file or directory
cat: GC.mat: No such file or directory
cat: GC.mat: No such file or directory
Illegal division by zero at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/train_super.pl line 184.
error, file not found: option --f1 prothint/prothint.gff
grep: prothint/evidence.gff: No such file or directory
grep: prothint/evidence.gff: No such file or directory
Died at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/format_back.pl line 14.
Died at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/format_back.pl line 14.

Partial errors/GeneMark.stdout (edited to remove long lists of contig names):

4269    4269    0   100.00  complete.gtf
4269    4269    0   100.00  complete.gtf

# from file complete.id parsed IDs: 4270
# not found in input: 1
# done
primary_contig_1
primary_contig_2
.
.
.
primary_contig_178
primary_contig_182
# from file training.list parsed IDs: 4269
# not found in input: 0
# done
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found

65876   59008   6868    89.57   p_hints_nonhc.gtf
771884  59008   712876  7.64    r_hints_nonhc.gtf

# number of transcripts in file: 4269 /central/groups/carnegie_poc/jurban/data/coral/combined-nanopore/annotation/canu_primary/04-braker3/braker3/GeneMark-ETP/proteins.fa/genemark_supported.gtf
# number of genes in set: 4269
# removed partial: 0
# genes found for training: 4269
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found

Braker3 worked for me on the provided test3 and on a toy sample with ODBv10 that was fixed to remove the "." characters at the end of some protein sequences.

It looks like ODBv11 may have its own problems, but not "." at the end of protein sequences. I will try debugging Braker3 with ODBv11 on a small toy example. Meanwhile, I will see if Braker3 finishes on my real datasets/genome using the fixed ODBv10.

Best of luck. I will let you know if I figure anything out.

NatJWalker-Hale commented 1 year ago

Thanks! Looking forward to seeing if you discover something.

NatJWalker-Hale commented 1 year ago

@tomasbruna given the similar errors, could you elaborate on your comment https://github.com/Gaius-Augustus/BRAKER/issues/577#issuecomment-1452588922 about the dataset being too small. Is there any way around this?

To clarify, I've verified that test3 works when running through singularity, so it doesn't seem to be anything system-wide.

JohnUrban commented 1 year ago

Well, I can say that my toy sample (which is just a 1 Mb region of the genome) finished when using ODBv11... but there are issues.

It had those same errors in the GeneMark stdout and stderr files.
This time however the braker.gtf looked normal.

That made me wonder if the toy sample for the "fixed" ODB10 really finished without errors... and the answer is that it did have those same GeneMark errors. I just didn't realize it didn't work b/c the braker.gtf was normal looking.

I then looked more closely at my test3 results. Those looked fine. No GeneMark errors.

So... I guess I still have problems with my protein sequences.

LarsGab commented 1 year ago

Hi,

thanks, @JohnUrban for bringing the issue with the '.' at the end of protein sequences to our attention. We addressed it now in the latest update to our GitHub and Singularity container.

I also encountered the same issue with GeneMark-ETP if you run it with insufficient protein or RNA-Seq evidence, but I haven't analyzed this problem enough to determine when GeneMark-ETP has not enough evidence

We also fixed a small bug for the conversion from gtf to gff3, let me know if your problem with the format persists.

Best, Lars

JohnUrban commented 1 year ago

@NatJWalker-Hale What does this file look like for you: errors/get_etp_hints.stderr ?

For me it says:

Died at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/format_back.pl line 14.

NatJWalker-Hale commented 1 year ago

@JohnUrban exactly the same for me:

Died at /opt/ETP/bin/format_back.pl line 14.

@LarsGab thanks for your help. Would you like us to open a new issue for this coverage problem with GeneMark-ETP, since it seems to be the actual operative problem?

Incidentally, I currently have two annotations running, both BRAKER3, but one with significantly more RNA-seq coverage. We'll see if that helps to ameliorate the problem or not.

JohnUrban commented 1 year ago

@LarsGab I don't think it is a not-enough-data problem causing the GeneMark issue. For proteins, I have all metazoan proteins from ODBv10 (or v11) plus a bunch of other protein seqs from closely-related species. For RNA-seq, I have over 1.1 billion reads from a diversity of samples.

tomasbruna commented 1 year ago

Hi @JohnUrban and @NatJWalker-Hale,

sorry for the issues with /opt/ETP/bin/format_back.pl in GeneMark-ETP.

Would you mind uploading your input files somewhere so we can try to reproduce the issue? I'll also notify Alex Lomsadze, the maintainer of the ETP code.

Thanks, Tomas

JohnUrban commented 1 year ago

@tomasbruna I will get on sharing stuff tomorrow, but I may have figured it out by then.

@NatJWalker-Hale Are your BAMs produced by STAR?

Instead of providing my STAR-produced BAMs, I provided a raw fastq so Braker3/GeneMark can use HiSat2 with StringTie. It appears that this solved the problem -- at least the GeneMark stderr and stdout files did not have all those same errors, and it appears to have finished without error. The braker.gtf file is formatted fine as well. Command was:

braker.pl --genome=${ASM} --rnaseq_sets_ids=subsamp --rnaseq_sets_dirs=${RNA} --prot_seq=${PROTEINS} --workingdir=braker3 --threads=16

Contents of GeneMark stderr file:

==> braker3/errors/GeneMark-ETP.stderr <==
FASTA index file /central/groups/carnegie_poc/jurban/data/coral/scratch/toyanno/braker3/masterbranch/utroff/fastq/all-odb10/braker3/GeneMark-ETP/data/genome.softmasked.fasta.fai created.

Contents of GeneMark stdout file:

==> braker3/errors/GeneMark-ETP.stdout <==

27  27  0   100.00  complete.gtf
27  27  0   100.00  complete.gtf

# from file complete.id parsed IDs: 28
# not found in input: 1
# done
Sc0000000
# from file training.list parsed IDs: 25
# not found in input: 0
# done

285 236 49  82.81   p_hints_nonhc.gtf
317 236 81  74.45   r_hints_nonhc.gtf

304 237 67  77.96   prothint/prothint.gff
317 237 80  74.76   r_hints_nonhc.gtf

304 237 67  77.96   prothint/prothint.gff
317 237 80  74.76   r_hints_nonhc.gtf

# number of transcripts in file: 53 /central/groups/carnegie_poc/jurban/data/coral/scratch/toyanno/braker3/masterbranch/utroff/fastq/all-odb10/braker3/GeneMark-ETP/proteins.fa/genemark_supported.gtf
# number of genes in set: 51
# removed partial: 0
# genes found for training: 51

304 237 67  77.96   prothint/prothint.gff
317 237 80  74.76   r_hints_nonhc.gtf

# number of transcripts in file: 53 /central/groups/carnegie_poc/jurban/data/coral/scratch/toyanno/braker3/masterbranch/utroff/fastq/all-odb10/braker3/GeneMark-ETP/proteins.fa/genemark_supported.gtf
# number of genes in set: 51
# removed partial: 0
# genes found for training: 51

304 237 67  77.96   prothint/prothint.gff
317 237 80  74.76   r_hints_nonhc.gtf

# number of transcripts in file: 53 /central/groups/carnegie_poc/jurban/data/coral/scratch/toyanno/braker3/masterbranch/utroff/fastq/all-odb10/braker3/GeneMark-ETP/proteins.fa/genemark_supported.gtf
# number of genes in set: 51
# removed partial: 0
# genes found for training: 51

304 237 67  77.96   prothint/prothint.gff
317 237 80  74.76   r_hints_nonhc.gtf

JohnUrban commented 1 year ago

p.s. that was on my "toy example". I will now run it on the whole genome and entire dataset.

NatJWalker-Hale commented 1 year ago

@JohnUrban great minds! XD I've done the same experiment in parallel and achieved the same results. Seems to be running smoothly.

Suppose there must be something wrong with the STAR SAM formatting or etc.

JohnUrban commented 1 year ago

What I would like to figure out now is how to use strand-specific RNA-seq data -- or if it doesn't matter.

I can imagine 2 ways.

I don't know if this syntax will even work (but it would be cool): Instead of --stranded=+,- --bam=${FWD},${REV} Do: --stranded=+,- --rnaseq_sets_dirs=${FWD},${REV}
Run hisat2 to produce the BAMs.
```
hisat2 -x genome -U subsamp.fastq --dta -p 16 -S subsamp.sam
```
Then separate by strand as done previously and use the older syntax: --stranded=+,- --bam=${FWD},${REV}

alexlomsadze commented 1 year ago

Hi,

STAR aligner should be run with the option --outSAMstrandField intronMotif for BRAKER3 /GeneMark-ETP. See more details below.

From manuals: https://github.com/skovaka/stringtie2 http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf http://daehwankimlab.github.io/hisat2/manual/

StringTie manuals: “Any SAM spliced read alignment (a read alignment across at least one junction) needs to contain the XS tag to indicate the strand from which the RNA that produced this read originated.” “TopHat alignments already include this tag …” “HISAT2 should be run with the --dta option ...” “STAR aligner should be run with the option --outSAMstrandField intronMotif in order to generate this tag.”

STAR manuals: “For unstranded RNA-seq data, Cufflinks/Cuffdiff require spliced alignments with XS strand attribute, which STAR will generate with --outSAMstrandField intronMotif option. As required, the XS strand attribute will be generated for all alignments that contain splice junctions. The spliced alignments that have undefined strand (i.e. containing only non-canonical unannotated junctions) will be suppressed.”

About the stranded input:

Stranded data may improve quality of transcript assembly by StringTie in the case of genes overlapping on opposite strands. Stranded data is not supported by BRAKER3 right now.

Workaround is to align reads by HISAT2 and provide strand information for strand-specific RNA-Seq sets using command line option: --rna-strandness . And to use –dta option for unstranded RNA-Seq.

Then sorted BAM files from HISAT2 runs can be used as input to BRAKER3.

Alex

Alexandre Lomsadze, Senior Research Scientist; Joint Georgia Tech and Emory Wallace H Coulter Department of Biomedical Engineering at Georgia Tech; alexl@gatech.edu

JohnUrban commented 1 year ago

Many thanks @alexlomsadze -- thanks for the lead on --outSAMstrandField intronMotif.

I think the --dta option to hisat2 is meant to be used for both stranded and unstranded RNA-seq in prep for StringTie -- something to do with requiring longer anchor lengths around novel exon-exon junctions.

tberthelier commented 1 year ago

Hello,

I have the same type of problem. Indeed, my braker.gff and braker.gtf files are incomplete. I run BRAKER-v3.0.2 with RNA-seq data and a protein database made of a mix of OrthoDB v10 and "home-made" protein data. This database was working when I was using Braker2 for annotation. Only, this is the first time I use RNA data. My RNA data comes from the nfcore-rnases pipeline, which uses STAR. I checked and among the options used for STAR, the pipeline uses --outSAMstrandField intronMotif option.

Thanks in advance for your help and your work,

Sincerely,

Thomas.

LarsGab commented 1 year ago

@alexlomsadze Thanks for the information. I tested it for my data and I can confirm that this solved the problem in my case.

@tberthelier Did you get the same error message in your_braker_wdir/errors/GeneMark-ETP.stderr? Can you check if there are proteins with '.' at the end of their sequence in your_braker_wdir/proteins.fa?

tberthelier commented 1 year ago

Thank you for your answer and your reactivity, Concerning GeneMark-ETP.stderr, I don't get any errors. Here is the content of my file : FASTA index file /work/tberthelier/Species/Potnod/braker3/log/GeneMark-ETP/data/genome.softmasked.fasta.fai created. 06-Mar-23 17:53:27 - INFO: Finding masking penalty maximizing the number of correctly predicted reliable exons in range from 0 to 0.2 with step 0.04 06-Mar-23 17:53:27 - INFO: Running prediction with masking penalty = 0 06-Mar-23 18:40:27 - INFO: Running prediction with masking penalty = 0.04 06-Mar-23 19:27:34 - INFO: Running prediction with masking penalty = 0.08 06-Mar-23 20:14:36 - INFO: Running prediction with masking penalty = 0.12 06-Mar-23 21:01:39 - INFO: Running prediction with masking penalty = 0.16 06-Mar-23 21:48:40 - INFO: Running prediction with masking penalty = 0.2 06-Mar-23 22:36:19 - INFO: Finding masking penalty maximizing the number of correctly predicted reliable exons in range from 0 to 0.08 with step 0.02 06-Mar-23 22:36:19 - INFO: Running prediction with masking penalty = 0.02 06-Mar-23 23:23:24 - INFO: Running prediction with masking penalty = 0.06 07-Mar-23 00:10:42 - INFO: Finding masking penalty maximizing the number of correctly predicted reliable exons in range from 0 to 0.04 with step 0.01 07-Mar-23 00:10:42 - INFO: Running prediction with masking penalty = 0.01 07-Mar-23 00:58:01 - INFO: Running prediction with masking penalty = 0.03 07-Mar-23 01:44:59 - INFO: Selected baseline penalty for the maximum # of correct reliable predictions: 0.02 07-Mar-23 01:44:59 - INFO: Running prediction with masking penalty = 0.05 07-Mar-23 02:32:05 - INFO: Masking penalty was set to 0.05

I also looked at my protein.fa file. I don't have any "." at the end of the sequence. However, I have "*"' at the end of my sequences added manually. I should add though that a colleague uses this same database, with Braker3 but without adding RNAseq data, and he gets normal files. Thank you again for your help, Thomas.

LarsGab commented 1 year ago

Ok, it looks like this is a different issue and that the GeneMark-ETP run has completed successfully. The '*' in protein.fa shouldn't be a problem.

What exactly is missing from your braker.gtf file? Does your Augustus/augustus.hints.gtf look normal?

tberthelier commented 1 year ago

Gene and mRNA are missing... Capture d’écran 2023-03-09 à 14 55 08

tberthelier commented 1 year ago

Capture d’écran 2023-03-09 à 14 56 48 For the gtf file.

LarsGab commented 1 year ago

Thanks for sharing the output. It looks like something went wrong during the combination of GeneMark-ETP and Augustus with TSEBRA. Which version of TSEBRA are you using or are you using our Singularity container? We made a small fix for combining the gene sets and converting to gff3 in commit 93b7a564d326bd9bbe3a31408d121efa37e775cd, I suspect that you ran BRAKER with an earlier version. If that is the case, you can fix it by running it again with the current version or doing the last two steps with your current data:

tsebra.py --gtf Augustus/augustus.hints.gtf,/GeneMark-ETP/genemark.gtf --keep_gtf /GeneMark-ETP/train
ing.gtf --hintfiles hintsfile.gff --cfg your_path_to_TSEBRA/config/braker3.cfg --out braker.gtf
gtf2gff.pl < braker.gtf --gff3 --out=braker.gff3

tberthelier commented 1 year ago

I use the TSEBRA-84270af/ version. Here is the braker.log of this run braker.log

tberthelier commented 1 year ago

Thank you for your help !

tberthelier commented 1 year ago

Hello, I tried to run the tsebra.py that you gave me. However, I don't get a gff3 like I used to get with Braker2. I was using the gff3 "augustus.hints.gff3" which suited me in every way for my further manipulations. Here is an example of what I get from tsebra.py.

I would like to have an organized gff3 with an ID=g1; etc, as I the 2nd screenshot shows. Is it possible to get this kind of results?

Concerning the version of tsebra, what is the current version to use ?

Thanks in advance for your answers and in general for your work!

Sincerely,

Thomas.

LarsGab commented 1 year ago

Hi,

you can use the rename_gtf.py to reformat the GTF file, e.g.:

rename_gtf.py --gtf tsebra_result.gtf --out tsebra_result_renamed.gtf

You can then convert it to gff3 and it should look similar to the second screenshot.

Please use the most current version of TSEBRA, which is currently v.1.1.0.

Best, Lars

tberthelier commented 1 year ago

Thank you for your answer. Concerning the version of TSEBRA, I use the old version. I will test with the new one as soon as possible and I will let you know.

Thanks to you,

Best wishes,

Thomas.

tberthelier commented 1 year ago

Hello, I ran BRAKER3 again with TSEBRA v.1.1.0. This time, I get a correctly formatted braker.gff3 file for the continuation of my analysis. However, I would like to have a precision. If I understood correctly, TSEBRA selects the transcripts predicted by Augustus and GeneMark-ETP. In the braker.gff3 obtained, am I not supposed to have other predictors than Augustus in the 2nd column of my braker.gff3 file ? In the braker.gft file, I do get gmst and GeneMark in this second column. Thanks in advance for your help,

Sincerely,

Thomas.

LarsGab commented 1 year ago

Hi,

predicitons from gmst and GeneMark.hmm3 come from the GeneMark-ETP output and the AUGUSTUSpredicitons from the Augustus gene set. Everything seems to be correct with your prediciton.

Best, Lars

tberthelier commented 1 year ago

Okay. Everything looks good to me. Thanks for all the help and explanations. I'm still available in case you need more information.

Have a nice day, Thomas.

tberthelier commented 1 year ago

Hello, Sorry to bother you again. Looking at the braker.gff3 I got during my last analysis, I notice that the information in the 3rd column (feature) is incomplete compared to the one in the augustus.hints.gff3 file. For example : braker.gff3

augustus.hints.gff3:

As you can see, there is no gene, mRNA type info in the braker.gff3 but only start CDS and intron... For the continuation of my analysis I really need gene and mRNA info. Do you have a solution ?

Thank you in advance for all the help you can give me,

Sincerely,

Thomas

tberthelier commented 1 year ago

Hello again I want to add to my last comment some clarifications. Counting the number of genes in braker.gff3, I find a total of 29739. However, in the braker.aa file, the total is 37513. So there is obviously an inconsistency (when I perform the same operations on the augustus files, I find the same number). I hope that these precisions can be useful to you. I am available if needed.

Thanks in advance for your help,

Sincerely,

Thomas.

LarsGab commented 1 year ago

Hi,

I'm sorry for the late response. We fixed the problem with converting gtf to gff3 format in the latest release (v3.0.3) of BRAKER. If you have downloaded BRAKER from GitHub, please make sure that you also update your TSEBRA version.

Best regards, Lars

Gaius-Augustus / BRAKER

Very few genes in braker.gff3, weird format #582