Closed NatJWalker-Hale closed 11 months ago
Oh - in support of the low support filtering idea (but also perhaps an output labeling error), braker.unsupported.gtf is ordinarily formatted:
ptg000001l AUGUSTUS gene 18104 19279 . + . g1
ptg000001l AUGUSTUS transcript 18104 19279 1 + . g1.t1
ptg000001l AUGUSTUS start_codon 18104 18106 . + 0 transcript_id "g1.t1"; gene_id "g1";
ptg000001l AUGUSTUS CDS 18104 19279 1 + 0 transcript_id "g1.t1"; gene_id "g1";
ptg000001l AUGUSTUS exon 18104 19279 . + . transcript_id "g1.t1"; gene_id "g1";
ptg000001l AUGUSTUS stop_codon 19277 19279 . + 0 transcript_id "g1.t1"; gene_id "g1";
ptg000001l AUGUSTUS gene 29522 29743 . + . g2
ptg000001l AUGUSTUS transcript 29522 29743 1 + . g2.t1
ptg000001l AUGUSTUS start_codon 29522 29524 . + 0 transcript_id "g2.t1"; gene_id "g2";
ptg000001l AUGUSTUS CDS 29522 29743 1 + 0 transcript_id "g2.t1"; gene_id "g2";
and contains
grep "gene " braker.unsupported.gtf | wc -l
38274
~38k genes. So either many many things are being filtered, or the TSEBRA merging of Augustus and GeneMark-ETP predictions is broken somehow.
Thanks again,
Nat
Ah, seems like this is very likely an issue on my end - there are no prothint evidence files and errors for GeneMark-ETP indicate protein-based annotation failed. I will try to fix that and rerun.
@NatJWalker-Hale I believe I am diagnosing the same problem as you right now. Are there any stdout/stderr/log files you found reporting errors? What files clued you in on the protein-based annotation fail?
@JohnUrban et al. - in case it is informative,errors/GeneMark-ETP.stderr
contains the following:
FASTA index file /home/tguser/NWH/Maustralis/HiFi/hifiasm/51mer/annotation/20230305_redo/braker3/GeneMark-ETP/data/genome.softmasked.fasta.fai created.
Use of uninitialized value $ph1 in addition (+) at /opt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph0 in addition (+) at /opt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph2 in addition (+) at /opt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph0 in division (/) at /opt/ETP/bin/gmes/parse_set.pl line 208.
Illegal division by zero at /opt/ETP/bin/gmes/parse_set.pl line 208.
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GC.mat: No such file or directory
cat: GC.mat: No such file or directory
cat: GC.mat: No such file or directory
Illegal division by zero at /opt/ETP/bin/train_super.pl line 184.
06-Mar-23 06:41:49 - INFO: Finding masking penalty maximizing the number of correctly predicted reliable exons in range from 0 to 0.2 with step 0.04
06-Mar-23 06:41:49 - INFO: Running prediction with masking penalty = 0
error: Program exited due to an error in command: /opt/ETP/bin/gmes/gmes_petap.pl --seq /home/tguser/NWH/Maustralis/HiFi/hifiasm/51mer/annotation/20230305_redo/braker3/GeneMark-ETP/proteins.fa/penalty/contigsatdnhyou.fasta --soft_mask 1000 --max_mask 40000 --predict_with /home/tguser/NWH/Maustralis/HiFi/hifiasm/51mer/annotation/20230305_redo/braker3/GeneMark-ETP/proteins.fa/model/output.mod --cores 32 --mask_penalty 0
error, file not found: option --f1 prothint/prothint.gff
grep: prothint/evidence.gff: No such file or directory
grep: prothint/evidence.gff: No such file or directory
Traceback (most recent call last):
File "/opt/ETP/bin/printRnaAlternatives.py", line 353, in <module>
main()
File "/opt/ETP/bin/printRnaAlternatives.py", line 289, in main
candidates = loadIntrons(args.genemark)
File "/opt/ETP/bin/printRnaAlternatives.py", line 193, in loadIntrons
for row in csv.reader(open(inputFile), delimiter='\t'):
FileNotFoundError: [Errno 2] No such file or directory: 'pred_m/genemark.gtf'
error, file not found: option --f1 prothint/prothint.gff
grep: prothint/evidence.gff: No such file or directory
grep: prothint/evidence.gff: No such file or directory
Died at /opt/ETP/bin/format_back.pl line 14.
Died at /opt/ETP/bin/format_back.pl line 14.
Use of uninitialized value $ph1 in addition (+) at /opt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph0 in addition (+) at /opt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph2 in addition (+) at /opt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph0 in division (/) at /opt/ETP/bin/gmes/parse_set.pl line 208.
Illegal division by zero at /opt/ETP/bin/gmes/parse_set.pl line 208.
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GC.mat: No such file or directory
cat: GC.mat: No such file or directory
cat: GC.mat: No such file or directory
Illegal division by zero at /opt/ETP/bin/train_super.pl line 184.
error, file not found: option --f1 prothint/prothint.gff
grep: prothint/evidence.gff: No such file or directory
grep: prothint/evidence.gff: No such file or directory
Died at /opt/ETP/bin/format_back.pl line 14.
Died at /opt/ETP/bin/format_back.pl line 14.
What proteins are you using?
Interesting - I have a combined DB of ODBv10 Embryophyta, Arabidopsis thaliana, and Beta vulgaris. I was receiving warnings for illegal characters so I stripped "*". I might try the switch to v11, thanks for the heads up.
Pre-processed ODBv11 files - link courtesy of @tomasbruna from my debug thread: https://github.com/Gaius-Augustus/BRAKER/issues/577
Having said all that, I am now looking at those log files, and have similar messages -- and I used ODBv11 for this run.
errors/GeneMark.stderr
FASTA index file /central/groups/carnegie_poc/jurban/data/coral/combined-nanopore/annotation/canu_primary/04-braker3/braker3/GeneMark-ETP/data/genome.softmasked.fasta.fai created.
Use of uninitialized value $ph1 in addition (+) at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph0 in addition (+) at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph2 in addition (+) at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph0 in division (/) at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 208.
Illegal division by zero at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 208.
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GC.mat: No such file or directory
cat: GC.mat: No such file or directory
cat: GC.mat: No such file or directory
Illegal division by zero at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/train_super.pl line 184.
02-Mar-23 21:46:35 - INFO: Finding masking penalty maximizing the number of correctly predicted reliable exons in range from 0 to 0.2 with step 0.04
02-Mar-23 21:46:35 - INFO: Running prediction with masking penalty = 0
error: Program exited due to an error in command: /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/gmes_petap.pl --seq /central/groups/carnegie_poc/jurban/data/coral/combined-nanopore/annotation/canu_primary/04-braker3/braker3/GeneMark-ETP/proteins.fa/penalty/contigsdstx1yvu.fasta --soft_mask 1000 --max_mask 40000 --predict_with /central/groups/carnegie_poc/jurban/data/coral/combined-nanopore/annotation/canu_primary/04-braker3/braker3/GeneMark-ETP/proteins.fa/model/output.mod --cores 16 --mask_penalty 0
error, file not found: option --f1 prothint/prothint.gff
grep: prothint/evidence.gff: No such file or directory
grep: prothint/evidence.gff: No such file or directory
Traceback (most recent call last):
File "/central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/printRnaAlternatives.py", line 353, in <module>
main()
File "/central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/printRnaAlternatives.py", line 289, in main
candidates = loadIntrons(args.genemark)
File "/central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/printRnaAlternatives.py", line 193, in loadIntrons
for row in csv.reader(open(inputFile), delimiter='\t'):
FileNotFoundError: [Errno 2] No such file or directory: 'pred_m/genemark.gtf'
error, file not found: option --f1 prothint/prothint.gff
grep: prothint/evidence.gff: No such file or directory
grep: prothint/evidence.gff: No such file or directory
Died at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/format_back.pl line 14.
Died at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/format_back.pl line 14.
Use of uninitialized value $ph1 in addition (+) at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph0 in addition (+) at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph2 in addition (+) at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 205.
Use of uninitialized value $ph0 in division (/) at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 208.
Illegal division by zero at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 208.
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GT.mat: No such file or directory
cat: AG.mat: No such file or directory
cat: GC.mat: No such file or directory
cat: GC.mat: No such file or directory
cat: GC.mat: No such file or directory
Illegal division by zero at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/train_super.pl line 184.
error, file not found: option --f1 prothint/prothint.gff
grep: prothint/evidence.gff: No such file or directory
grep: prothint/evidence.gff: No such file or directory
Died at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/format_back.pl line 14.
Died at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/format_back.pl line 14.
Partial errors/GeneMark.stdout
(edited to remove long lists of contig names):
4269 4269 0 100.00 complete.gtf
4269 4269 0 100.00 complete.gtf
# from file complete.id parsed IDs: 4270
# not found in input: 1
# done
primary_contig_1
primary_contig_2
.
.
.
primary_contig_178
primary_contig_182
# from file training.list parsed IDs: 4269
# not found in input: 0
# done
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
65876 59008 6868 89.57 p_hints_nonhc.gtf
771884 59008 712876 7.64 r_hints_nonhc.gtf
# number of transcripts in file: 4269 /central/groups/carnegie_poc/jurban/data/coral/combined-nanopore/annotation/canu_primary/04-braker3/braker3/GeneMark-ETP/proteins.fa/genemark_supported.gtf
# number of genes in set: 4269
# removed partial: 0
# genes found for training: 4269
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
error, no valid sequences were found
Braker3 worked for me on the provided test3
and on a toy sample with ODBv10 that was fixed to remove the "." characters at the end of some protein sequences.
It looks like ODBv11 may have its own problems, but not "." at the end of protein sequences. I will try debugging Braker3 with ODBv11 on a small toy example. Meanwhile, I will see if Braker3 finishes on my real datasets/genome using the fixed ODBv10.
Best of luck. I will let you know if I figure anything out.
Thanks! Looking forward to seeing if you discover something.
@tomasbruna given the similar errors, could you elaborate on your comment https://github.com/Gaius-Augustus/BRAKER/issues/577#issuecomment-1452588922 about the dataset being too small. Is there any way around this?
To clarify, I've verified that test3 works when running through singularity, so it doesn't seem to be anything system-wide.
Well, I can say that my toy sample (which is just a 1 Mb region of the genome) finished when using ODBv11... but there are issues.
That made me wonder if the toy sample for the "fixed" ODB10 really finished without errors... and the answer is that it did have those same GeneMark errors. I just didn't realize it didn't work b/c the braker.gtf was normal looking.
I then looked more closely at my test3 results. Those looked fine. No GeneMark errors.
So... I guess I still have problems with my protein sequences.
Hi,
thanks, @JohnUrban for bringing the issue with the '.' at the end of protein sequences to our attention. We addressed it now in the latest update to our GitHub and Singularity container.
I also encountered the same issue with GeneMark-ETP if you run it with insufficient protein or RNA-Seq evidence, but I haven't analyzed this problem enough to determine when GeneMark-ETP has not enough evidence
We also fixed a small bug for the conversion from gtf to gff3, let me know if your problem with the format persists.
Best, Lars
@NatJWalker-Hale What does this file look like for you: errors/get_etp_hints.stderr
?
For me it says:
Died at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/format_back.pl line 14.
@JohnUrban exactly the same for me:
Died at /opt/ETP/bin/format_back.pl line 14.
@LarsGab thanks for your help. Would you like us to open a new issue for this coverage problem with GeneMark-ETP, since it seems to be the actual operative problem?
Incidentally, I currently have two annotations running, both BRAKER3, but one with significantly more RNA-seq coverage. We'll see if that helps to ameliorate the problem or not.
@LarsGab I don't think it is a not-enough-data problem causing the GeneMark issue. For proteins, I have all metazoan proteins from ODBv10 (or v11) plus a bunch of other protein seqs from closely-related species. For RNA-seq, I have over 1.1 billion reads from a diversity of samples.
Hi @JohnUrban and @NatJWalker-Hale,
sorry for the issues with /opt/ETP/bin/format_back.pl
in GeneMark-ETP.
Would you mind uploading your input files somewhere so we can try to reproduce the issue? I'll also notify Alex Lomsadze, the maintainer of the ETP code.
Thanks, Tomas
@tomasbruna I will get on sharing stuff tomorrow, but I may have figured it out by then.
@NatJWalker-Hale Are your BAMs produced by STAR?
Instead of providing my STAR-produced BAMs, I provided a raw fastq so Braker3/GeneMark can use HiSat2 with StringTie. It appears that this solved the problem -- at least the GeneMark stderr and stdout files did not have all those same errors, and it appears to have finished without error. The braker.gtf file is formatted fine as well. Command was:
braker.pl --genome=${ASM} --rnaseq_sets_ids=subsamp --rnaseq_sets_dirs=${RNA} --prot_seq=${PROTEINS} --workingdir=braker3 --threads=16
Contents of GeneMark stderr file:
==> braker3/errors/GeneMark-ETP.stderr <==
FASTA index file /central/groups/carnegie_poc/jurban/data/coral/scratch/toyanno/braker3/masterbranch/utroff/fastq/all-odb10/braker3/GeneMark-ETP/data/genome.softmasked.fasta.fai created.
Contents of GeneMark stdout file:
==> braker3/errors/GeneMark-ETP.stdout <==
27 27 0 100.00 complete.gtf
27 27 0 100.00 complete.gtf
# from file complete.id parsed IDs: 28
# not found in input: 1
# done
Sc0000000
# from file training.list parsed IDs: 25
# not found in input: 0
# done
285 236 49 82.81 p_hints_nonhc.gtf
317 236 81 74.45 r_hints_nonhc.gtf
304 237 67 77.96 prothint/prothint.gff
317 237 80 74.76 r_hints_nonhc.gtf
304 237 67 77.96 prothint/prothint.gff
317 237 80 74.76 r_hints_nonhc.gtf
# number of transcripts in file: 53 /central/groups/carnegie_poc/jurban/data/coral/scratch/toyanno/braker3/masterbranch/utroff/fastq/all-odb10/braker3/GeneMark-ETP/proteins.fa/genemark_supported.gtf
# number of genes in set: 51
# removed partial: 0
# genes found for training: 51
304 237 67 77.96 prothint/prothint.gff
317 237 80 74.76 r_hints_nonhc.gtf
# number of transcripts in file: 53 /central/groups/carnegie_poc/jurban/data/coral/scratch/toyanno/braker3/masterbranch/utroff/fastq/all-odb10/braker3/GeneMark-ETP/proteins.fa/genemark_supported.gtf
# number of genes in set: 51
# removed partial: 0
# genes found for training: 51
304 237 67 77.96 prothint/prothint.gff
317 237 80 74.76 r_hints_nonhc.gtf
# number of transcripts in file: 53 /central/groups/carnegie_poc/jurban/data/coral/scratch/toyanno/braker3/masterbranch/utroff/fastq/all-odb10/braker3/GeneMark-ETP/proteins.fa/genemark_supported.gtf
# number of genes in set: 51
# removed partial: 0
# genes found for training: 51
304 237 67 77.96 prothint/prothint.gff
317 237 80 74.76 r_hints_nonhc.gtf
p.s. that was on my "toy example". I will now run it on the whole genome and entire dataset.
@JohnUrban great minds! XD I've done the same experiment in parallel and achieved the same results. Seems to be running smoothly.
Suppose there must be something wrong with the STAR SAM formatting or etc.
What I would like to figure out now is how to use strand-specific RNA-seq data -- or if it doesn't matter.
I can imagine 2 ways.
I don't know if this syntax will even work (but it would be cool):
Instead of
--stranded=+,- --bam=${FWD},${REV}
Do:
--stranded=+,- --rnaseq_sets_dirs=${FWD},${REV}
Run hisat2 to produce the BAMs.
hisat2 -x genome -U subsamp.fastq --dta -p 16 -S subsamp.sam
Then separate by strand as done previously and use the older syntax:
--stranded=+,- --bam=${FWD},${REV}
Hi,
STAR aligner should be run with the option --outSAMstrandField intronMotif for BRAKER3 /GeneMark-ETP. See more details below.
From manuals: https://github.com/skovaka/stringtie2 http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf http://daehwankimlab.github.io/hisat2/manual/
StringTie manuals: “Any SAM spliced read alignment (a read alignment across at least one junction) needs to contain the XS tag to indicate the strand from which the RNA that produced this read originated.” “TopHat alignments already include this tag …” “HISAT2 should be run with the --dta option ...” “STAR aligner should be run with the option --outSAMstrandField intronMotif in order to generate this tag.”
STAR manuals: “For unstranded RNA-seq data, Cufflinks/Cuffdiff require spliced alignments with XS strand attribute, which STAR will generate with --outSAMstrandField intronMotif option. As required, the XS strand attribute will be generated for all alignments that contain splice junctions. The spliced alignments that have undefined strand (i.e. containing only non-canonical unannotated junctions) will be suppressed.”
About the stranded input:
Stranded data may improve quality of transcript assembly by StringTie in the case of genes overlapping on opposite strands. Stranded data is not supported by BRAKER3 right now.
Workaround is to align reads by HISAT2 and provide strand information for strand-specific RNA-Seq sets using command line option: --rna-strandness
Then sorted BAM files from HISAT2 runs can be used as input to BRAKER3.
Alex
Alexandre Lomsadze, Senior Research Scientist; Joint Georgia Tech and Emory Wallace H Coulter Department of Biomedical Engineering at Georgia Tech; alexl@gatech.edu
Many thanks @alexlomsadze -- thanks for the lead on --outSAMstrandField intronMotif
.
I think the --dta
option to hisat2 is meant to be used for both stranded and unstranded RNA-seq in prep for StringTie -- something to do with requiring longer anchor lengths around novel exon-exon junctions.
Hello,
I have the same type of problem. Indeed, my braker.gff and braker.gtf files are incomplete. I run BRAKER-v3.0.2 with RNA-seq data and a protein database made of a mix of OrthoDB v10 and "home-made" protein data. This database was working when I was using Braker2 for annotation. Only, this is the first time I use RNA data. My RNA data comes from the nfcore-rnases pipeline, which uses STAR. I checked and among the options used for STAR, the pipeline uses --outSAMstrandField intronMotif option.
Thanks in advance for your help and your work,
Sincerely,
Thomas.
@alexlomsadze Thanks for the information. I tested it for my data and I can confirm that this solved the problem in my case.
@tberthelier Did you get the same error message in your_braker_wdir/errors/GeneMark-ETP.stderr
? Can you check if there are proteins with '.' at the end of their sequence in your_braker_wdir/proteins.fa
?
Thank you for your answer and your reactivity,
Concerning GeneMark-ETP.stderr, I don't get any errors. Here is the content of my file :
FASTA index file /work/tberthelier/Species/Potnod/braker3/log/GeneMark-ETP/data/genome.softmasked.fasta.fai created. 06-Mar-23 17:53:27 - INFO: Finding masking penalty maximizing the number of correctly predicted reliable exons in range from 0 to 0.2 with step 0.04 06-Mar-23 17:53:27 - INFO: Running prediction with masking penalty = 0 06-Mar-23 18:40:27 - INFO: Running prediction with masking penalty = 0.04 06-Mar-23 19:27:34 - INFO: Running prediction with masking penalty = 0.08 06-Mar-23 20:14:36 - INFO: Running prediction with masking penalty = 0.12 06-Mar-23 21:01:39 - INFO: Running prediction with masking penalty = 0.16 06-Mar-23 21:48:40 - INFO: Running prediction with masking penalty = 0.2 06-Mar-23 22:36:19 - INFO: Finding masking penalty maximizing the number of correctly predicted reliable exons in range from 0 to 0.08 with step 0.02 06-Mar-23 22:36:19 - INFO: Running prediction with masking penalty = 0.02 06-Mar-23 23:23:24 - INFO: Running prediction with masking penalty = 0.06 07-Mar-23 00:10:42 - INFO: Finding masking penalty maximizing the number of correctly predicted reliable exons in range from 0 to 0.04 with step 0.01 07-Mar-23 00:10:42 - INFO: Running prediction with masking penalty = 0.01 07-Mar-23 00:58:01 - INFO: Running prediction with masking penalty = 0.03 07-Mar-23 01:44:59 - INFO: Selected baseline penalty for the maximum # of correct reliable predictions: 0.02 07-Mar-23 01:44:59 - INFO: Running prediction with masking penalty = 0.05 07-Mar-23 02:32:05 - INFO: Masking penalty was set to 0.05
I also looked at my protein.fa file. I don't have any "." at the end of the sequence. However, I have "*"' at the end of my sequences added manually. I should add though that a colleague uses this same database, with Braker3 but without adding RNAseq data, and he gets normal files. Thank you again for your help, Thomas.
Ok, it looks like this is a different issue and that the GeneMark-ETP run has completed successfully. The '*' in protein.fa
shouldn't be a problem.
What exactly is missing from your braker.gtf
file? Does your Augustus/augustus.hints.gtf
look normal?
Gene and mRNA are missing...
For the gtf file.
Thanks for sharing the output. It looks like something went wrong during the combination of GeneMark-ETP and Augustus with TSEBRA. Which version of TSEBRA are you using or are you using our Singularity container? We made a small fix for combining the gene sets and converting to gff3 in commit 93b7a564d326bd9bbe3a31408d121efa37e775cd, I suspect that you ran BRAKER with an earlier version. If that is the case, you can fix it by running it again with the current version or doing the last two steps with your current data:
tsebra.py --gtf Augustus/augustus.hints.gtf,/GeneMark-ETP/genemark.gtf --keep_gtf /GeneMark-ETP/train
ing.gtf --hintfiles hintsfile.gff --cfg your_path_to_TSEBRA/config/braker3.cfg --out braker.gtf
gtf2gff.pl < braker.gtf --gff3 --out=braker.gff3
I use the TSEBRA-84270af/ version. Here is the braker.log of this run braker.log
Thank you for your help !
Hello, I tried to run the tsebra.py that you gave me. However, I don't get a gff3 like I used to get with Braker2. I was using the gff3 "augustus.hints.gff3" which suited me in every way for my further manipulations. Here is an example of what I get from tsebra.py.
I would like to have an organized gff3 with an ID=g1; etc, as I the 2nd screenshot shows. Is it possible to get this kind of results?
Concerning the version of tsebra, what is the current version to use ?
Thanks in advance for your answers and in general for your work!
Sincerely,
Thomas.
Hi,
you can use the rename_gtf.py to reformat the GTF file, e.g.:
rename_gtf.py --gtf tsebra_result.gtf --out tsebra_result_renamed.gtf
You can then convert it to gff3 and it should look similar to the second screenshot.
Please use the most current version of TSEBRA, which is currently v.1.1.0.
Best, Lars
Thank you for your answer. Concerning the version of TSEBRA, I use the old version. I will test with the new one as soon as possible and I will let you know.
Thanks to you,
Best wishes,
Thomas.
Hello, I ran BRAKER3 again with TSEBRA v.1.1.0. This time, I get a correctly formatted braker.gff3 file for the continuation of my analysis. However, I would like to have a precision. If I understood correctly, TSEBRA selects the transcripts predicted by Augustus and GeneMark-ETP. In the braker.gff3 obtained, am I not supposed to have other predictors than Augustus in the 2nd column of my braker.gff3 file ? In the braker.gft file, I do get gmst and GeneMark in this second column. Thanks in advance for your help,
Sincerely,
Thomas.
Hi,
predicitons from gmst
and GeneMark.hmm3
come from the GeneMark-ETP output and the AUGUSTUS
predicitons from the Augustus gene set.
Everything seems to be correct with your prediciton.
Best, Lars
Okay. Everything looks good to me. Thanks for all the help and explanations. I'm still available in case you need more information.
Have a nice day, Thomas.
Hello, Sorry to bother you again. Looking at the braker.gff3 I got during my last analysis, I notice that the information in the 3rd column (feature) is incomplete compared to the one in the augustus.hints.gff3 file. For example : braker.gff3
augustus.hints.gff3:
As you can see, there is no gene, mRNA type info in the braker.gff3 but only start CDS and intron... For the continuation of my analysis I really need gene and mRNA info. Do you have a solution ?
Thank you in advance for all the help you can give me,
Sincerely,
Thomas
Hello again I want to add to my last comment some clarifications. Counting the number of genes in braker.gff3, I find a total of 29739. However, in the braker.aa file, the total is 37513. So there is obviously an inconsistency (when I perform the same operations on the augustus files, I find the same number). I hope that these precisions can be useful to you. I am available if needed.
Thanks in advance for your help,
Sincerely,
Thomas.
Hi,
I'm sorry for the late response. We fixed the problem with converting gtf to gff3 format in the latest release (v3.0.3) of BRAKER. If you have downloaded BRAKER from GitHub, please make sure that you also update your TSEBRA version.
Best regards, Lars
Hi guys,
I am curious about the output from BRAKER3. I am running using the latest singularity container, with the following command (taken from the top of braker.log):
/opt/BRAKER/scripts/braker.pl --species=myspecies --gff3 --threads 32 --genome=path/to/myspecies/genome.fa.masked --prot_seq=/path/to/model/proteins.faa --bam=aligned_reads_Aligned.out.bam --workingdir=.
(RNA-seq reads were aligned with STAR).
The run appears to finish successfully. The dir
Augustus
contains a GTF and GFF3 with expected format (i.e., 3rd column contains gene, mRNA, start_codon, CDS, exon, etc.), containing ~38,000 genes.However, the files braker.gtf, braker.gff3, etc., have a different format:
3rd column contains only CDS, start_codon, stop_codon elements. Also, this file only lists ~14k genes.
Do you have any idea what might be happening here? There is not a huge amount of RNA-seq data, so I wonder if when filtering for supported predictions many get removed? But also the output gff3 seems totally malformed, so is there some error in writing the final output?
Please let me know if you have any leads. For now I will just use the predictions from Augustus as is.
Thanks,
Nat