Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

error, file not found: option --f1 prothint/prothint.gff #643

Open fuesseler opened 1 year ago

fuesseler commented 1 year ago

Hello, I am annotating some vertebrate genomes using BRAKER3. I got quite good results on the first genome, but with the genome of a related species I am running into the trouble described below. As evidence I am using protein data (odb10 vertebrate and sauropsida, plus UniProt for all vertebrates) and RNA-seq data aligned with HISAT2. I am running braker.pl version 3.0.3 via the Docker image.

BRAKER finishes, giving only one (probably relevant) warning:

WARNING: Number of reliable training genes is low (543). Recommended are at least 600 genes

And I get a very low BUSCO score (17% complete) and only ~12k genes.

I looked at the error messages, and found that GeneMark-ETP likely failed:

get_etp_hints.stderr

Died at /opt/ETP/bin/format_back.pl line 14.

GeneMark-ETP.stderr

FASTA index file /hidden/GeneMark-ETP/data/genome.softmasked.fasta.fai created.
20-Jun-23 08:56:14 - INFO: Finding masking penalty maximizing the number of correctly predicted reliable exons in range from 0 to 0.2 with step 0.04
20-Jun-23 08:56:14 - INFO: Running prediction with masking penalty = 0
error: Program exited due to an error in command: /opt/ETP/bin/gmes/gmes_petap.pl --seq /hidden/GeneMark-ETP/proteins.fa/penalty/contigswp0xjo_x.fasta --soft_mask 1000 --max_mask 40000 --predict_with /hidden/GeneMark-ETP/proteins.fa/model/output.mod --cores 40 --mask_penalty 0
error, file not found: option --f1 prothint/prothint.gff
grep: prothint/evidence.gff: No such file or directory
grep: prothint/evidence.gff: No such file or directory
Traceback (most recent call last):
  File "/opt/ETP/bin/printRnaAlternatives.py", line 353, in <module>
    main()
  File "/opt/ETP/bin/printRnaAlternatives.py", line 289, in main
    candidates = loadIntrons(args.genemark)
  File "/opt/ETP/bin/printRnaAlternatives.py", line 193, in loadIntrons
    for row in csv.reader(open(inputFile), delimiter='\t'):
FileNotFoundError: [Errno 2] No such file or directory: 'pred_m/genemark.gtf'
error, file not found: option --f1 prothint/prothint.gff
grep: prothint/evidence.gff: No such file or directory
grep: prothint/evidence.gff: No such file or directory
Died at /opt/ETP/bin/format_back.pl line 14.
Died at /opt/ETP/bin/format_back.pl line 14.
error, file not found: option --f1 prothint/prothint.gff
grep: prothint/evidence.gff: No such file or directory
grep: prothint/evidence.gff: No such file or directory
Died at /opt/ETP/bin/format_back.pl line 14.
Died at /opt/ETP/bin/format_back.pl line 14.

I suspect that the problem is RNA-seq evidence that is too sparse or not diverse enough, since the protein input is the same as for the first genome, where everything worked without problems. Would you agree, or is there anything else I should try when troubleshooting? If I interpret the errors correctly, though, they seem to point to the ProtHint output not being found.
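For anyone triaging a similar log, here is a minimal Python sketch that tallies which files the stderr output reports as missing. The three regular expressions only cover the message shapes visible in the excerpt above; the function name and the patterns are my own illustration, not part of BRAKER or GeneMark-ETP.

```python
import re
from collections import Counter

# Message shapes taken from the stderr excerpt above (assumption: other
# GeneMark-ETP versions may word these messages differently).
MISSING_FILE_PATTERNS = [
    re.compile(r"error, file not found: option --\w+ (\S+)"),
    re.compile(r"grep: (\S+): No such file or directory"),
    re.compile(r"FileNotFoundError: \[Errno 2\] No such file or directory: '([^']+)'"),
]


def missing_files(stderr_text: str) -> Counter:
    """Count how often each path is reported as missing in stderr_text."""
    counts = Counter()
    for pattern in MISSING_FILE_PATTERNS:
        counts.update(pattern.findall(stderr_text))
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python missing_files.py GeneMark-ETP.stderr
    with open(sys.argv[1]) as handle:
        for path, n in missing_files(handle.read()).most_common():
            print(f"{n}\t{path}")
```

On the log above this would list prothint/prothint.gff, prothint/evidence.gff and pred_m/genemark.gtf, which suggests looking at the step that should have produced them, not at the scripts that later fail to read them.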

Would you recommend adding RNA-seq data from the related genus (more diverse tissues) as evidence, even though there is considerable evolutionary distance (which might mean a worse alignment rate with HISAT2)?

KatharinaHoff commented 1 year ago

This issue is most likely caused by a lack of RNA-Seq evidence.

We currently can't predict how much RNA-Seq data is enough for a successful run. This is an open problem that we are trying to solve, but it will probably take a couple of months.

In this case I recommend performing separate runs of BRAKER1 and BRAKER2, then testing whether merging them with TSEBRA is beneficial (with default settings it might not be; you might have to enforce the better gene set).
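A rough sketch of that workflow, written as Python helpers that only assemble the command lines and do not run anything. The flag names (`--genome`, `--bam`, `--prot_seq`, `--workingdir` for braker.pl; `-g`, `-e`, `-c`, `-o`, `-k` for tsebra.py) are given as I understand them from the BRAKER and TSEBRA READMEs, so please check them against `--help` for your installed versions. All file and directory names are hypothetical placeholders.

```python
import shlex


def braker_rnaseq_only(genome, bam, workdir):
    """BRAKER1-style run: RNA-Seq alignments only."""
    return ["braker.pl", f"--genome={genome}", f"--bam={bam}",
            f"--workingdir={workdir}"]


def braker_protein_only(genome, proteins, workdir):
    """BRAKER2-style run: protein database only."""
    return ["braker.pl", f"--genome={genome}", f"--prot_seq={proteins}",
            f"--workingdir={workdir}"]


def tsebra_merge(gene_sets, hint_files, config, output, keep=()):
    """Merge gene sets with TSEBRA; `keep` lists gene sets to enforce (-k)."""
    cmd = ["tsebra.py", "-e", ",".join(hint_files), "-c", config, "-o", output]
    if gene_sets:
        cmd += ["-g", ",".join(gene_sets)]
    if keep:
        cmd += ["-k", ",".join(keep)]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths; the output names braker.gtf and hintsfile.gff are assumptions.
    steps = [
        braker_rnaseq_only("genome.fa", "rnaseq.bam", "braker1"),
        braker_protein_only("genome.fa", "proteins.fa", "braker2"),
        tsebra_merge(["braker1/braker.gtf"],
                     ["braker1/hintsfile.gff", "braker2/hintsfile.gff"],
                     "default.cfg", "merged.gtf",
                     keep=["braker2/braker.gtf"]),
    ]
    for step in steps:
        print(shlex.join(step))
```

In the example the protein-based set is the one passed to `-k` purely for illustration; which set to enforce depends on which one scores better (for instance by BUSCO) on your genome.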

fuesseler commented 1 year ago

Thanks for the quick reply! Before BRAKER3 was released I was running BRAKER1 + BRAKER2 + TSEBRA with prefbraker1.cfg. With that approach I had the problem (with all my genomes) that BUSCO completeness was quite low at ~80%, while the number of genes was about double what I expected (36 thousand). That is why I was quite happy when BRAKER3 came out, as for the first genome it gives better BUSCO scores and gene numbers.

For the same species where I encountered this GeneMark problem, I also have a different (less contiguous) draft genome from another individual; let's call it 10XDraft for simplicity. I tried running BRAKER3 on it with the same evidence (the same proteins, and the same RNA-seq reads aligned with HISAT2 to the other genome). HISAT2 alignment success was similar. Interestingly, I did not encounter the errors shown above in this run on the 10XDraft, despite pretty much the same amount of evidence.

I tried to track down where the amount of evidence starts to diminish, and I think it is at the beginning of the GMST filtering and classification step, so I have attached the filter_gmst.log files of both BRAKER3 runs (same evidence, just a different draft genome for the same species). In the first DIAMOND database that gets constructed, both runs have the same number of sequences (6221485). Something must then differ in the subsequent steps (I think gms2hints.pl, proteins_from_gtf.pl and diamond blastp), because in the second DIAMOND database that gets constructed, the 10XDraft run retains 89990 sequences, while the first, "problematic" run retains only 1180 sequences.
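To double-check such counts independently of the logs, a small Python sketch like the following counts the records in a protein FASTA file. The function name and the file paths are hypothetical, since I do not know which intermediate files each step writes in a given GeneMark-ETP version.

```python
def count_fasta_records(path):
    """Return the number of sequences (header lines starting with '>') in a FASTA file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


if __name__ == "__main__":
    import sys

    # Usage: python count_fasta.py run_a/proteins.faa run_b/proteins.faa
    for fasta in sys.argv[1:]:
        print(f"{count_fasta_records(fasta)}\t{fasta}")
```

With the numbers quoted above, the 10XDraft run retains about 76 times as many sequences as the problematic run (89990 / 1180), so comparing the inputs to that step between the two runs looks like the natural next thing to check.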

I do have more RNA-seq data (but from blood of different individuals, a tissue that was already included), so I will try including that; maybe it's not as redundant as expected and will help.

10XDraft_filter_gmst.log Problematic_filter_gmst.log

KatharinaHoff commented 1 year ago

Please contact Alexandre Lomsadze and Mark Borodovsky at Georgia Tech. They might not be looking at the GitHub Issues page. This seems to be a GeneMark-ETP problem.

fuesseler commented 1 year ago

Supplying more RNA evidence "fixed" the problem, in the sense that GeneMark-ETP no longer fails and I get better end results. Thanks for your help!