gatech-genemark / ProtHint

Protein hint generation pipeline for gene finding in eukaryotic genomes

The hints file(s) for GeneMark-EX contain less than 150 introns with multiplicity >= 4! #60

Open yzliu01 opened 3 months ago

yzliu01 commented 3 months ago

Hi @tomasbruna,

I ran BRAKER to predict gene structure and hit the problem shown below at the step that runs GeneMark-EX. I used the reference genome and a customized amino acid sequence database with the following command. It worked well for all species in the same genus except for the reference genome of one species. I am NOT using RNA-Seq data.

braker.pl --genome="$genome" --prot_seq="$Apodiea_gene_AA"

braker.log


#**********************************************************************************
#                              RUNNING GENEMARK-EX                                 
#**********************************************************************************
# Sat Jun 22 22:59:29 2024: Preparing genemark_evidence file hints from manual hints...
# Sat Jun 22 22:59:29 2024: Checking whether file /output/proj/data/bee_proj_data/gene_annotation/braker_BomMus_results_for_protseq/genemark_hintsfile.gff contains enough hints and sufficient multiplicity information...
#*********
# WARNING:
# The hints file(s) for GeneMark-EX contain less than 150 introns with multiplicity >= 4! (In total, 2658 unique introns are contained. 16 have a multiplicity >= 4.)
# Possibly, you are trying to run braker.pl on data that does not provide sufficient multiplicity information. This will e.g. happen if you try to use introns generated from assembled RNA-Seq transcripts; or if you try to run braker.pl in epmode with mappings from proteins without sufficient hits per locus. Or if you use the example data set.
# A low number of intron hints with sufficient multiplicity may result in a crash of GeneMark-EX (it should not crash with the example data set).
#*********
# Sat Jun 22 22:59:29 2024: Running GeneMark-EP
# Sat Jun 22 22:59:29 2024: changing into GeneMark-EP directory /output/proj/data/bee_proj_data/gene_annotation/braker_BomMus_results_for_protseq/GeneMark-EP
cd /output/proj/data/bee_proj_data/gene_annotation/braker_BomMus_results_for_protseq/GeneMark-EP
# Sat Jun 22 22:59:29 2024: Running gmes_petap.pl
/home/user/miniforge3/envs/braker3/bin/perl /home/user/proj/sofwtare/gmetp_linux_64/bin/gmes/gmes_petap.pl --verbose 
--seq /output/proj/data/bee_proj_data/gene_annotation/braker_BomMus_results_for_protseq/genome.fa 
--EP /output/proj/data/bee_proj_data/gene_annotation/braker_BomMus_results_for_protseq/genemark_hintsfile.gff 
--cores=8  --gc_donor 0.001 --evidence /output/proj/data/bee_proj_data/gene_annotation/braker_BomMus_results_for_protseq/genemark_evidence.gff  
--soft_mask auto 1>/output/proj/data/bee_proj_data/gene_annotation/braker_BomMus_results_for_protseq/GeneMark-EP.stdout 
2>/output/proj/data/bee_proj_data/gene_annotation/braker_BomMus_results_for_protseq/errors/GeneMark-EP.stderr
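
Editor's note: the counts in the WARNING above (2658 unique introns, 16 with multiplicity >= 4) can be cross-checked directly against genemark_hintsfile.gff. The following is a minimal sketch, not the check that braker.pl itself performs. It assumes the hints file is tab-separated GFF with the feature type in column 3 and an optional `mult=N` entry in column 9, that a hint without `mult=` counts as multiplicity 1, and that repeated lines for the same intron should be summed. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: multiplicity is stored as "mult=N" in the GFF attribute column.
MULT_RE = re.compile(r"(?:^|;)\s*mult=(\d+)")


def intron_multiplicities(gff_lines):
    """Map each unique intron (seqid, start, end, strand) to its multiplicity."""
    mults = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 9 or fields[2].lower() != "intron":
            continue  # keep intron hints only
        match = MULT_RE.search(fields[8])
        # Assumption: a hint without "mult=" has multiplicity 1.
        mult = int(match.group(1)) if match else 1
        key = (fields[0], int(fields[3]), int(fields[4]), fields[6])
        # Assumption: duplicate lines for the same intron add up.
        mults[key] += mult
    return dict(mults)


def summarize(gff_lines, min_mult=4):
    """Return (number of unique introns, number with multiplicity >= min_mult)."""
    mults = intron_multiplicities(gff_lines)
    supported = sum(1 for m in mults.values() if m >= min_mult)
    return len(mults), supported
```

To use it on a real file, open the hints file and pass the file handle, for example `summarize(open("genemark_hintsfile.gff"))`. Comparing the result for the problematic genome with the result for a genome from the same genus that ran cleanly shows how large the gap in intron support is.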

The number of pairs aligned (8804/8804 (100%) pairs aligned) is much smaller than for the other reference genomes (448747/448747 (100%) pairs aligned). It seems that this reference genome is very distant from the protein database. Can you give any hints on how to address this issue?

Tail of the error output file (tail -30 gene_annotation_AndBic.40649391.e):


[Sat Jun 22 22:58:48 2024] Enqueueing pair 8796/8804 (99.9%). Est. time left: 00:00:01 (hh:mm:ss)
[Sat Jun 22 22:59:27 2024] 8804/8804 (100%) pairs aligned
[Sat Jun 22 22:59:27 2024] Alignment of pairs finished
[Sat Jun 22 22:59:27 2024] Translating coordinates from local pair level to contig level
[Sat Jun 22 22:59:27 2024] Finished spliced alignment
[Sat Jun 22 22:59:27 2024] Flagging top chains
[Sat Jun 22 22:59:28 2024] Processing the output
[Sat Jun 22 22:59:29 2024] Output processed
[Sat Jun 22 22:59:29 2024] ProtHint finished.
ERROR in file /home/user/miniforge3/envs/braker3/bin/braker.pl at line 5414
Failed to execute: /home/user/miniforge3/envs/braker3/bin/perl /home/user/proj/sofwtare/gmetp_linux_64/bin/gmes/gmes_petap.pl --verbos ...

tomasbruna commented 3 months ago

It's unlikely that the reference proteins would be close enough for some members of the genus but not for others.

This looks like a technical issue, possibly with the assembly of that one genome. You can send me the assembly of one of the genomes where the algorithm works well, the assembly of the problematic one, and the protein database, and I'll take a look (please share by email, bruna.tomas@gmail.com, if you don't want your data to appear here).

yzliu01 commented 3 months ago

OK, the data is a bit too big to share here, so I just sent it to you by email. Please check it. Much appreciated!