gatech-genemark / GeneMark-ETP

GeneMark-ETP: gene finding in eukaryotic genomes supported by transcriptome sequencing and protein homology

GeneMark-ETP for fungi #10

Open pstrope opened 1 year ago

pstrope commented 1 year ago

Hi,

I am testing GeneMark-ETP on fungal genomes, but I have not been successful. Does it work on fungi yet? The error I get is:

### create RNA-Seq hints ... done

### predict genes gmst
### predict genes gmst ... done

### generate ProtHint predictions with GeneMarkS-T seeds
### generate ProtHint predictions with GeneMarkS-T seeds ... done

### filter gmst predictions
error, file not found: option --f1 complete.gtf
error on open file complete.id: No such file or directory
mv: cannot stat 'complete_uniq.gtf': No such file or directory
### filter gmst predictions ... done

### prepare genome sequence for training
error on open file /xxx/GeneMark-ETP/scer/out4/rnaseq/hints/Fungi.fa/complete.gtf: No such file or directory
error on create_regions.pl at ../bin/gmetp.pl line 2162.

alexlomsadze commented 10 months ago

Hello,

I will start with a comment for users interested in fungal protein-coding gene prediction.

The GeneMark-ETP algorithm, as published in a 2023 bioRxiv preprint, is designed to find genes in eukaryotic species. Fungi are eukaryotes, so GeneMark-ETP can be used for fungal gene prediction as-is.

In 2008, we published a fungi-specific gene-finding algorithm, GeneMark-ES-fungi (https://pubmed.ncbi.nlm.nih.gov/18757608/), which demonstrated better accuracy than general eukaryotic gene finders on fungal species. The increase in accuracy came from improved modeling of the intron branch point in fungal species.

The fungal branch point model was recently incorporated into GeneMark-ETP. Providing the optional command line parameter "--fungus" switches the ETP algorithm to a novel and unpublished development, GeneMark-ETP-fungi.

Both GeneMark-ETP and GeneMark-ETP-fungi can be used for gene prediction in fungi. The fungal version of ETP is expected to be more accurate.

If there is some issue or error with the novel GeneMark-ETP-fungi algorithm, please report it to us. While waiting for our response, you may use the general GeneMark-ETP on fungal species.

Now, let's return to the reported issue: GeneMark-ETP failed to run on fungal species.

The reported failure happened before the fungi-specific block in GeneMark-ETP-fungi. The error is therefore more general, and GeneMark-ETP without the "--fungus" option should fail at the same location.

A similar ETP failure was traced to a problem with input protein file formatting: https://github.com/Gaius-Augustus/BRAKER/issues/577#issuecomment-1452511938

See the comment by JohnUrban from Mar 2: "And looking more into that, for some reason many of the OrthoDB protein sequences end with a period... e.g.:"

We accounted for possible errors in user input files and improved the stability of the ETP code: https://github.com/gatech-genemark/GeneMark-ETP/issues/6

Please check whether there is a protein file formatting issue on your side, too. If there is, fix the protein file, update ETP to the latest version, and try ETP again.
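
One way to check the protein file for the trailing-period problem quoted above is a short Python script. This is only a minimal sketch, not part of GeneMark-ETP: the function names `clean_protein_fasta` and `clean_file` are hypothetical, and it assumes that a "." at the end of a sequence is the only formatting problem to fix. Other issues in the file would need separate checks.

```python
def clean_protein_fasta(lines):
    """Strip trailing '.' characters from each protein sequence.

    Takes an iterable of FASTA lines and returns (cleaned_lines, flagged),
    where flagged lists the headers of records that ended with a period.
    Assumption: a trailing '.' is the only problem being repaired here.
    """
    records = []  # each entry is [header, [sequence lines]]
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            records.append([line, []])
        elif records:
            records[-1][1].append(line)

    cleaned, flagged = [], []
    for header, seq_lines in records:
        seq = "".join(seq_lines)
        if seq.endswith("."):
            flagged.append(header)
            seq = seq.rstrip(".")
        cleaned.append(header)
        cleaned.append(seq)  # sequence is written unwrapped, on one line
    return cleaned, flagged


def clean_file(in_path, out_path):
    """Read a protein FASTA file, write a cleaned copy, return flagged headers."""
    with open(in_path) as handle:
        cleaned, flagged = clean_protein_fasta(handle)
    with open(out_path, "w") as handle:
        handle.write("\n".join(cleaned) + "\n")
    return flagged
```

For example, `clean_file("proteins.fa", "proteins.clean.fa")` (both file names are placeholders) writes a cleaned copy and returns the headers of the records that were changed. An empty list means no sequence ended with a period, so the cause of the failure is probably elsewhere.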

Thank you for the info, Alex