Format .gb for training

kimnegrette3 commented 4 years ago

Hi! I want to train AUGUSTUS using a .gbff file I downloaded from NCBI: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/611/645/GCF_000611645.1_mono_v1/GCF_000611645.1_mono_v1_genomic.gbff.gz
But the command
randomSplit.pl Monoraphidium_neglectum_genomic.gbff 100
fails with "size 100 is greater than the number of genes in file Monoraphidium_neglectum_genomic.gbff. Aborting." The file of course has more than 100 genes, so it seems the format is not quite right. What exactly should I change in the file? I would really appreciate any help. Thanks!

Kimberly.

KatharinaHoff commented 4 years ago

You can download the same annotation in gff3 format, as well as the genome sequence. (Possibly you need to simplify sequence names in both files.) Use this to generate the GenBank file for training AUGUSTUS. Please do not use all genes. 2000 - 10000 genes are sufficient.

Our tools don’t work on NCBI's gbff format.
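On simplifying sequence names: a minimal sketch, assuming the mismatch is the usual one where FASTA headers carry a free-text description after the accession while column 1 of the GFF3 holds only the accession. The function name `simplify_fasta_headers` is made up for illustration and is not part of AUGUSTUS; check your own files before relying on it.

```python
def simplify_fasta_headers(lines):
    """Yield FASTA lines with every header cut to its first whitespace-delimited token.

    Assumption: the first token of each header is the sequence ID that also
    appears in column 1 of the GFF3 file.
    """
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            yield ">" + name + "\n"
        else:
            # Sequence lines pass through unchanged.
            yield line


def simplify_fasta_file(src_path, dst_path):
    """Write a copy of src_path to dst_path with simplified headers."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(simplify_fasta_headers(src))
```

Calling `simplify_fasta_file("genome.fasta", "genome.simple.fasta")` (both file names are placeholders) leaves the sequence data untouched and only shortens the header lines.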

Katharina


lalalagartija commented 7 months ago

I had the same issue with the gb format. It comes from the fact that randomSplit.pl searches for "LOCUS" as the gene tag, while in NCBI's GenBank format the tag is "gene". I solved it by downloading the GFF3 annotation and the genome FASTA, then running
gff2gbSmallDNA.pl genome.gff3 genome.fasta 100 genome_augustus.gb
(use more than 100 for eukaryotes!). After that, randomSplit.pl worked.
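A quick way to see the difference before running the split: a rough sketch, assuming (as described above) that the number of "LOCUS" lines is what matters to the script. It counts LOCUS lines and "gene" feature lines in a GenBank-style file. The function name `count_locus_and_genes` is hypothetical and not part of AUGUSTUS.

```python
import re

# A "gene" feature key sits at column 6 of a GenBank feature table line.
GENE_FEATURE = re.compile(r"^ {5}gene\s")


def count_locus_and_genes(lines):
    """Return (number of LOCUS lines, number of gene feature lines)."""
    n_locus = 0
    n_gene = 0
    for line in lines:
        if line.startswith("LOCUS"):
            n_locus += 1
        elif GENE_FEATURE.match(line):
            n_gene += 1
    return n_locus, n_gene


def count_in_file(path):
    """Convenience wrapper that opens path and counts its records."""
    with open(path) as handle:
        return count_locus_and_genes(handle)
```

On an NCBI .gbff you would expect few LOCUS lines (one per scaffold) and many gene features, whereas the output of gff2gbSmallDNA.pl should have roughly one LOCUS per gene. If the LOCUS count is not larger than the size passed to randomSplit.pl, the "size ... is greater than the number of genes" error is the likely result.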

Gaius-Augustus / Augustus

Format .gb for training #150