bg7 / BG7

bacterial genome annotation system
bg7.ohnosequences.com

LOCUS name not unique in gbk files #13

Open marina-manrique opened 12 years ago

marina-manrique commented 12 years ago

The locus name used when creating the gbk files is taken from the GenBank XML file (the 'locus_name' tag).

When a genome has several contigs, all of them get the same LOCUS name (the name taken from the XML file). However, the LOCUS name should be unique for each contig. It could be based on the contig ID, which is unique for each contig. According to the GenBank release notes (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt), the locus name 'is always sixteen characters or less, begins in position 13'.

So, for each contig, I suggest taking the whole contig ID if it is 16 characters or less, and the last 16 characters of the contig ID if it is longer than that. @rtobes, what do you think?
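
The rule suggested above can be sketched in a few lines of Python. This is only an illustration of the proposal, not code from BG7: the function name `locus_name_from_contig_id` and the constant `MAX_LOCUS_LEN` are hypothetical.

```python
# Hypothetical helper illustrating the proposed rule; not part of BG7.
MAX_LOCUS_LEN = 16  # limit quoted from the GenBank release notes


def locus_name_from_contig_id(contig_id: str) -> str:
    """Return the whole contig ID if it fits, otherwise its last 16 characters."""
    contig_id = contig_id.strip()
    if len(contig_id) <= MAX_LOCUS_LEN:
        return contig_id
    # Keep the tail of the ID, as suggested in the comment above.
    return contig_id[-MAX_LOCUS_LEN:]
```

Note that truncation alone does not guarantee uniqueness: two long contig IDs that share the same last 16 characters would still collide, so a duplicate check after truncation may be worth adding.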

rtobes commented 12 years ago

The locus name would be the contig ID.

We need a small program that adds a systematic ID to the headers of the multifasta contig sequences. The ID would consist of the project prefix followed by a six-digit contig number (SAL000001, SAL000002, ...). See the corresponding issue, "tool for formatting input genome fasta file".
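
A minimal sketch of such a formatting step, in Python. It assumes the systematic ID is prepended to each existing header and that the original header text is kept after a space. The function name `add_systematic_ids` and its parameters are hypothetical, and the actual tool described in the other issue may work differently.

```python
from typing import Iterable, Iterator


def add_systematic_ids(
    fasta_lines: Iterable[str], prefix: str, width: int = 6
) -> Iterator[str]:
    """Prepend PREFIX + zero-padded contig number to every FASTA header line."""
    counter = 0
    for line in fasta_lines:
        if line.startswith(">"):
            counter += 1
            # e.g. prefix "SAL" and counter 1 give "SAL000001"
            systematic_id = f"{prefix}{counter:0{width}d}"
            original_header = line[1:].rstrip("\n")
            # Assumption: keep the original header after the new ID.
            yield f">{systematic_id} {original_header}\n"
        else:
            # Sequence lines are passed through unchanged.
            yield line
```

With the prefix `SAL`, the resulting IDs are nine characters long, so they fit within the 16-character LOCUS name limit mentioned earlier in this thread.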