bg7 / BG7

bacterial genome annotation system
bg7.ohnosequences.com

tool for formatting input genome fasta file #14

Closed marina-manrique closed 12 years ago

marina-manrique commented 12 years ago

@pablopareja It would be good to format the input genome FASTA file so that the FASTA file to annotate always has the same header structure. @rtobes and I have decided this could be an appropriate header:

CONTIG_ID|Former header

where CONTIG_ID is the project name followed by a 6-digit number, for example ECO000001, ECO000002

Doing it this way, you can always get the contig ID by splitting on '|' and taking the first token.

Besides the formatted FASTA file, it'd be good to have a TSV file with each CONTIG_ID and its corresponding former header.
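
To make the proposal concrete, here is a minimal Python sketch of the header format and the TSV mapping described above. It is not the FixFastaHeaders program itself. The function names (`fix_fasta_headers`, `mapping_to_tsv`, `contig_id_of`) are hypothetical, and numbering contigs from 1 in file order is an assumption based on the ECO000001 example.

```python
def fix_fasta_headers(fasta_lines, project_prefix, digits=6):
    """Rewrite FASTA headers as '>CONTIG_ID|former header'.

    Hypothetical helper: contigs are numbered from 1 in file order
    (an assumption), zero-padded to `digits` characters.
    Returns (formatted_lines, mapping), where mapping is a list of
    (contig_id, former_header) pairs.
    """
    formatted = []
    mapping = []
    counter = 0
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            counter += 1
            # e.g. prefix "ECO" and counter 1 give "ECO000001"
            contig_id = f"{project_prefix}{counter:0{digits}d}"
            former = line[1:]
            formatted.append(f">{contig_id}|{former}")
            mapping.append((contig_id, former))
        else:
            # Sequence lines are passed through unchanged.
            formatted.append(line)
    return formatted, mapping


def mapping_to_tsv(mapping):
    """Serialize the (contig_id, former_header) pairs as TSV text."""
    return "".join(f"{cid}\t{former}\n" for cid, former in mapping)


def contig_id_of(header):
    """Recover the contig ID: split on '|' and take the first token."""
    return header.lstrip(">").split("|", 1)[0]


if __name__ == "__main__":
    lines = [">contig one", "ACGT", ">gi|123|second", "TTGA"]
    formatted, mapping = fix_fasta_headers(lines, "ECO")
    print("\n".join(formatted))
    print(mapping_to_tsv(mapping), end="")
```

Because the contig ID is always the first token, the split still works when the former header itself contains '|' characters.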

pablopareja commented 12 years ago

OK, I'm on it! Just one thing: would you mind if the contig IDs had a syntax like ECO1, ECO2, ... ECO1010 ... ECOXXX instead of writing all those ugly zeros?

rtobes commented 12 years ago

With ugly zeros; that is the usual form (sorry)

pablopareja commented 12 years ago

Ok... :P

pablopareja commented 12 years ago

I just committed the changes for all this. You can have a look at the new FixFastaHeaders program (it has its own jar file and it's been incorporated into the BG7 jar file).

marina-manrique commented 12 years ago

What should the part of the executions.xml file for this program look like?

pablopareja commented 12 years ago

You can check the parameters for this program in the wiki:

https://github.com/bg7/BG7/wiki/Fix-fasta-headers

pablopareja commented 12 years ago

I just implemented the corresponding quality control program for 'FixFastaHeaders'. You can find more information in the wiki. So I'm closing this issue now that all this has been implemented/solved ;)

marina-manrique commented 12 years ago

Manual Quality control done.

I've checked the following things (and everything was OK)