gamcil / clinker

Gene cluster comparison figure generator
MIT License

How to create subsets of GenBank flat files keeping features for input into clinker? #8

Closed jelber2 closed 3 years ago

jelber2 commented 3 years ago


This sounds super cool, especially for small-scale synteny analysis (I think). I do not have much experience parsing GenBank flat files, and was curious whether you know of or have any scripts that could take an entire vertebrate genome's GenBank flat file (with annotations) and output, say, 100,000 bases upstream and downstream of a particular annotated gene. [Genome Data Viewer from NCBI has this option, but I don't think it can be accessed programmatically outside of the GUI.] I found some tools and could experiment with GenBank parsing with Biopython, but the one script I looked at does not keep the annotation features (although perhaps I could modify it to do so).

Best, Jean Elbers

jelber2 commented 3 years ago

So the same author also wrote a similar script that I can modify more easily.

gaworj commented 3 years ago

Hi @jelber2, I have encountered a similar problem with clinker input processing of gbk files. I have 6 prokka-annotated bacterial genomes and want to extract some regions from them to visualize using clinker. Can you explain the method or share the script you finally used for parsing/slicing gbk files?

Bests, Jan

jelber2 commented 3 years ago

I had used a script to slice between two desired genes (note that for this to work, you need gene names and both gene and CDS features in the GenBank flat file (gbk)). In the example below, the gbk files had locus_tag and not gene, hence the use of perl to change locus_tag to gene.


perl -pi -e "s/locus_tag/gene/g" 
perl -pi -e "s/CDS/gene/g" 

# requires biopython installed
python2 -r gene1:gene2 -i in.gbk > sliced.gbk

If you don't have gene names, then perhaps you need to combine another tool with this script so that the features are also kept (which, as far as I know, the other tool does not do, but this script does).
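
For the original question (a fixed window around one annotated gene), a minimal Biopython sketch might look like the following. It relies on the fact that slicing a SeqRecord keeps the features that lie entirely inside the slice. The function names (flank_window, find_gene, slice_around_gene) are hypothetical, the code assumes the file has gene features carrying a /gene qualifier, and it is not the script referred to above.

```python
"""Sketch: cut a window around a named gene from a GenBank file, keeping features."""


def flank_window(gene_start, gene_end, seq_len, flank=100_000):
    """Return (start, end) of the gene plus `flank` bases per side, clamped to the record."""
    start = max(0, gene_start - flank)
    end = min(seq_len, gene_end + flank)
    return start, end


def find_gene(record, gene_name):
    """Return the first 'gene' feature whose /gene qualifier matches, or None."""
    for feature in record.features:
        if feature.type == "gene" and gene_name in feature.qualifiers.get("gene", []):
            return feature
    return None


def slice_around_gene(in_path, out_path, gene_name, flank=100_000):
    """Write the window around the first match of `gene_name`; return True if found."""
    from Bio import SeqIO  # third-party dependency: pip install biopython

    for record in SeqIO.parse(in_path, "genbank"):
        gene = find_gene(record, gene_name)
        if gene is None:
            continue
        start, end = flank_window(
            int(gene.location.start), int(gene.location.end), len(record), flank
        )
        # Slicing a SeqRecord keeps features that lie entirely inside the slice
        # and shifts their coordinates; features crossing an edge are dropped.
        window = record[start:end]
        # The GenBank writer needs a molecule_type; "DNA" is an assumed fallback.
        window.annotations["molecule_type"] = record.annotations.get(
            "molecule_type", "DNA"
        )
        SeqIO.write(window, out_path, "genbank")
        return True
    return False
```

Calling slice_around_gene("in.gbk", "sliced.gbk", "gene1") would then write the window, if the gene is found, to sliced.gbk. Genes that are cut by the window boundary are lost, so a larger flank may be needed when a neighbouring gene of interest sits near the edge.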