How to create subsets of GenBank flat files keeping features for input into clinker?

jelber2 commented 3 years ago

Hi,

This sounds super cool! Especially for the purposes of small-scale synteny analysis (I think). I really do not have much experience parsing GenBank flat files, and was curious if you might know or have any scripts that could potentially take an entire vertebrate genome's GenBank flat file (with annotations) and output say a 100,000 bases upstream and dowstream from a particular annotated gene [Genome Data Viewer from NCBI has the option, but I don't think this is possible programmatically access outside of the GUI]. I found some tools and could experiment with GenBank parsing with bioPython (such as https://gist.github.com/jrjhealey/06a7fbdfe495bc5a8824ed152dff7919/), but that particular one does not keep the annotation features (although perhaps I could modify it to do so).

Best, Jean Elbers

jelber2 commented 3 years ago

So the same author also wrote a similar script that I can modify more easily. https://gist.github.com/jrjhealey/2df3c65c7a70cbca4862e94620e4d7b2

gaworj commented 3 years ago

Hi @jelber2 , I have encountered similar problem with clinker input processing of gbk files. I have 6 prokka annotated bacterial genomes and want to extract some regions from them to visualize using clinker. Can you explain the method or share the script you finally used for parsing/slizing gbk files?

Bests, Jan

jelber2 commented 3 years ago

I had used https://gist.github.com/jrjhealey/2df3c65c7a70cbca4862e94620e4d7b2 to slice between two desired genes (note that for this to work, you need to have desired gene names and gene and CDS features in the GenBank flat file (gbk)). In the example below, the gbk files had gene and not locus_tag hence the use of perl to change locus_tag to gene.

wget https://gist.githubusercontent.com/jrjhealey/2df3c65c7a70cbca4862e94620e4d7b2/raw/c3c77e4d759d13bd1e5bfa08e11c0f125b76cf71/slice_genes.py

cp slice_genes.py slice_genes2.py 
perl -pi -e "s/locus_tag/gene/g" slice_genes2.py 
perl -pi -e "s/CDS/gene/g" slice_genes2.py 

# requires biopython installed
python2 slice_genes2.py -r gene1:gene2 -i in.gbk > sliced.gbk

If you don't have gene names, then perhaps you need to combine slice_gene.py and this script Genbank_slicer.py https://gist.github.com/jrjhealey/06a7fbdfe495bc5a8824ed152dff7919 to be able to also keep the features (which Genbank_slicer.py as far as I know does not do) but slice_gene.py does.

gamcil / clinker

How to create subsets of GenBank flat files keeping features for input into clinker? #8