gamcil / clinker

Gene cluster comparison figure generator
MIT License

Support for PROKKA GenBank files? #9

Closed (marade closed 3 years ago)

marade commented 3 years ago

Is there a way to make this work? Or do you plan to add support for these? Thanks!

gamcil commented 3 years ago

Sorry, I've never used PROKKA before. How are the files different? Might this be related to https://github.com/gamcil/clinker/issues/4?

marade commented 3 years ago

Quite possibly it is related to #4. PROKKA is probably the most widely used quick-annotation program right now, so I reckon many people will want this. It appears PROKKA uses BioPerl to generate the files, e.g.

https://metacpan.org/pod/Bio::DB::GenBank

It's really quite easy to run PROKKA and generate them yourself. For your convenience I've attached a GenBank file generated by PROKKA, which I ran on Pseudomonas aeruginosa PAO1, though this is not ideal since it has only one contig, and that contig is named '1'. Attachment: PAO1.zip

Note as well the comments about the GenBank format on the PROKKA home page. Thanks much!

gamcil commented 3 years ago

Could you give some more info about the error you were running into? The file you uploaded seems to load fine on my end.

marade commented 3 years ago

The problem appeared to arise from the contig names generated by a SPAdes genome assembly and then annotated by PROKKA, where clinker would choke on the first (LOCUS) line of the GenBank file, e.g.

LOCUS NODE_1_length_395402_cov_27.667845395402 bp DNA linear

gamcil commented 3 years ago

Okay, this is definitely Biopython's GenBank parser not being able to parse long locus names, as you said. Unfortunately, there doesn't seem to be a way around it, since the parser explicitly counts columns when reading the LOCUS line (i.e. a maximum of 16 characters for the name field unless it steals space from the length field, as discussed here: https://github.com/biopython/biopython/issues/747).
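To check a file for this before running clinker, something like the following minimal sketch might help. It does not use Biopython; it only scans LOCUS lines and flags names longer than the 16 characters mentioned above. The function name find_long_locus_names and the second example line (contig_1) are made up for illustration.

```python
def find_long_locus_names(genbank_lines, max_len=16):
    """Return LOCUS names longer than max_len characters.

    max_len=16 follows the limit quoted in this thread; it is an
    assumption here, not a value read from the parser itself.
    """
    too_long = []
    for line in genbank_lines:
        if not line.startswith("LOCUS"):
            continue
        fields = line.split()
        # fields[0] is "LOCUS"; fields[1] is the name (possibly fused
        # with the sequence length when the name overflows its column).
        if len(fields) > 1 and len(fields[1]) > max_len:
            too_long.append(fields[1])
    return too_long


example = [
    "LOCUS NODE_1_length_395402_cov_27.667845395402 bp DNA linear",
    "LOCUS contig_1 395402 bp DNA linear",  # hypothetical short name
]
print(find_long_locus_names(example))
```

Only the first example line is flagged; note that its name has run into the length field, which is why the token ends in 395402.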

Unless I can get around to completely switching from the BioPython parser to something else, I don't think there's much I can do about this I'm afraid. In the meantime, could you try the --centre flag in PROKKA to rename your contigs to be NCBI compliant (as mentioned in the PROKKA readme), then run clinker again?
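A different workaround from the PROKKA flag, if it is easier in your pipeline, would be to shorten the contig names in the assembly FASTA before annotating it. The sketch below is only an illustration of that idea: rename_contigs, the contig_N naming scheme, and the example sequences are all hypothetical and not part of clinker or PROKKA.

```python
def rename_contigs(fasta_text, prefix="contig"):
    """Replace each FASTA header with a short name such as contig_1.

    Returns the rewritten FASTA text and a dict mapping each new name
    back to the original header, so results can be traced afterwards.
    """
    renamed = []
    mapping = {}
    count = 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            count += 1
            new_name = f"{prefix}_{count}"
            mapping[new_name] = line[1:].strip()
            renamed.append(f">{new_name}")
        else:
            # Sequence lines are passed through unchanged.
            renamed.append(line)
    return "\n".join(renamed) + "\n", mapping


# Made-up example input, with SPAdes-style headers.
fasta = (
    ">NODE_1_length_395402_cov_27.667845\nACGT\n"
    ">NODE_2_length_10_cov_1.0\nGG\n"
)
new_fasta, name_map = rename_contigs(fasta)
print(new_fasta)
print(name_map)
```

Keeping the mapping around means the short names in any downstream figure can still be traced back to the original assembly contigs.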

marade commented 3 years ago

I'll try this when I get a chance, though if #10 gets solved this will no longer matter to me, since I try to avoid GenBank format whenever possible.

marade commented 3 years ago

The good news is using the --compliant switch for PROKKA apparently allows the script to continue beyond where it would previously crash, but see #21 mentioned above.

gamcil commented 3 years ago

Will close this one too since the PROKKA flag works and GFF support has been added with v0.0.10.