Closed marade closed 3 years ago
Currently clinker uses BioPython for parsing files, which does not yet have the ability to parse GFF. Potentially in the future I'll swap over to the parsing library I wrote for cblaster which can handle either, but it would take a pretty big reworking so not planned at the moment.
May I suggest the gffutils module for parsing GFF files? It's fairly straightforward and has worked great for me.
http://daler.github.io/gffutils/
It appears they intend to integrate this into BioPython anyway:
Oh cool, I'll look into it. Thanks!
I've added an initial attempt at GFF3 parsing using gffutils in the gff3 branch if you want to try that out. It looks for GFF files (extensions .gtf, .gff, .gff3) as well as GenBank, and will look for a corresponding FASTA file of the same name (extensions .fa, .fsa, .fna, .fasta, .faa).
Edit: Note that, as with GenBank files, GFF files with multiple regions are treated as gene clusters with multiple loci and will be drawn on the same line in the visualisation.
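For anyone curious how the GFF/FASTA pairing described above could work, here is a minimal sketch using only the standard library. It is not clinker's actual code: the function name `find_fasta` and the order in which extensions are tried are assumptions for illustration; only the extension lists come from the comment above.

```python
from pathlib import Path
from typing import Optional

# Extensions taken from the comment above.
GFF_SUFFIXES = {".gtf", ".gff", ".gff3"}
FASTA_SUFFIXES = (".fa", ".fsa", ".fna", ".fasta", ".faa")


def find_fasta(gff_path: Path) -> Optional[Path]:
    """Return the first FASTA file sharing the GFF file's base name, if any.

    Hypothetical helper: the search order is an assumption, not clinker's.
    """
    if gff_path.suffix.lower() not in GFF_SUFFIXES:
        raise ValueError(f"Not a GFF file: {gff_path}")
    for suffix in FASTA_SUFFIXES:
        candidate = gff_path.with_suffix(suffix)
        if candidate.is_file():
            return candidate
    return None
```

With `cluster.gff3` and `cluster.fna` in the same folder, `find_fasta(Path("cluster.gff3"))` would return the path to `cluster.fna`; with no matching FASTA it returns `None`.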
Gosh that was fast. Can't wait to try it!
It appears to have processed the GFF successfully at least:
[21:23:18] INFO - Generating results summary...
[21:23:18] INFO - Writing alignments to output
[21:23:18] INFO - Building clustermap.js visualisation
[21:23:18] INFO - Writing to: plot
/usr/local/lib/python3.6/dist-packages/clinker/align.py:356: RuntimeWarning: invalid value encountered in true_divide
  matrix /= matrix.max()
Traceback (most recent call last):
  File "/usr/local/bin/clinker", line 8, in
I have the same issue as @marade ("ValueError: Distance matrix 'X' must be symmetric."). I'm running clinker on a server with .gbk files, and this error happens every time I run it. It is likely caused by the formatting of the .gbk files, as in other people's issues, but I haven't identified the cause yet.
@kforcone: Could you open a new issue and upload the files causing you the error?
@marade Sorry for taking so long on this, I only got around to reworking it today. I was having issues with GFF/FASTA files of specific regions downloaded from NCBI with their graphic viewer: the start/end of features in those GFF files are relative to the entire parent scaffold, not the specific extracted region. The parser now accounts for that too.
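To illustrate the coordinate problem just described, here is a rough sketch of shifting scaffold-relative feature positions onto an extracted region. This is not the parser's actual code. It assumes the extracted FASTA header carries the range as `accession:start-end` (for example `>NC_000913.3:10001-20000`), and the names `Feature`, `region_start_from_header` and `to_region_coords` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Feature:
    start: int  # 1-based, inclusive, as in GFF
    end: int


def region_start_from_header(header: str) -> int:
    """Read the region start from a header like '>ACC:10001-20000 ...'.

    Assumption: the range is encoded as ':start-end' after the accession.
    Returns 1 (no shift needed) when no range is present.
    """
    match = re.match(r">?(\S+):(\d+)-(\d+)", header)
    return int(match.group(2)) if match else 1


def to_region_coords(feature: Feature, region_start: int) -> Feature:
    """Convert scaffold-relative coordinates to region-relative ones."""
    offset = region_start - 1
    return Feature(feature.start - offset, feature.end - offset)
```

For a region starting at scaffold position 10001, a feature annotated at 10501..11400 becomes 501..1400, which can then be used to slice the extracted sequence directly.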
Anyway, I've merged the GFF+FASTA parser into master now if you'd like to try it out and see if you have any issues.
The issue about the distance matrix should also have been fixed already by https://github.com/gamcil/clinker/pull/16/commits/4f4c53dfc34c49503f1263754c96ecfddf5104ee.
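For readers wondering how the RuntimeWarning and the "must be symmetric" error in the log above might be connected, here is a pure-Python sketch of one plausible mechanism. It is an assumption, not a description of what the linked commit does: if every similarity value is zero, dividing by the maximum yields NaN in NumPy (plain Python would raise ZeroDivisionError instead), and since NaN never equals itself, a symmetry check based on element-wise equality fails. The helpers `normalise` and `is_symmetric` are hypothetical.

```python
def normalise(matrix):
    """Scale values by the matrix maximum, guarding against an all-zero matrix."""
    peak = max(max(row) for row in matrix)
    if peak == 0:
        # Nothing to scale; return a copy instead of dividing by zero.
        return [row[:] for row in matrix]
    return [[value / peak for value in row] for row in matrix]


def is_symmetric(matrix):
    """Element-wise equality check, similar in spirit to a distance-matrix validator."""
    size = len(matrix)
    return all(
        matrix[i][j] == matrix[j][i]
        for i in range(size)
        for j in range(size)
    )


# A matrix of NaNs looks symmetric to the eye but fails the check,
# because NaN compares unequal to everything, including itself.
nan = float("nan")
print(is_symmetric([[nan, nan], [nan, nan]]))  # False
```

Guarding the division as in `normalise` is one simple way to keep NaN out of the matrix when no input pair produced any alignment.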
Cool, I will try this out as soon as I can.
This is now added in clinker v0.0.10 so I'll close the issue - if you run into any bugs feel free to reopen it.
Getting back to this... It looks like it processes GFF gene+CDS features just fine, but it chokes on gene+tRNA, etc., with an error like this:
ValueError: Found no CDS features in gnl|Prokka|blahpB1A1_76 [../trim-assemble2/blahpB1A1/prokka/blahpB1A1.gff]
So it probably needs some logic in there to deal with contigs where you don't get gene+CDS, because there can be many of those.
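As a rough idea of the kind of logic meant here, the sketch below groups CDS records by contig from raw GFF3 text and skips contigs that have none (for example, ones carrying only gene+tRNA) instead of raising. It is not clinker's parser: `cds_by_contig` and `usable_contigs` are hypothetical names, and whether to skip or fail is of course the maintainer's call.

```python
def cds_by_contig(gff_text):
    """Map each seqid to its CDS records; contigs without CDS get an empty list."""
    contigs = {}
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # embedded sequence section (as in Prokka output) follows
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a valid GFF3 feature line
        seqid, _source, ftype, start, end, _score, strand = cols[:7]
        records = contigs.setdefault(seqid, [])
        if ftype == "CDS":
            records.append((int(start), int(end), strand))
    return contigs


def usable_contigs(gff_text):
    """Keep only contigs with at least one CDS, reporting the ones skipped."""
    kept = {}
    for seqid, records in cds_by_contig(gff_text).items():
        if records:
            kept[seqid] = records
        else:
            print(f"Skipping {seqid}: no CDS features")
    return kept
```

A file where one contig has gene+CDS and another has only gene+tRNA would then yield just the first contig, with a message about the second, rather than a ValueError for the whole run.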
Since GenBank format isn't particularly user-friendly, please consider adding support for alternate input using GFF + FAA files. Your work on this tool is much appreciated.