gamcil / clinker

Gene cluster comparison figure generator
MIT License

Support for GFF+FAA files? #10

Closed marade closed 3 years ago

marade commented 3 years ago

Since GenBank format isn't particularly user-friendly, please consider adding support for alternate input using GFF + FAA files. Your work on this tool is much appreciated.

gamcil commented 3 years ago

Currently clinker uses BioPython for parsing files, which does not yet have the ability to parse GFF. In the future I might swap over to the parsing library I wrote for cblaster, which can handle either, but that would take a pretty big reworking, so it's not planned at the moment.

marade commented 3 years ago

May I suggest the gffutils module for parsing GFF files? It's fairly straightforward and has worked great for me.

http://daler.github.io/gffutils/

It appears they intend to integrate this into BioPython anyway:

https://biopython.org/wiki/GFF_Parsing

gamcil commented 3 years ago

Oh cool, I'll look into it. Thanks!

gamcil commented 3 years ago

I've added an initial attempt at GFF3 parsing using gffutils in the gff3 branch if you want to try that out. It looks for GFF files (extensions .gtf, .gff, .gff3) as well as GenBank files, and for each GFF file it will look for a corresponding FASTA file of the same name (extensions .fa, .fsa, .fna, .fasta, .faa).

Edit: Note that, as with GenBank files, GFF files with multiple regions are treated as gene clusters with multiple loci and will be drawn on the same line in the visualisation.
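For anyone preparing input files, the pairing rule described above can be sketched roughly as follows. This is only an illustration of "same base name, different extension" using the extensions listed in the comment; the function name `find_fasta_for` is hypothetical and this is not clinker's actual code.

```python
from pathlib import Path
from typing import Optional

# Extensions as listed in the comment above.
GFF_SUFFIXES = {".gtf", ".gff", ".gff3"}
FASTA_SUFFIXES = (".fa", ".fsa", ".fna", ".fasta", ".faa")


def find_fasta_for(gff_path: Path) -> Optional[Path]:
    """Return the first FASTA file sharing the GFF file's base name, if any.

    Hypothetical helper for illustration; not part of clinker.
    """
    if gff_path.suffix.lower() not in GFF_SUFFIXES:
        raise ValueError(f"Not a GFF file: {gff_path}")
    for suffix in FASTA_SUFFIXES:
        # Swap only the extension, keeping directory and base name.
        candidate = gff_path.with_suffix(suffix)
        if candidate.is_file():
            return candidate
    return None
```

With this sketch, `cluster.gff` sitting next to `cluster.fna` would be paired, while a GFF file with no same-named FASTA file would return `None`.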

marade commented 3 years ago

Gosh that was fast. Can't wait to try it!

marade commented 3 years ago

It appears to have processed the GFF successfully at least:

[21:23:18] INFO - Generating results summary...
[21:23:18] INFO - Writing alignments to output
[21:23:18] INFO - Building clustermap.js visualisation
[21:23:18] INFO - Writing to: plot
/usr/local/lib/python3.6/dist-packages/clinker/align.py:356: RuntimeWarning: invalid value encountered in true_divide
  matrix /= matrix.max()
Traceback (most recent call last):
  File "/usr/local/bin/clinker", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/clinker/main.py", line 153, in main
    hide_alignment_headers=args.hide_aln_headers,
  File "/usr/local/lib/python3.6/dist-packages/clinker/main.py", line 77, in clinker
    plot_clusters(globaligner, output=None if plot is True else plot)
  File "/usr/local/lib/python3.6/dist-packages/clinker/plot.py", line 114, in plot_clusters
    data = clusters.to_data()
  File "/usr/local/lib/python3.6/dist-packages/clinker/align.py", line 201, in to_data
    for i in self.order(i=i, method=method)
  File "/usr/local/lib/python3.6/dist-packages/clinker/align.py", line 371, in order
    linkage = hierarchy.linkage(squareform(matrix), method=method)
  File "/usr/local/lib/python3.6/dist-packages/scipy/spatial/distance.py", line 2184, in squareform
    is_valid_dm(X, throw=True, name='X')
  File "/usr/local/lib/python3.6/dist-packages/scipy/spatial/distance.py", line 2260, in is_valid_dm
    'symmetric.') % name)
ValueError: Distance matrix 'X' must be symmetric.
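One plausible reading of this output, offered as a guess rather than a confirmed diagnosis: the RuntimeWarning on `matrix /= matrix.max()` suggests the matrix maximum was zero (or the matrix already held NaN), so the division produced NaN values. Since NaN never compares equal to itself, an element-wise symmetry check would then fail even though the matrix "looks" symmetric. The plain-Python sketch below illustrates that effect and one possible guard; `normalise` and `is_symmetric` are hypothetical names, and `math.nan` stands in for what NumPy would produce from 0/0.

```python
import math


def normalise(matrix):
    """Scale a square matrix by its maximum value.

    If the maximum is 0 the matrix is returned unscaled, avoiding 0/0.
    Hypothetical guard for illustration; not clinker's actual fix.
    """
    peak = max(max(row) for row in matrix)
    if peak == 0:
        return [row[:] for row in matrix]
    return [[value / peak for value in row] for row in matrix]


def is_symmetric(matrix):
    """Element-wise symmetry check, in the spirit of matrix == matrix.T."""
    size = len(matrix)
    return all(
        matrix[i][j] == matrix[j][i]
        for i in range(size)
        for j in range(size)
    )


# NaN != NaN, so a matrix containing NaN fails the symmetry check.
nan_matrix = [[0.0, math.nan], [math.nan, 0.0]]
print(is_symmetric(nan_matrix))

# An all-zero matrix passes through the guard unchanged and stays symmetric.
print(is_symmetric(normalise([[0.0, 0.0], [0.0, 0.0]])))
```

If that reading is right, an all-zero matrix would arise when no pair of clusters shares any aligned genes, for example when no protein sequences were extracted from the input.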

kforcone commented 3 years ago

I have the same issue as @marade ("ValueError: Distance matrix 'X' must be symmetric."). I'm running clinker on a server with .gbk files, and this error happens every time I run it. It could be the formatting of the .gbk files, as in other people's issues, but I haven't identified the cause yet.

gamcil commented 3 years ago

@kforcone: Could you open a new issue and upload the files causing you the error?

gamcil commented 3 years ago

@marade Sorry for taking so long on this, I only got around to reworking it today. I was having issues with GFF/FASTA files of specific regions downloaded from NCBI with their graphic viewer: the start/end of features in those GFF files are relative to the entire parent scaffold, not the specific extracted region. The parser now accounts for that too.

Anyway, I've merged the GFF+FASTA parser into master now if you'd like to try it out and see if you have any issues.

The issue about the distance matrix should also have been fixed already by https://github.com/gamcil/clinker/pull/16/commits/4f4c53dfc34c49503f1263754c96ecfddf5104ee.
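To make the NCBI coordinate problem mentioned above concrete, here is a minimal sketch of the kind of adjustment involved. It assumes the region's start on the parent scaffold is available from the GFF3 `##sequence-region seqid start end` directive, which may not hold for every export; both function names are hypothetical and this is not the code that was merged.

```python
from typing import Optional, Tuple


def region_start_from_pragma(line: str) -> Optional[int]:
    """Read the region start from a '##sequence-region seqid start end' line."""
    parts = line.split()
    if len(parts) == 4 and parts[0] == "##sequence-region":
        return int(parts[2])
    return None


def to_region_coordinates(start: int, end: int, region_start: int) -> Tuple[int, int]:
    """Convert 1-based scaffold coordinates to 1-based region coordinates.

    A feature starting exactly at region_start maps to position 1 of the
    extracted FASTA sequence.
    """
    offset = region_start - 1
    return start - offset, end - offset


region_start = region_start_from_pragma("##sequence-region scaffold_1 1001 5000")
print(to_region_coordinates(1501, 1800, region_start))
```

Without the shift, slicing the extracted FASTA sequence at the scaffold-relative positions would either run off the end of the sequence or pull out the wrong bases.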

marade commented 3 years ago

Cool, I will try this out as soon as I can.

gamcil commented 3 years ago

This is now added in clinker v0.0.10 so I'll close the issue - if you run into any bugs feel free to reopen it.

marade commented 3 years ago

Getting back to this... It looks like it processes GFF gene+CDS features just fine, but it chokes on gene+tRNA, etc., with an error like this:

ValueError: Found no CDS features in gnl|Prokka|blahpB1A1_76 [../trim-assemble2/blahpB1A1/prokka/blahpB1A1.gff]

So probably need some logic in there to deal with situations where you don't get gene+CDS, because there can be many of those.
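A rough sketch of the kind of logic I mean, assuming plain tab-separated GFF3 lines as Prokka writes them: collect only CDS features per sequence, so that contigs carrying nothing but tRNA/rRNA genes simply don't show up and can be skipped instead of raising an error. The function name `cds_by_sequence` is hypothetical, and this is not how clinker currently handles it.

```python
from collections import defaultdict


def cds_by_sequence(gff_lines):
    """Group CDS features by sequence ID, ignoring every other feature type.

    Sequences without any CDS feature are absent from the result, so the
    caller can skip them rather than fail. Hypothetical helper for illustration.
    """
    grouped = defaultdict(list)
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # Prokka appends the sequences after this marker
        if not line.strip() or line.startswith("#"):
            continue  # blank lines, comments and directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        seqid, _source, feature_type, start, end = fields[:5]
        strand = fields[6]
        if feature_type == "CDS":
            grouped[seqid].append((int(start), int(end), strand))
    return dict(grouped)
```

A caller could then iterate over the returned dictionary and log a warning for any sequence ID that appears in the GFF but not in the result.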