gamcil / clinker

Gene cluster comparison figure generator
MIT License

Clinker usage #84

Closed EmaBoni closed 2 years ago

EmaBoni commented 2 years ago

Sorry for the dumb question: I am trying to use clinker, but I cannot analyze files within a folder.

I have cloned clinker from git and installed it via pip, as indicated in the readme, so now I have a clinker folder that includes the examples. I have created a dedicated environment and I am working in a Jupyter notebook. This is the code that I am using:

import sys
sys.path.append('C:\\Users\\Emanuele Boni\\clinker')
import clinker
import os
proj_dir = 'C:\\Users\\Emanuele Boni\\clinker\\examples'
os.listdir(proj_dir)

This returns the content of the folder as ['A. alliaceus CBS 536.65.gbk', 'A. burnettii MST-FP2249.gbk', 'A. mulundensis DSM 5745.gbk', 'A. versicolor CBS 583.65.gbk', 'note.md', 'P. vexata CBS 129021.gbk']

However, if I try to run clinker proj_dir/* -p, I get the error message SyntaxError: invalid syntax, pointing at proj_dir.

I have tried several things: creating a subfolder in the folder where I am running the notebook, writing the folder name as string and as variable (with and without quotes), running the lines of code directly from the command line instead of inside the notebook. None of these worked. I think I am not considering something very trivial, but I cannot figure out what it is.

Thank you for your help and for developing this tool! Emanuele

gamcil commented 2 years ago

Hi @EmaBoni, I haven't tried using clinker inside a Jupyter notebook before so I'm not exactly sure about that - from what you've listed there it looks like the command should be correct. Would it be possible to post the full error log from when you try to run the program inside the notebook or in the command line?

EmaBoni commented 2 years ago

Hello @gamcil, thank you for your answer! Here is the full attempt from the command line (NB: it is the Windows command line, not Linux; might this be an issue?). (Clinker) is my Python environment, created with Anaconda Navigator.

`(Clinker) C:\Users\Emanuele Boni\clinker>python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 05:59:00) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import clinker
>>> clinker
<module 'clinker' from 'C:\\Users\\Emanuele Boni\\clinker\\clinker\\__init__.py'>
>>> import os
>>> os.listdir('C:\\Users\\Emanuele Boni\\clinker\\examples')
['A. alliaceus CBS 536.65.gbk', 'A. burnettii MST-FP2249.gbk', 'A. mulundensis DSM 5745.gbk', 'A. versicolor CBS 583.65.gbk', 'note.md', 'P. vexata CBS 129021.gbk']
>>> clinker 'C:\Users\Emanuele Boni\clinker\examples\*' -p
  File "<stdin>", line 1
    clinker 'C:\Users\Emanuele Boni\clinker\examples\*' -p
            ^
SyntaxError: invalid syntax
>>> clinker 'C:\Users\Emanuele Boni\clinker\examples'+'/' -p
  File "<stdin>", line 1
    clinker 'C:\Users\Emanuele Boni\clinker\examples'+'/' -p
            ^
SyntaxError: invalid syntax
>>> clinker examples/ -p
  File "<stdin>", line 1
    clinker examples/ -p
            ^
SyntaxError: invalid syntax`

NB: the '^' arrow points at the first character after clinker

gamcil commented 2 years ago

Ah, you are trying to run clinker from within the Python interactive shell, which then rejects the command as invalid syntax (since it isn't Python code). clinker should be run from the command line itself. Try exiting the Python shell (e.g. with exit()) and running the exact same command, e.g. clinker 'C:\Users\Emanuele Boni\clinker\examples\*' -p, and it should work.
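For anyone who does want to start clinker from a notebook or a Python script, one option is to launch the command-line program as a subprocess rather than typing the command as Python code. The sketch below is only an illustration under some assumptions: `build_clinker_command` and `run_clinker` are hypothetical helper names (not part of clinker), the `clinker` executable is assumed to be on the PATH of the active environment, and the only flag used by default is `-p`, as in the commands above. The file list is expanded in Python with `glob`, so the result does not depend on how the shell treats `*` or spaces in the path.

```python
import glob
import os
import subprocess


def build_clinker_command(folder, pattern="*.gbk", extra_args=("-p",)):
    """Build the argument list for a clinker run (hypothetical helper).

    Files matching `pattern` inside `folder` are expanded here, in Python,
    and passed as separate arguments, so spaces in paths need no quoting.
    """
    # glob.escape protects special characters in the folder name itself.
    files = sorted(glob.glob(os.path.join(glob.escape(folder), pattern)))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern!r} in {folder!r}")
    return ["clinker", *files, *extra_args]


def run_clinker(folder, **kwargs):
    """Run clinker as a separate process (assumes `clinker` is on PATH)."""
    cmd = build_clinker_command(folder, **kwargs)
    # check=True raises CalledProcessError if clinker exits with an error.
    return subprocess.run(cmd, check=True)


# Example call (not executed here):
# run_clinker(r"C:\Users\Emanuele Boni\clinker\examples")
```

Passing the arguments as a list (rather than one command string) means no shell is involved, which avoids the quoting differences between the Windows and Linux command lines.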

EmaBoni commented 2 years ago

This clarifies a lot, thank you! I managed to run the pipeline on the examples (resulting image is as expected) and on my files. This is what I get:

[Image: clinker plot for the two input files]

I am a bit uncertain about the result because I expected the sequences to have much higher identity (more groups matching, higher identity percentage for the group that is correctly recognized). The alignment is done on the protein sequences, is that correct? I will double check the gene sequences to make sure there are no errors there. Any idea of other things that I am not considering when aligning these two files?

gamcil commented 2 years ago

Yes, the alignments are done on the protein sequences. However, if they are missing, clinker will try to translate the regions corresponding to the gene/CDS coordinates in the input file. I'm not sure what is causing the issue in your case; would you be able to upload your files?
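A quick way to see whether a GenBank file already carries protein translations is to count CDS features and `/translation=` qualifiers in the text. The following is a rough, standard-library-only sketch, not how clinker itself reads files: `count_cds_and_translations` is a hypothetical helper, and it assumes the usual GenBank flat-file layout (feature keys indented by five spaces, qualifiers on indented lines starting with `/`).

```python
import re

# Assumed GenBank layout: feature keys start after 5 spaces,
# qualifiers sit on indented lines beginning with "/".
CDS_LINE = re.compile(r"^ {5}CDS\s")
TRANSLATION_LINE = re.compile(r"^\s+/translation=")


def count_cds_and_translations(genbank_text):
    """Return (number of CDS features, number of /translation qualifiers)."""
    n_cds = 0
    n_translations = 0
    for line in genbank_text.splitlines():
        if CDS_LINE.match(line):
            n_cds += 1
        elif TRANSLATION_LINE.match(line):
            n_translations += 1
    return n_cds, n_translations


# Example call (file name is a placeholder):
# with open("my_cluster.gbk") as handle:
#     print(count_cds_and_translations(handle.read()))
```

If the two counts differ, some CDS features have no stored translation, and those are the ones that would have to be translated from their coordinates.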

EmaBoni commented 2 years ago

Ok! Yes, protein sequences are annotated in my files. Here are the two files that I am using: EBoni.zip

EmaBoni commented 2 years ago

Thanks a lot for your time and for your help!

gamcil commented 2 years ago

Just had some time to have a look at this. It seems the files are read in correctly (it is picking up the AA translations just fine), but the alignments are falling below the default identity threshold (30%) and so are getting filtered out. You can lower this threshold using the -i/--identity argument, e.g. clinker EBoni/*.gb -i 0.2 -p. That command gives me this:

[Image: screenshot of the resulting plot]
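To illustrate why links can disappear from the plot, here is a toy sketch of threshold filtering. It is not clinker's code: `filter_by_identity` is a hypothetical function, the gene names and identity values are made up, and whether clinker treats a hit exactly at the threshold as passing is an assumption (the sketch keeps it).

```python
def filter_by_identity(hits, threshold=0.3):
    """Keep hits whose identity (a fraction between 0 and 1) is >= threshold.

    Each hit is a (query, target, identity) tuple. The 0.3 default mirrors
    the 30% default mentioned above; the comparison is assumed inclusive.
    """
    return [hit for hit in hits if hit[2] >= threshold]


# Made-up example values, for illustration only.
hits = [
    ("clusterA_gene1", "clusterB_gene1", 0.19),
    ("clusterA_gene2", "clusterB_gene2", 0.45),
    ("clusterA_gene3", "clusterB_gene3", 0.27),
]

kept_default = filter_by_identity(hits)                  # 30% cutoff
kept_lowered = filter_by_identity(hits, threshold=0.18)  # lowered cutoff

print(len(kept_default), len(kept_lowered))
```

With these made-up values, only one of the three hits survives the 30% cutoff, while all three pass once the cutoff is lowered to 0.18.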

EmaBoni commented 2 years ago

Thanks a lot! We found that an identity threshold of 18-19% was ideal for us. You have been extremely helpful, thanks again for your time and for this very useful tool! Kind regards, Emanuele