gamcil / clinker

Gene cluster comparison figure generator
MIT License

Problem with -gf and -cm parameters #106

Open mattbird567 opened 7 months ago

mattbird567 commented 7 months ago

Hello, I am attempting to colour and group sections of a large pESI plasmid. I have 7 different groups along the plasmid backbone. I have assigned the genes to these groups in one CSV file, and listed the colour each group should have in a second CSV file. The problem I am facing is that when I create the Clinker figure, some groups are coloured incorrectly, and certain genes are not placed into the groups I have defined at all. Is there any way to force Clinker to group and colour according to my parameters? I have attached the files below in case you want to see what I have been trying, along with the command I am running.

clinker rep_H211340782_plas_col.gbk rep_H204960951_plas_col.gbk rep_H153520460_plas_col.gbk rep_H213340902_plas_col.gbk -gf genes.csv -cm colours.csv -p -ufo

colours.csv genes.csv

I have also uploaded the GBK files (as TXT files, since GBK attachments aren't supported): rep_H153520460_plas_col.txt rep_H204960951_plas_col.txt rep_H211340782_plas_col.txt rep_H213340902_plas_col.txt

mhagar commented 6 months ago

I was trying to figure out how to use the -gf feature so I looked at your files as examples.

I don't know if this is helpful, but the first genbank file you provided (rep_H153520460_plas_col.txt) doesn't actually have any instances of NCLOOJ_* in it - is that perhaps why some genes don't get put into the groups you defined?

Edit: Ah, never mind, I've figured out how this works and see why only one of your genbank files matches the feature table :)) Disregard this

Amgomez96 commented 6 months ago

Hello, I had the same problem. I have 24 groups, and when I run it without -gf it assigns different colors to the corresponding groups. When I run it with -gf and the .csv, only 5 groups are colored, and those functions are assigned incorrectly to genes that do not have them.

code: clinker *gb -p -gf output4.csv

without -gf: Screen Shot 2023-12-19 at 13 13 44

with the csv Screen Shot 2023-12-19 at 13 14 48

csv file: output4.csv

Can someone help me understand why not all the functions are being assigned to their respective genes?

Thanks

gamcil commented 6 months ago

Unfortunately I think you are both hitting some weirdness with how the genes get grouped, and I don't have a good resolution for it. Currently the groups are computed by finding disjoint sets of genes based on the gene-gene alignments: if there is an alignment between genes in two different defined function groups, those groups will be merged into a single group, leading to the incorrect function/colour. You could try raising the identity/similarity cutoffs to avoid low-quality alignments that lead to bad merges (e.g. the correct match may have 100% identity, but some gene from another group has a ~30% identity match). Alternatively, if you specify the function for EVERY gene you are looking at and use the --no_align flag (i.e. no alignments are performed), the colours should work, but the gene-gene links will be lost.
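To make the merging behaviour described above more concrete, here is a minimal Python sketch of disjoint-set grouping over gene-gene alignments. It is not clinker's actual code. The gene names, function labels, identity values and the helpers `group_genes` and `conflicting_groups` are all hypothetical and only illustrate how a single low-identity alignment can join two function groups, and how a stricter cutoff keeps them apart.

```python
from collections import defaultdict


def group_genes(alignments, min_identity):
    """Return disjoint sets of genes linked by alignments at or above min_identity.

    Hypothetical helper: a plain union-find, not clinker's implementation.
    """
    parent = {}

    def find(gene):
        # Register unseen genes as their own root, then walk up to the root
        # while shortening the path (path halving).
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for query, target, identity in alignments:
        root_q, root_t = find(query), find(target)
        # Only alignments that pass the cutoff are allowed to merge sets.
        if identity >= min_identity and root_q != root_t:
            parent[root_t] = root_q

    groups = defaultdict(set)
    for gene in list(parent):
        groups[find(gene)].add(gene)
    return sorted(sorted(members) for members in groups.values())


def conflicting_groups(groups, functions):
    """Return the groups whose members carry more than one user-defined function."""
    return [g for g in groups if len({functions[gene] for gene in g}) > 1]


# Hypothetical user-defined functions (the kind of mapping a -gf file would hold).
functions = {
    "repA_1": "replication",
    "repA_2": "replication",
    "traB_1": "conjugation",
    "traB_2": "conjugation",
}

# Hypothetical alignments as (query, target, identity) tuples.
alignments = [
    ("repA_1", "repA_2", 1.00),  # correct match within "replication"
    ("traB_1", "traB_2", 0.98),  # correct match within "conjugation"
    ("repA_1", "traB_2", 0.31),  # weak cross-group hit that causes the bad merge
]

loose = group_genes(alignments, min_identity=0.3)
strict = group_genes(alignments, min_identity=0.5)

print("cutoff 0.3:", loose, "conflicts:", conflicting_groups(loose, functions))
print("cutoff 0.5:", strict, "conflicts:", conflicting_groups(strict, functions))
```

In this toy input, the 0.31 hit is enough to pull all four genes into one set at the lower cutoff, so both functions end up sharing a single group and colour. At the higher cutoff the weak hit is ignored and the two function groups stay separate. The exact option name for the identity cutoff is not given in this thread, so check `clinker --help` before relying on it.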