fwhelan / coinfinder

A tool for the identification of coincident (associating and dissociating) genes in pangenomes.
GNU General Public License v3.0

Duplicate row names error in "Read in gene_pa file.." step #51

Closed JoaodcPires closed 1 year ago

JoaodcPires commented 3 years ago

Hi Fiona,

I have successfully run coinfinder using a gene_presence_absence.csv obtained by running Roary on prokka-annotated genomes. During this process, we observed that some cliques of interest in the network were surrounded by genes without annotation (e.g., only group_1234, group_3456, etc.). To get around this, we decided to use bakta (https://github.com/oschwengers/bakta) for genome annotation, as it should annotate bacterial genomes better, which would likely also result in a better-annotated coincidence network.

Unfortunately, we ran into an error at the very end of coinfinder during R:

[1] "Read in gene_pa file.."
Error in .rowNamesDF<-(x, value = value) :
  duplicate 'row.names' are not allowed
Calls: rownames<- ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<-
In addition: Warning message:
non-unique values when setting 'row.names': 'DeoR.family.transcriptional.regulator', 'DNA.invertase', 'DNA.packaging.protein', 'FlhB.HrpN.YscU.SpaS.family.protein', 'Glutaredoxin.1', 'Serine.threonine.protein.phosphatase'
Execution halted

I went back to the Roary gene_presence_absence.csv obtained from the bakta annotations and I identified the entries where this occurs. Some examples:

[Two screenshots of the affected rows: "Screenshot 2021-11-02 at 17 36 20" and "Screenshot 2021-11-02 at 17 34 40"]

One of the first steps of coinfinder is to format the gene_presence_absence.csv ("Formating Roary output for input into coinfinder..."). Which columns does this step use (Gene, Non.unique.Gene.name, Annotation), and could coinfinder identify these duplicate names and deal with them internally?

Otherwise, would you have any suggestions on how to modify the gene_presence_absence file directly?

Thanks in advance!

fwhelan commented 2 years ago

Hi Jpdcp, Hmm, very interesting. The step that's erroring in the R code is where the Gene column is set as the row names of a dataframe. That suggests to me that the Gene column may contain more than one row with the value "DeoR.family.transcriptional.regulator" (for example). However, it looks from your output as though there is only one. It does seem a little suspect that Non.unique.Gene.name is empty/null in at least one row of each of these examples; I wonder if that could have something to do with it.

The format roary step takes the gene_p_a.csv input file and re-formats it as a gene\tgenome\n output file called "coincident-input-edges.csv". This is done line by line, so any duplicate gene names in the input won't be flagged in this step.
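To illustrate why a line-by-line conversion would not notice repeated names, here is a minimal Python sketch of that kind of reformatting. It is not coinfinder's actual code. The function name `roary_to_edges` is made up for illustration, and the default of 14 leading metadata columns is an assumption about Roary's usual layout (the toy data below uses 3 to stay short).

```python
import csv
import io

# Assumption: a standard Roary gene_presence_absence.csv has 14 metadata
# columns before the per-genome columns. Adjust if your file differs.
N_METADATA_COLUMNS = 14


def roary_to_edges(handle, n_meta=N_METADATA_COLUMNS):
    """Yield (gene, genome) pairs, one input row at a time.

    Hypothetical helper: each row is handled independently, so two rows
    that share a gene name simply produce edges with the same gene label.
    """
    reader = csv.reader(handle)
    header = next(reader)
    genomes = header[n_meta:]
    for row in reader:
        gene = row[0]
        for genome, cell in zip(genomes, row[n_meta:]):
            if cell.strip():  # a non-empty cell means the gene is present
                yield gene, genome


# Toy input with only 3 metadata columns and a repeated gene name.
sample = io.StringIO(
    "Gene,Non-unique Gene name,Annotation,genomeA,genomeB\n"
    "DNA invertase,,x,a_1,\n"
    "DNA invertase,,y,,b_7\n"
)
edges = list(roary_to_edges(sample, n_meta=3))
for gene, genome in edges:
    print(f"{gene}\t{genome}")
```

Both rows go through without complaint, and the repeated name only becomes a problem later, when something requires the names to be unique.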

In terms of testing to see if the gene names are indeed unique, you could try something like length(unique(genepa$Gene)) vs. nrow(genepa) to see if they match in your code above, maybe.
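If it is easier to check the CSV directly before running coinfinder, the same comparison can be sketched in Python. One guess, not confirmed against the coinfinder code: the names in the warning have dots where the annotations presumably had spaces or slashes, so two names that differ only in punctuation might collapse to the same string once sanitized in R. The `sanitize` function below is only a rough approximation of that kind of sanitization, and `duplicated_genes` and `make_unique` are hypothetical helper names.

```python
import csv
import io
import re
from collections import Counter


def sanitize(name):
    # Rough approximation (an assumption, not R's exact rules): every
    # character other than letters, digits, '.' and '_' becomes '.'.
    return re.sub(r"[^A-Za-z0-9._]", ".", name)


def duplicated_genes(rows, column="Gene", normalize=None):
    """Return {name: count} for names in `column` that occur more than once."""
    names = [row[column] for row in rows]
    if normalize is not None:
        names = [normalize(n) for n in names]
    return {name: n for name, n in Counter(names).items() if n > 1}


def make_unique(rows, column="Gene", normalize=None):
    """Return copies of the rows with _2, _3, ... appended to repeated names."""
    seen = Counter()
    out = []
    for row in rows:
        key = normalize(row[column]) if normalize else row[column]
        seen[key] += 1
        new_row = dict(row)
        if seen[key] > 1:
            new_row[column] = f"{row[column]}_{seen[key]}"
        out.append(new_row)
    return out


# Toy data: two names that differ only in punctuation.
sample = io.StringIO(
    "Gene,Annotation,genomeA,genomeB\n"
    "DNA invertase,x,a_1,\n"
    "DNA-invertase,y,,b_7\n"
    "group_1234,z,a_2,b_8\n"
)
rows = list(csv.DictReader(sample))

exact = duplicated_genes(rows)                        # compares raw names
sanitized = duplicated_genes(rows, normalize=sanitize)  # compares sanitized names
fixed = make_unique(rows, normalize=sanitize)
print(exact, sanitized, [r["Gene"] for r in fixed])
```

On a real file, `csv.DictReader(open(path, newline=""))` would replace the toy input, and the fixed rows could be written back with `csv.DictWriter`. The suffix scheme is deliberately simple and does not guard against a suffixed name colliding with a name that already exists in the file.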

fwhelan commented 1 year ago

Closing due to inactivity.