AntonioDeFalco / SCEVAN

R package that automatically classifies the cells in scRNA data by separating non-malignant cells of the tumor microenvironment from malignant cells. It also infers the copy number profile of malignant cells, identifies subclonal structures, and analyses the specific and shared alterations of each subpopulation.
https://www.nature.com/articles/s41467-023-36790-9
GNU General Public License v3.0

Remove filtering of duplicate gene symbols when rownames of counts matrix are Ensembl IDs #111

Closed: allyhawkins closed this issue 3 months ago

allyhawkins commented 4 months ago

I ran into an error when running SCEVAN on an object whose row names are Ensembl IDs rather than gene symbols. The annotateGenes() function fails with the following error:

Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 7104, 7107

I narrowed this down to the line linked below, which removes any genes that have a duplicated gene symbol in the reference edb matrix. The same filtering is not applied to the mtx variable, so the two can end up with different numbers of rows. https://github.com/AntonioDeFalco/SCEVAN/blob/228beead83187b74779f5d6bafae5ee143981bac/R/preProcessing.R#L32
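The mismatch can be reproduced on toy data. This is only a minimal sketch: the objects below are made up, the column name gene_name is an assumption, and the filter is a stand-in for the linked line rather than a copy of it.

```r
# Toy reference table and counts matrix (hypothetical values, not SCEVAN data).
# Ensembl-style IDs are unique, but two of them share the symbol "B".
edb <- data.frame(
  gene_id   = c("ENSG01", "ENSG02", "ENSG03", "ENSG04", "ENSG05"),
  gene_name = c("A", "B", "B", "C", "D"),
  stringsAsFactors = FALSE
)
mtx <- matrix(
  1:10, nrow = 5,
  dimnames = list(edb$gene_id, c("cell1", "cell2"))
)

# Stand-in for the filter: drop rows of edb with a repeated gene symbol.
edb_filtered <- edb[!duplicated(edb$gene_name), ]

# mtx was not filtered, so edb_filtered has 4 rows and mtx still has 5.
# Combining them fails in the same way as the error reported above.
res <- tryCatch(
  data.frame(edb_filtered, mtx, check.names = FALSE),
  error = function(e) conditionMessage(e)
)
print(res)
```

Here the IDs themselves are all unique, so the rows dropped from edb are still present in mtx.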

This step is probably necessary with gene symbols, since the dimensions of edb and mtx may not match if duplicated values are present in edb. With Ensembl IDs, however, there are no duplicated IDs, so the step isn't needed. Also, duplicates should only be removed based on the column indicated by use_geneID, although if that column is gene_id, I would skip this step altogether.
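The suggestion above could look roughly like the sketch below. The helper dedup_edb() is hypothetical and not part of SCEVAN, the column name gene_name is an assumption, and this is not necessarily how the maintainer fixed it.

```r
# Hypothetical helper (not part of SCEVAN): remove duplicates only in the
# column named by use_geneID, and skip the step entirely for Ensembl IDs.
dedup_edb <- function(edb, use_geneID = "gene_name") {
  if (use_geneID == "gene_id") {
    # Ensembl IDs are expected to be unique, so leave edb untouched.
    return(edb)
  }
  edb[!duplicated(edb[[use_geneID]]), , drop = FALSE]
}

# Toy reference table (hypothetical values): unique IDs, one repeated symbol.
edb <- data.frame(
  gene_id   = c("ENSG01", "ENSG02", "ENSG03", "ENSG04", "ENSG05"),
  gene_name = c("A", "B", "B", "C", "D"),
  stringsAsFactors = FALSE
)

by_id     <- dedup_edb(edb, use_geneID = "gene_id")    # all 5 rows kept
by_symbol <- dedup_edb(edb, use_geneID = "gene_name")  # 4 rows kept
```

With use_geneID = "gene_id" the reference table keeps every row, so it stays aligned with a counts matrix keyed by Ensembl ID.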

AntonioDeFalco commented 3 months ago

Thanks @allyhawkins, I fixed it in the last commit f1394b3.

Regards