JCVenterInstitute / NSForest

A machine learning method for the discovery of the minimum marker gene combinations for cell type identification from single-cell RNA sequencing
MIT License

Several issues when using scanpy object converted from seurat #7

Closed ChristinaSteyn closed 1 month ago

ChristinaSteyn commented 2 years ago

Hi there,

Thank you so much for creating such an awesome tool! I am quite new to coding, especially in Python, and encountered several errors when running the NSForest function. I am not sure whether these problems are specific to my object, but I thought I would post them here in case someone else has the same issues. I have tried to fix some of the errors and have gotten the function to finish, but I'm not 100% sure the output is correct.

---------------------------------------------------------------------------------------------------------------------------------------

The first error was in line 167 of the source code: "AttributeError: Can only use .cat accessor with a 'category' dtype". I changed this:

medianValues = pd.DataFrame(columns=adata.var_names, index=adata.obs[clusterLabelcolumnHeader].cat.categories)

to this:

medianValues = pd.DataFrame(columns=adata.var_names, index=adata.obs[clusterLabelcolumnHeader].unique())

which seemed to work.
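For reference, the original line can also be left unchanged if the label column is converted to a categorical dtype before calling the function. The following is a minimal sketch on a toy table, assuming a hypothetical column named "cluster" standing in for adata.obs[clusterLabelcolumnHeader]. It also shows one difference between the two versions: .unique() keeps the order of first appearance, while .cat.categories is sorted, so the row order of medianValues may differ.

```python
import pandas as pd

# Toy stand-in for adata.obs; "cluster" is a hypothetical column name.
obs = pd.DataFrame(
    {"cluster": ["B", "A", "B", "C"]},
    index=["cell1", "cell2", "cell3", "cell4"],
)

# A plain string column has no .cat accessor, which reproduces the error.
try:
    obs["cluster"].cat.categories
    raised = False
except AttributeError:
    raised = True

# Converting to a categorical dtype makes .cat.categories available.
obs["cluster"] = obs["cluster"].astype("category")

categories = list(obs["cluster"].cat.categories)  # sorted labels
unique_order = list(obs["cluster"].unique())      # order of first appearance
```

Either approach gives the same set of cluster labels; only the ordering differs.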

----------------------------------------------------------------------------------------------------------------------------------------------

The second error was in line 172: "ValueError: Shape of passed values is (49211, 1), indices imply (49211, 31002)" which I think is because the input to create the pandas data frame was in the wrong format.

When I changed this:

Subset_dataframe = pd.DataFrame(data = subset_adata.X, index = subset_adata.obs, columns = subset_adata.var_names)

to this:

Subset_dataframe = pd.DataFrame(data = subset_adata.X.toarray(), index = list(subset_adata.obs["cells"].tolist()), columns = subset_adata.var_names)

it seemed to work.
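The shape error is consistent with subset_adata.X being a SciPy sparse matrix, which pandas does not unpack into rows and columns, and with the whole obs table being passed as the index. Below is a hedged sketch of the same fix on toy data. The helper to_dense_frame and the toy names are hypothetical, not part of NS-Forest. It uses a list of cell names (as obs_names would provide) rather than a "cells" column, since that column may not exist in other objects, as a later comment in this thread reports.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins for subset_adata.X, subset_adata.obs_names, subset_adata.var_names.
X = sparse.csr_matrix(np.array([[0.0, 2.0], [1.0, 0.0], [0.0, 3.0]]))
obs_names = ["cell1", "cell2", "cell3"]
var_names = ["GeneA", "GeneB"]


def to_dense_frame(X, obs_names, var_names):
    """Hypothetical helper: build a cells x genes DataFrame,
    densifying X only when it is a SciPy sparse matrix."""
    values = X.toarray() if sparse.issparse(X) else np.asarray(X)
    return pd.DataFrame(values, index=list(obs_names), columns=list(var_names))


subset_dataframe = to_dense_frame(X, obs_names, var_names)

# Per-gene medians, the kind of summary the surrounding code computes per cluster.
medians = subset_dataframe.median()
```

Note that .toarray() allocates the full dense matrix, which can be large for tens of thousands of cells and genes.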

------------------------------------------------------------------------------------------------------------------------------------------------

A similar problem occurred in line 121 of the source code. When running:

def fbetaTest(x, column, adata, Binary_RankedList, testArray, betaValue = 0.5)

I get this error "ValueError: Shape of passed values is (113957, 1), indices imply (113957, 0)"

But I changed this:

Subset_dataframe = pd.DataFrame(data = subset_adata.X, index = subset_adata.obs_names, columns = subset_adata.var_names)

to this:

Subset_dataframe = pd.DataFrame(data=subset_adata.X.toarray(), index=subset_adata.obs_names.tolist(), columns=subset_adata.var_names)

and it seemed to work.
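With more than 100,000 cells, a dense copy can be memory-hungry. One possible alternative is pandas' sparse accessor, which builds a DataFrame directly from a SciPy sparse matrix. This is only a sketch on toy data with hypothetical names; whether the rest of the NS-Forest code works on sparse-backed columns has not been checked here.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins for subset_adata.X, obs_names and var_names.
X = sparse.csr_matrix(np.array([[0.0, 2.0], [1.0, 0.0], [0.0, 3.0]]))
obs_names = ["cell1", "cell2", "cell3"]
var_names = ["GeneA", "GeneB"]

# Build a DataFrame whose columns are stored as pandas sparse arrays.
sparse_frame = pd.DataFrame.sparse.from_spmatrix(
    X, index=obs_names, columns=var_names
)

# Convert to an ordinary dense DataFrame only when a downstream step needs it.
dense_frame = sparse_frame.sparse.to_dense()
```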

-------------------------------------------------------------------------------------------------------------------------------------

Another error was in line 94 of the source code: "IndexError: Index dimension must be <= 2"

X = x_train[:, None]

I don't think this line is actually necessary, so I commented it out, which seemed to fix the problem.
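For context, indexing with [:, None] inserts a new axis. On a 1-D NumPy array it produces a column vector, but on a 2-D input it produces a 3-D result, which a 2-D-only SciPy sparse matrix cannot represent; that is consistent with the IndexError above. The sketch below only illustrates the shapes on toy NumPy arrays. Whether X is used later in the function is not shown in this thread, so it does not establish that removing the line is safe.

```python
import numpy as np

one_gene = np.array([0.0, 1.5, 3.0])   # shape (3,): one value per cell
many_genes = np.zeros((3, 2))          # shape (3, 2): cells x genes

column_vector = one_gene[:, None]      # shape (3, 1)
three_d = many_genes[:, None]          # shape (3, 1, 2): an extra middle axis

# For 1-D input, reshape(-1, 1) gives the same column vector.
reshaped = one_gene.reshape(-1, 1)
```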

----------------------------------------------------------------------------------------------------------------------------------------------

I also got several errors in the last section. I couldn't quite figure out what the problems were, but I changed the code from this:

#Move binary genes to Results dataframe
clusters2Genes = pd.DataFrame(columns = ['Gene', 'clusterName'])
clusters2Genes["clusterName"] = Binary_score_store_DF["clusterName"]
clusters2Genes["Gene"] = Binary_score_store_DF.index
GroupedBinarylist = clusters2Genes.groupby('clusterName').apply(lambda x: x['Gene'].unique()) 
BinaryFinal = pd.DataFrame(columns = ['clusterName','Binary_Genes'])
BinaryFinal['clusterName'] = GroupedBinarylist.index
BinaryFinal['Binary_Genes'] = GroupedBinarylist.values

to this:

Binary_score_store_DF = pd.read_csv('NS-Forest_v3_Extended_Binary_Markers_Supplmental.csv')

# Move binary genes to Results dataframe
clusters2Genes = pd.DataFrame(columns=['Gene', 'clusterName'])
clusters2Genes["clusterName"] = Binary_score_store_DF["clusterName"]
clusters2Genes["Gene"] = Binary_score_store_DF["Unnamed: 0"]
clusters2Genes.to_csv('clusters2Genes.csv')
#GroupedBinarylist = clusters2Genes.groupby('clusterName').apply(lambda x: x['Gene'].unique())
#GroupedBinarylist = clusters2Genes.apply(lambda x: x['Gene'].unique()) #This seemed to work earlier

BinaryFinal = pd.DataFrame(columns=['clusterName', 'Binary_Genes'])
BinaryFinal['clusterName'] =  clusters2Genes["clusterName"]
BinaryFinal['Binary_Genes'] = clusters2Genes["Gene"]
BinaryFinal.to_csv('BinaryFinal.csv')

It seems that in this line of code:

clusters2Genes["clusterName"] = Binary_score_store_DF["clusterName"]

the column name in the Binary_score_store_DF dataframe was incorrect.

And in this line of code:

GroupedBinarylist = clusters2Genes.groupby('clusterName').apply(lambda x: x['Gene'].unique())

the clusters are already grouped together and all the gene names are already unique??

I couldn't get around the last problem, so I ended up just commenting out those lines of code, and now I am not sure whether my output files are what they should be. Any advice would be greatly appreciated!
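One observable difference between the two versions above: the original builds one row per cluster holding an array of genes, while the workaround writes one row per gene. The following is a hedged sketch of the grouping step on made-up data. The CSV contents and the "score" column are hypothetical. Reading with index_col=0 restores the gene names as the index, so the "Unnamed: 0" column does not appear.

```python
import io

import pandas as pd

# Made-up CSV text shaped like a table whose index (gene names) was saved
# without a header; the "score" column is purely illustrative.
csv_text = ",clusterName,score\nGeneA,c1,0.9\nGeneB,c1,0.8\nGeneC,c2,0.7\n"

# index_col=0 puts the gene names back in the index.
scores = pd.read_csv(io.StringIO(csv_text), index_col=0)

clusters2Genes = pd.DataFrame(
    {"Gene": scores.index, "clusterName": scores["clusterName"].to_numpy()}
)

# One row per cluster, with the unique genes of that cluster collected in an array.
BinaryFinal = (
    clusters2Genes.groupby("clusterName")["Gene"]
    .unique()
    .reset_index(name="Binary_Genes")
)
```

Selecting the "Gene" column before calling unique() avoids the apply with a lambda, and reset_index(name=...) yields the two columns clusterName and Binary_Genes directly.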

adRn-s commented 2 years ago

Hi there, I was just trying to test NS-Forest 3.0 on a Seurat dataset I have. Sadly, I got at least two of the very same errors and applied your fixes. They worked OK, so thanks for posting! Afterwards, though, I got a different error (key 'cells' not found). I'm not patient enough to fix it. Anyway, I just wanted to note this for the record.

ChristinaSteyn commented 2 years ago

Thanks, really appreciate hearing that someone else had the same problems and it is not just me!

yunzhang813 commented 1 month ago

Thanks for the ticket. Code refactored in v4.0.