combio-dku / MarkerCount

MarkerCount is a Python 3 cell-type identification toolkit for single-cell RNA-Seq experiments.
MIT License

Input X contains NaN #1

Open GERMAN00VP opened 9 months ago

GERMAN00VP commented 9 months ago

Hi, I'm using the MarkerCount function to predict cell-type labels in my scRNA-Seq data, but when I call the function like this:

df_res = MarkerCount( X=X, mkr_mat=marker_matrix, log_transformed = True, verbose = True )

It produces this error message:

ValueError: Input X contains NaN. GaussianMixture does not accept missing values encoded as NaN natively. (...)

I have checked both inputs and didn't find any NaN values, so I don't know what else to try.

My X dataframe looks like this:

[Image: screenshot from 2023-10-16 18-29-14]

And my marker_matrix like this:

[Image: marker_matrix]

combio-dku commented 9 months ago

Hi, I suspect the problem is that the input matrix contains negative values, probably because you z-score scaled the data after the log-transformation. Could you try without z-score scaling, so that the matrix contains only non-negative values? By the way, there is an upgraded version of MarkerCount called HiCAT: https://github.com/combio-dku/hicat/. I suggest trying it, as it performs better and is more stable.
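
A quick way to check a matrix for this kind of issue is sketched below. This is only an illustration with plain pandas/NumPy and is not part of MarkerCount or HiCAT; `summarize_matrix` and the toy gene/cell names are hypothetical.

```python
import numpy as np
import pandas as pd


def summarize_matrix(X: pd.DataFrame) -> dict:
    """Count the kinds of entries that commonly cause trouble (hypothetical helper)."""
    values = X.to_numpy(dtype=float)
    return {
        "n_nan": int(np.isnan(values).sum()),        # missing values
        "n_inf": int(np.isinf(values).sum()),        # e.g. from log(0)
        "n_negative": int((values < 0).sum()),       # e.g. from z-score scaling
        "n_zero": int((values == 0).sum()),          # "not expressed" entries
        "zero_fraction": float((values == 0).mean()),
    }


# Toy matrix for illustration only (names and values are made up).
toy = pd.DataFrame(
    {"GeneA": [0.0, 1.2, np.nan], "GeneB": [-0.5, 0.0, 2.0]},
    index=["cell1", "cell2", "cell3"],
)
report = summarize_matrix(toy)
print(report)
```

Running the same function on the real X shows at a glance whether it contains NaN, infinite, or negative entries, and how many entries are exactly zero.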

combio-dku commented 9 months ago

If you want to try HiCAT to annotate pancreas tissue, you can use this marker file (you can replace the list of markers with your own): cell_markers_rndsystems_with_pancreas_hs.tsv.txt

GERMAN00VP commented 9 months ago

Hi, I tried the non-scaled data but I'm still getting the same error. I also tried the HiCAT package and got "LinAlgError: SVD did not converge in Linear Least Squares".

This is the X data I'm using (a subsample):

Cells_expression_matrix_subsample.tsv.txt

The cell_markers data I'm using is the one you gave me.

Thank you for your help!

combio-dku commented 9 months ago

Hi, I looked at your data and found that it is quite different from a typical single-cell count matrix (even a normalized and log-transformed one). Most importantly, all the entries in your data are non-zero. Both HiCAT and MarkerCount use binary information, i.e., whether a gene is expressed or not, so if every expression value is non-zero (greater than 0), the number of expressed marker genes is the same for all cells. I think this is why HiCAT and MarkerCount raised errors.
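
To illustrate the point, here is a minimal sketch that counts expressed marker genes per cell. It assumes cells are in rows and genes in columns, treats "expressed" as a value greater than 0, and uses a hypothetical helper `expressed_marker_counts` with made-up toy values; it is not the internal code of HiCAT or MarkerCount.

```python
import pandas as pd


def expressed_marker_counts(X: pd.DataFrame, markers) -> pd.Series:
    """Number of marker genes with a value > 0 in each cell (hypothetical helper)."""
    present = [g for g in markers if g in X.columns]  # ignore markers missing from X
    return (X[present] > 0).sum(axis=1)


markers = ["INS", "GCG", "SST"]  # illustrative gene names only

# A matrix with no zeros: every marker looks "expressed" in every cell.
dense = pd.DataFrame(
    {"INS": [2.1, 0.4, 1.3], "GCG": [0.2, 3.0, 0.9], "SST": [0.5, 0.1, 2.2]},
    index=["cell1", "cell2", "cell3"],
)
dense_counts = expressed_marker_counts(dense, markers)

# The same matrix with small values set to zero (the cutoff 1.0 is arbitrary here).
sparse = dense.where(dense >= 1.0, 0.0)
sparse_counts = expressed_marker_counts(sparse, markers)

print(dense_counts.tolist())   # identical count for every cell
print(sparse_counts.tolist())  # counts now differ between cells
```

With the dense matrix every cell gets the same count, so the binary signal carries no information to separate cell types; once some entries are zero, the counts start to differ.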

Therefore, I tried the following:

  1. First, find the median value of your matrix, i.e., med = X.median().median()
  2. Then, set every entry below the median to zero: X = X*(X >= med)
  3. After this, roughly half of the entries in X are zero
  4. With this matrix, I ran HiCAT and got the result: hicat_result.csv
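
The thresholding steps above can be sketched as follows. This is a minimal illustration on random toy data, not the actual dataset from this thread; the cell and gene names are made up, and `DataFrame.where` is used as an equivalent of `X*(X >= med)` for finite values. Note that `X.median().median()` is the median of the per-column medians, so the resulting fraction of zeros is only approximately one half.

```python
import numpy as np
import pandas as pd

# Toy dense matrix with strictly positive values (illustrative data only).
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.uniform(0.1, 5.0, size=(6, 4)),
    index=[f"cell{i}" for i in range(1, 7)],
    columns=[f"gene{j}" for j in range(1, 5)],
)

# Step 1: median of the per-column medians.
med = X.median().median()

# Step 2: keep entries >= med and set everything else to zero.
X_sparse = X.where(X >= med, 0.0)

# Step 3: check how many entries became zero.
zero_fraction = float((X_sparse == 0).to_numpy().mean())
print(f"median threshold: {med:.3f}, fraction of zeros: {zero_fraction:.2f}")
```

The thresholded matrix `X_sparse` is what would then be passed to the annotation tool in place of the original X.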

As I said, HiCAT and MarkerCount use binary information (either expressed or not), so some of the expression values must be zero for the tools to work properly. I hope this resolves your problem.

GERMAN00VP commented 9 months ago

Yes, it resolved my problem. Thank you so much for your help!