cliffren / PENCIL

PENCIL is a novel tool for single-cell data analysis that identifies phenotype-associated subpopulations and informative genes simultaneously.
GNU General Public License v3.0
25 stars · 1 fork

The cells were all rejected #3

Closed AcetylCoALab closed 11 months ago

AcetylCoALab commented 1 year ago

Hello, in my data there are 4441 cells in two phenotypes: a disease group and a healthy group. I followed the tutorial and ran it in R, but the result shows that all my cells are rejected. What is the reason for this, and do I need to adjust the parameters?

sc_data <- sce
sc_data$cell_phenotype_labels_simulation <- sc_data$group
num_classes <- length(unique(sc_data$cell_phenotype_labels_simulation))
pal <- c('gray', hue_pal()(num_classes))

sc_data[["umap"]] <- CreateDimReducObject(embeddings = as.matrix(sc_data@reductions$umap@cell.embeddings), key = "UMAPRaw",assay = "RNA")

exp_data = sc_data@assays[["RNA"]]@scale.data[VariableFeatures(sc_data),]
labels = as.factor(sc_data$cell_phenotype_labels_simulation)
class_names <- levels(labels)
labels_2_ids <- as.numeric(labels) - 1
py_code <- "

from pencil import *
import scanpy as sc
from matplotlib.colors import LinearSegmentedColormap
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1' # select a gpu id, otherwise set to '-1'.
import warnings
warnings.filterwarnings('ignore') # to ignore warnings on the output.

# For recording the results.
data_name = 'PENCIL_tutorial_1'
expr_id = '0.0.1'

data = r.exp_data.T.copy()
labels = np.array(r.labels_2_ids, dtype=int)

mode = 'multi-classification'
pencil = Pencil(mode, select_genes=True, seed=1234, data_name=data_name, expr_id=expr_id)
pred, confidence = pencil.fit_transform(
    data, labels,
    test=True,
    shuffle_rate=1/4,
    lambda_L1=1e-5,
    lambda_L2=1e-3,
    lr=0.01,
    class_weights=None,
    class_names=r.class_names,
    plot_show=False
    )"
py_run_string(py_code)

dataset: PENCIL_tutorial_1, expr_id: 0.0.1
scheme: ce, Sigmoid
searching c...
cuda is not available, and cpu is used.
cmin:0.000, cmax:2.000, c:1.000, rejected 0 cells.
cuda is not available, and cpu is used.
cmin:0.000, cmax:1.000, c:0.500, rejected 251 cells.
cuda is not available, and cpu is used.
cmin:0.000, cmax:0.500, c:0.250, rejected 446 cells.
cuda is not available, and cpu is used.
cmin:0.000, cmax:0.250, c:0.125, rejected 1054 cells.
cuda is not available, and cpu is used.
cmin:0.000, cmax:0.125, c:0.062, rejected 827 cells.
cuda is not available, and cpu is used.
cmin:0.000, cmax:0.062, c:0.031, rejected 1988 cells.
cuda is not available, and cpu is used.
cmin:0.000, cmax:0.031, c:0.016, rejected 3786 cells.
searched c: 0.0
pretrain 500 epochs...
cuda is not available, and cpu is used.
epoch=0, loss=0.0053, mean_e=0.0074, mean_r=-0.0004, L1_reg=164.4811
epoch=20, loss=0.0039, mean_e=0.0063, mean_r=-0.2959, L1_reg=164.7345
epoch=40, loss=0.0041, mean_e=0.0071, mean_r=-0.4233, L1_reg=179.1047
epoch=60, loss=0.0052, mean_e=0.0101, mean_r=-0.5676, L1_reg=204.5981
epoch=80, loss=0.0076, mean_e=0.0177, mean_r=-0.7026, L1_reg=238.3230
epoch=100, loss=0.0103, mean_e=0.0268, mean_r=-0.7766, L1_reg=272.7701
epoch=120, loss=0.0109, mean_e=0.0279, mean_r=-0.7867, L1_reg=297.9885
epoch=140, loss=0.0107, mean_e=0.0266, mean_r=-0.7921, L1_reg=313.5898
epoch=160, loss=0.0105, mean_e=0.0257, mean_r=-0.7938, L1_reg=324.4932
epoch=180, loss=0.0104, mean_e=0.0250, mean_r=-0.7920, L1_reg=332.4452
epoch=200, loss=0.0102, mean_e=0.0241, mean_r=-0.7867, L1_reg=338.6054
epoch=220, loss=0.0101, mean_e=0.0235, mean_r=-0.7813, L1_reg=342.9875
epoch=240, loss=0.0103, mean_e=0.0241, mean_r=-0.7860, L1_reg=348.2928
epoch=260, loss=0.0104, mean_e=0.0245, mean_r=-0.7898, L1_reg=353.2883
epoch=280, loss=0.0105, mean_e=0.0244, mean_r=-0.7906, L1_reg=357.5322
epoch=300, loss=0.0104, mean_e=0.0242, mean_r=-0.7908, L1_reg=361.5029
epoch=320, loss=0.0104, mean_e=0.0240, mean_r=-0.7899, L1_reg=365.2095
epoch=340, loss=0.0104, mean_e=0.0237, mean_r=-0.7884, L1_reg=368.0286
epoch=360, loss=0.0103, mean_e=0.0235, mean_r=-0.7865, L1_reg=371.5217
epoch=380, loss=0.0103, mean_e=0.0232, mean_r=-0.7847, L1_reg=374.9784
epoch=400, loss=0.0102, mean_e=0.0228, mean_r=-0.7817, L1_reg=377.7903
epoch=420, loss=0.0102, mean_e=0.0226, mean_r=-0.7795, L1_reg=380.8744
epoch=440, loss=0.0102, mean_e=0.0228, mean_r=-0.7820, L1_reg=383.9538
epoch=460, loss=0.0103, mean_e=0.0228, mean_r=-0.7814, L1_reg=387.1477
epoch=480, loss=0.0103, mean_e=0.0227, mean_r=-0.7806, L1_reg=389.7986
---train time: 3473.9056594371796 seconds ---

cuda is not available, and cpu is used.
Number of examples rejected= 4441 / 4441
num_of_rejcted
KO    2516
WT    1925
Name: count, dtype: int64

--- without rejection ---
              precision    recall  f1-score   support

      KO       0.99      0.99      0.99      2516
      WT       0.99      0.99      0.99      1925

accuracy                           0.99      4441

   macro avg       0.99      0.99      0.99      4441
weighted avg       0.99      0.99      0.99      4441

--- with rejection ---
              precision    recall  f1-score   support

      KO       0.00      0.00      0.00       0.0
      WT       0.00      0.00      0.00       0.0

   micro avg       0.00      0.00      0.00       0.0
   macro avg       0.00      0.00      0.00       0.0
weighted avg       0.00      0.00      0.00       0.0

---test time: 2.4009311199188232 seconds ---

cliffren commented 1 year ago

Hi, thanks for your question. In our algorithm, the percentage of rejected cells can be reduced by increasing the shuffle_rate. I suggest you try setting shuffle_rate to 1/3, 1/2, or 1.
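For intuition, my understanding of the shuffle_rate idea is that a fraction of the labels is randomly permuted while calibrating the rejection module, so a larger value makes rejection less aggressive. A minimal numpy sketch of that label-shuffling step (the helper name and details are illustrative, not PENCIL's internals):

```python
import numpy as np

def shuffle_labels(labels, shuffle_rate, seed=0):
    """Randomly permute a fraction of the labels.
    Illustrative helper, not PENCIL's actual implementation."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    n = len(labels)
    # pick n * shuffle_rate positions and permute their labels among themselves
    idx = rng.choice(n, size=int(n * shuffle_rate), replace=False)
    labels[idx] = rng.permutation(labels[idx])
    return labels

labels = np.array([0] * 6 + [1] * 6)
noisy = shuffle_labels(labels, shuffle_rate=1/3)
# at most 1/3 of the labels can differ from the originals
print((noisy != labels).mean())
```

With shuffle_rate=1/3 on 12 cells, at most 4 labels change, which bounds how much label noise the rejection threshold is tuned against.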

However, judging from the results without rejection, the cells of the two phenotypes in your data are already very well separated, possibly due to overfitting or batch effects. I would suggest you first test whether the results improve by increasing the sparse regularization factor lambda_L1 to 1e-4 or 1e-3. Also, could you do a UMAP visualization to check for a batch effect? If there is one, your data may need batch-effect correction before performing the abundance-testing task.
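For intuition on why increasing lambda_L1 can help: an L1 penalty drives more gene weights to exactly zero, so fewer genes are available to drive the near-perfect separation. A toy soft-thresholding sketch (the proximal step of an L1 penalty; the weights and lambda values are made up for illustration):

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of the L1 penalty: shrinks weights toward zero
    and sets the small ones exactly to zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([0.50, -0.03, 0.20, 0.004, -0.70])  # toy gene weights
for lam in (1e-5, 1e-3, 1e-1):
    sparse_w = soft_threshold(w, lam)
    print(lam, int((sparse_w == 0).sum()))  # larger lam -> more zeroed weights
```

This is only the generic L1 mechanism, not PENCIL's exact optimizer, but it shows why moving lambda_L1 from 1e-5 toward 1e-3 forces a sparser gene set.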

If the two groups of cells are still well separated after the batch effect is removed, there is no need to run a differential cell-abundance test. Of course, you can still run PENCIL, but the effect may just be equivalent to identifying the key differential genes, which are obtained via the L1 sparse regularization.

Michael-Geuenich commented 6 months ago

Hi,

I've been having the same issue. I've been able to reproduce the tutorials and am now applying pencil to my own data. I have 13k CD4 and CD8 T cells and a continuous phenotype derived for each sample (23 samples total). I'm trying to run pencil in regression mode using the following code:

data_name = 'PENCIL_test'
expr_id = '0.0.1'
mode = 'regression'
pencil = Pencil(mode, select_genes=True, seed=1234, data_name=data_name, expr_id=expr_id, mlflow_record=True, dropouts=[0.0, 0.0]) #`select_genes` can also be set to False, if True, pencil will output a weight vector for all 10 PCs. 
with mlflow.start_run():
    pred, confidence = pencil.fit_transform(
        expression,
        lost_probs, 
        test=True,
        shuffle_rate=1/5,
        lambda_L1=1e-5, 
        lambda_L2=0.0, 
        lr=0.01, 
        epochs=2000, 
        class_weights=None,
        emd=t_cells.obsm['X_umap'],
        plot_show=True
        )

Experiment 'PENCIL_test' already exists.
dataset: PENCIL_test, expr_id: 0.0.1
scheme: sml1, NGLR
searching c...
cmin:0.000, cmax:2.000, c:1.000, rejected 0 cells.
cmin:0.000, cmax:1.000, c:0.500, rejected 0 cells.
cmin:0.000, cmax:0.500, c:0.250, rejected 0 cells.
cmin:0.000, cmax:0.250, c:0.125, rejected 0 cells.
cmin:0.000, cmax:0.125, c:0.062, rejected 0 cells.
cmin:0.000, cmax:0.062, c:0.031, rejected 0 cells.
cmin:0.000, cmax:0.031, c:0.016, rejected 0 cells.
searched c: 0.0
cuda is available.
epoch=0, loss=0.0772, mean_e=0.0542, mean_r=0.0100, L1_reg=22.4290
epoch=20, loss=0.0067, mean_e=0.0441, mean_r=-0.9997, L1_reg=6.6862
epoch=40, loss=0.0048, mean_e=0.0330, mean_r=-0.9953, L1_reg=4.6390
epoch=60, loss=0.0038, mean_e=0.0026, mean_r=-0.9312, L1_reg=3.6371
epoch=80, loss=0.0035, mean_e=0.0014, mean_r=-0.7655, L1_reg=3.2141
epoch=100, loss=0.0037, mean_e=0.0014, mean_r=-0.6842, L1_reg=3.2770
epoch=120, loss=0.0033, mean_e=0.0013, mean_r=-0.6779, L1_reg=2.8206
epoch=140, loss=0.0032, mean_e=0.0013, mean_r=-0.6883, L1_reg=2.8243
epoch=160, loss=0.0035, mean_e=0.0013, mean_r=-0.7045, L1_reg=3.1749
epoch=180, loss=0.0032, mean_e=0.0012, mean_r=-0.7220, L1_reg=2.8259
epoch=200, loss=0.0032, mean_e=0.0012, mean_r=-0.7375, L1_reg=2.8513
epoch=220, loss=0.0035, mean_e=0.0012, mean_r=-0.7532, L1_reg=3.2217
epoch=240, loss=0.0030, mean_e=0.0012, mean_r=-0.7663, L1_reg=2.7499
epoch=260, loss=0.0032, mean_e=0.0012, mean_r=-0.7795, L1_reg=2.9547
epoch=280, loss=0.0034, mean_e=0.0012, mean_r=-0.7879, L1_reg=3.1117
epoch=300, loss=0.0032, mean_e=0.0012, mean_r=-0.7967, L1_reg=2.9369
epoch=320, loss=0.0031, mean_e=0.0012, mean_r=-0.8077, L1_reg=2.8416
epoch=340, loss=0.0034, mean_e=0.0012, mean_r=-0.8158, L1_reg=3.2342
epoch=360, loss=0.0031, mean_e=0.0012, mean_r=-0.8224, L1_reg=2.8802
epoch=380, loss=0.0031, mean_e=0.0012, mean_r=-0.8291, L1_reg=2.9322
epoch=400, loss=0.0034, mean_e=0.0012, mean_r=-0.8333, L1_reg=3.1866
epoch=420, loss=0.0031, mean_e=0.0012, mean_r=-0.8380, L1_reg=2.9135
epoch=440, loss=0.0031, mean_e=0.0012, mean_r=-0.8423, L1_reg=2.9149
epoch=460, loss=0.0033, mean_e=0.0012, mean_r=-0.8465, L1_reg=3.1610
epoch=480, loss=0.0031, mean_e=0.0012, mean_r=-0.8484, L1_reg=2.8924
epoch=500, loss=0.0031, mean_e=0.0012, mean_r=-0.8544, L1_reg=2.8852
epoch=520, loss=0.0034, mean_e=0.0012, mean_r=-0.8484, L1_reg=3.2109
epoch=540, loss=0.0031, mean_e=0.0012, mean_r=-0.8506, L1_reg=2.9159
epoch=560, loss=0.0031, mean_e=0.0012, mean_r=-0.8491, L1_reg=2.9133
epoch=580, loss=0.0034, mean_e=0.0012, mean_r=-0.8474, L1_reg=3.2278
epoch=600, loss=0.0030, mean_e=0.0012, mean_r=-0.8471, L1_reg=2.7783
epoch=620, loss=0.0031, mean_e=0.0012, mean_r=-0.8459, L1_reg=2.9097
epoch=640, loss=0.0032, mean_e=0.0012, mean_r=-0.8460, L1_reg=3.0506
epoch=660, loss=0.0031, mean_e=0.0012, mean_r=-0.8457, L1_reg=2.8719
epoch=680, loss=0.0030, mean_e=0.0012, mean_r=-0.8458, L1_reg=2.8498
epoch=700, loss=0.0034, mean_e=0.0012, mean_r=-0.8474, L1_reg=3.2392
epoch=720, loss=0.0031, mean_e=0.0012, mean_r=-0.8422, L1_reg=2.8924
epoch=740, loss=0.0031, mean_e=0.0012, mean_r=-0.8443, L1_reg=2.8912
epoch=760, loss=0.0033, mean_e=0.0012, mean_r=-0.8459, L1_reg=3.1647
epoch=780, loss=0.0030, mean_e=0.0012, mean_r=-0.8420, L1_reg=2.8526
epoch=800, loss=0.0030, mean_e=0.0012, mean_r=-0.8423, L1_reg=2.8378
epoch=820, loss=0.0032, mean_e=0.0012, mean_r=-0.8424, L1_reg=3.0422
epoch=840, loss=0.0030, mean_e=0.0012, mean_r=-0.8425, L1_reg=2.7999
epoch=860, loss=0.0030, mean_e=0.0012, mean_r=-0.8417, L1_reg=2.8412
epoch=880, loss=0.0033, mean_e=0.0012, mean_r=-0.8452, L1_reg=3.1157
epoch=900, loss=0.0031, mean_e=0.0012, mean_r=-0.8438, L1_reg=2.9542
epoch=920, loss=0.0031, mean_e=0.0012, mean_r=-0.8412, L1_reg=2.8845
epoch=940, loss=0.0034, mean_e=0.0012, mean_r=-0.8399, L1_reg=3.1796
epoch=960, loss=0.0030, mean_e=0.0012, mean_r=-0.8406, L1_reg=2.8273
epoch=980, loss=0.0031, mean_e=0.0012, mean_r=-0.8394, L1_reg=2.9202
epoch=1000, loss=0.0032, mean_e=0.0012, mean_r=-0.8405, L1_reg=3.0560
epoch=1020, loss=0.0030, mean_e=0.0012, mean_r=-0.8382, L1_reg=2.7904
epoch=1040, loss=0.0030, mean_e=0.0012, mean_r=-0.8401, L1_reg=2.8022
epoch=1060, loss=0.0034, mean_e=0.0012, mean_r=-0.8395, L1_reg=3.1635
epoch=1080, loss=0.0030, mean_e=0.0012, mean_r=-0.8403, L1_reg=2.7840
epoch=1100, loss=0.0032, mean_e=0.0012, mean_r=-0.8410, L1_reg=2.9610
epoch=1120, loss=0.0033, mean_e=0.0012, mean_r=-0.8379, L1_reg=3.1172
epoch=1140, loss=0.0031, mean_e=0.0012, mean_r=-0.8381, L1_reg=2.8863
epoch=1160, loss=0.0030, mean_e=0.0012, mean_r=-0.8381, L1_reg=2.8226
epoch=1180, loss=0.0033, mean_e=0.0012, mean_r=-0.8394, L1_reg=3.0674
epoch=1200, loss=0.0030, mean_e=0.0012, mean_r=-0.8380, L1_reg=2.8104
epoch=1220, loss=0.0030, mean_e=0.0012, mean_r=-0.8374, L1_reg=2.8351
epoch=1240, loss=0.0033, mean_e=0.0012, mean_r=-0.8382, L1_reg=3.0904
epoch=1260, loss=0.0031, mean_e=0.0012, mean_r=-0.8374, L1_reg=2.8696
epoch=1280, loss=0.0030, mean_e=0.0012, mean_r=-0.8363, L1_reg=2.8468
epoch=1300, loss=0.0032, mean_e=0.0012, mean_r=-0.8369, L1_reg=3.0555
epoch=1320, loss=0.0030, mean_e=0.0012, mean_r=-0.8397, L1_reg=2.8025
epoch=1340, loss=0.0030, mean_e=0.0012, mean_r=-0.8375, L1_reg=2.8607
epoch=1360, loss=0.0032, mean_e=0.0012, mean_r=-0.8363, L1_reg=3.0162
epoch=1380, loss=0.0030, mean_e=0.0012, mean_r=-0.8365, L1_reg=2.8288
epoch=1400, loss=0.0030, mean_e=0.0012, mean_r=-0.8352, L1_reg=2.8036
epoch=1420, loss=0.0033, mean_e=0.0012, mean_r=-0.8362, L1_reg=3.1405
epoch=1440, loss=0.0030, mean_e=0.0012, mean_r=-0.8362, L1_reg=2.8053
epoch=1460, loss=0.0031, mean_e=0.0018, mean_r=-0.8514, L1_reg=2.8640
epoch=1480, loss=0.0033, mean_e=0.0012, mean_r=-0.8249, L1_reg=3.0993
epoch=1500, loss=0.0029, mean_e=0.0012, mean_r=-0.8331, L1_reg=2.7509
epoch=1520, loss=0.0030, mean_e=0.0012, mean_r=-0.8377, L1_reg=2.8196
epoch=1540, loss=0.0033, mean_e=0.0012, mean_r=-0.8372, L1_reg=3.0848
epoch=1560, loss=0.0030, mean_e=0.0012, mean_r=-0.8369, L1_reg=2.8442
epoch=1580, loss=0.0030, mean_e=0.0012, mean_r=-0.8362, L1_reg=2.8145
epoch=1600, loss=0.0033, mean_e=0.0012, mean_r=-0.8366, L1_reg=3.0936
epoch=1620, loss=0.0030, mean_e=0.0012, mean_r=-0.8359, L1_reg=2.8547
epoch=1640, loss=0.0030, mean_e=0.0012, mean_r=-0.8364, L1_reg=2.8008
epoch=1660, loss=0.0033, mean_e=0.0012, mean_r=-0.8349, L1_reg=3.1166
epoch=1680, loss=0.0030, mean_e=0.0012, mean_r=-0.8364, L1_reg=2.8173
epoch=1700, loss=0.0030, mean_e=0.0012, mean_r=-0.8365, L1_reg=2.8408
epoch=1720, loss=0.0032, mean_e=0.0012, mean_r=-0.8359, L1_reg=3.0038
epoch=1740, loss=0.0030, mean_e=0.0012, mean_r=-0.8365, L1_reg=2.8148
epoch=1760, loss=0.0029, mean_e=0.0012, mean_r=-0.8358, L1_reg=2.7558
epoch=1780, loss=0.0033, mean_e=0.0012, mean_r=-0.8361, L1_reg=3.0945
epoch=1800, loss=0.0030, mean_e=0.0012, mean_r=-0.8359, L1_reg=2.8057
epoch=1820, loss=0.0031, mean_e=0.0012, mean_r=-0.8350, L1_reg=2.9209
epoch=1840, loss=0.0032, mean_e=0.0012, mean_r=-0.8355, L1_reg=3.0503
epoch=1860, loss=0.0030, mean_e=0.0012, mean_r=-0.8360, L1_reg=2.8206
epoch=1880, loss=0.0030, mean_e=0.0012, mean_r=-0.8365, L1_reg=2.8110
epoch=1900, loss=0.0032, mean_e=0.0012, mean_r=-0.8349, L1_reg=3.0199
epoch=1920, loss=0.0029, mean_e=0.0012, mean_r=-0.8342, L1_reg=2.7262
epoch=1940, loss=0.0030, mean_e=0.0012, mean_r=-0.8346, L1_reg=2.8230
epoch=1960, loss=0.0033, mean_e=0.0012, mean_r=-0.8346, L1_reg=3.1077
epoch=1980, loss=0.0031, mean_e=0.0012, mean_r=-0.8348, L1_reg=2.8800

I've tried changing the shuffle rate anywhere from 1 to 1/1000000, and I've also tried setting lambda_L1 to 1e-3 and 1e-4 as suggested above (while keeping the shuffle rate at 1/5). This is my loss curve:

[image: loss curve]

And this is the final output:

[image: final output]

Any help would be much appreciated.

On a different note: how does pencil expect data to be normalized? For the expression I've assumed library-size-normalized log-counts are expected, but I don't seem to be able to find any documentation on this. Similarly, for regression problems, would it be beneficial to scale the labels?
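For reference, the preprocessing I'm assuming above can be sketched with numpy alone: library-size normalization to a fixed total followed by log1p (the usual scanpy-style scheme), plus a min-max rescaling of the continuous labels to [0, 1]. Whether PENCIL actually expects either of these is exactly the open question; this just makes the assumption concrete:

```python
import numpy as np

def lognorm(counts, target_sum=1e4):
    """Library-size normalize each cell to target_sum total counts, then log1p.
    This mirrors the common scanpy-style normalization; it is an assumption,
    not documented PENCIL behavior."""
    counts = np.asarray(counts, dtype=float)
    size = counts.sum(axis=1, keepdims=True)  # per-cell library size
    return np.log1p(counts / size * target_sum)

def minmax(labels):
    """Rescale continuous phenotype labels to [0, 1] (hypothetical label scaling)."""
    labels = np.asarray(labels, dtype=float)
    return (labels - labels.min()) / (labels.max() - labels.min())

counts = np.array([[10, 0, 90], [5, 5, 40]])   # toy cells x genes count matrix
x = lognorm(counts)
y = minmax(np.array([2.0, 7.0, 12.0]))         # toy per-sample phenotype values
```

After lognorm, undoing the log1p recovers rows that each sum to target_sum, and minmax pins the label range to [0, 1].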