digitalcytometry / cytotrace2

CytoTRACE 2 is an interpretable AI method for predicting cellular potency and absolute developmental potential from scRNA-seq data.

Something wrong in the python version of cytotrace2 #21

Closed dicklim closed 2 weeks ago

dicklim commented 3 weeks ago

Hi, I am running the Python version of CytoTRACE 2. Here are some suggestions and a few problems to report.

  1. The input is a file path, and the expression matrix is transposed in the load() script, after which it has the same shape as the scanpy anndata.X. So can the scanpy matrix be used as the input directly? (I changed the script to expression = input_anndata.X at line 113 of cytotrace2_py.py, and it works.)
  2. Line 120 of common/gen_utils.py: the expression [i for i in duplicate_genes if i is not np.nan] looks odd and can raise an error. I changed it to [i for i in duplicate_genes if not pd.isna(i)]. I am not sure whether my pandas version causes the problem. (My version is pandas 2.2.2.)
  3. Line 182 of cytotrace2_py.py: predicted_df_final = predicted_df_final.loc[original_names] does not seem to work with my version of Python. original_names is an Index, and I am not sure whether it can index the DataFrame correctly (with my version of pandas).
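
For readers wondering why the two filters in point 2 can behave differently: `is not np.nan` is an identity check against one specific float object, so it only removes that exact object, while `pd.isna` tests the value. The sketch below uses a made-up gene list and does not reproduce how `duplicate_genes` is actually built in gen_utils.py, which is not shown in this thread.

```python
import numpy as np
import pandas as pd

# Hypothetical list mixing gene names with several kinds of missing value.
genes = ["GeneA", np.nan, float("nan"), None, "GeneB"]

# Identity check: drops only the np.nan singleton itself.
identity_filtered = [g for g in genes if g is not np.nan]

# Value check: drops every NaN-like entry, including None.
isna_filtered = [g for g in genes if not pd.isna(g)]

print(len(identity_filtered))  # the float("nan") and None entries survive
print(isna_filtered)
```

Whether this is the cause of the error reported above is an assumption; it only shows that the identity check can let missing values through.
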
dicklim commented 3 weeks ago

Sorry, point 3 was my fault. The . symbol in the cell names was changed to -. I don't know why.

Niubile001 commented 2 weeks ago

Hello,

Did the modifications to the original Python script work successfully? I agree that it would be ideal if we could simply provide an AnnData object as input, set an ‘obs’ name as the annotation information, and then obtain all the results. By the way, the change from . to - in the cell names might be caused by the Seurat package.

Best regards,

savagyan00 commented 2 weeks ago

Hi, thank you for using CytoTRACE 2 and for your efforts in finding workarounds for the issues you have experienced:

  1. Currently, our tool requires files in a tab-delimited format (.txt) without double quotation marks. We appreciate your feedback and will consider supporting a file path to AnnData objects in future updates to make our tool more versatile and user-friendly.
  2. I've tested it with pandas 2.2.2 but couldn't replicate the issue. Could you please share the exact error message you encountered, and also your numpy version? This will help us better understand and address the problem.
  3. The Python implementation of our tool interacts with R scripts, and R might change "-" to "." in indices when reading a file, so we handle this possible conversion manually. Is this still an issue for you, or did it run without errors?
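
Until AnnData input is supported, the format in point 1 can be produced from an in-memory matrix. This is a minimal sketch under assumptions drawn from the thread: the file holds genes as rows and cells as columns (the first post says the loader transposes the matrix to match anndata.X), and the header layout written by pandas is acceptable to the tool. `write_expression_txt` is a hypothetical helper, not part of cytotrace2_py; with an AnnData object one would pass `adata.X`, `adata.obs_names` and `adata.var_names`.

```python
import csv
import os
import tempfile

import numpy as np
import pandas as pd


def write_expression_txt(matrix, cell_names, gene_names, path):
    """Write a cells x genes matrix as a genes x cells tab-delimited file."""
    if hasattr(matrix, "toarray"):  # scipy sparse matrices expose toarray()
        matrix = matrix.toarray()
    frame = pd.DataFrame(
        np.asarray(matrix).T,  # transpose to genes x cells
        index=list(gene_names),
        columns=list(cell_names),
    )
    # QUOTE_NONE keeps double quotation marks out of the file.
    frame.to_csv(path, sep="\t", quoting=csv.QUOTE_NONE)
    return frame


# Toy data: 2 cells x 3 genes (made-up values).
counts = np.array([[1, 0, 5], [0, 2, 3]])
cells = ["cell_1", "cell_2"]
genes = ["GeneA", "GeneB", "GeneC"]

out_path = os.path.join(tempfile.mkdtemp(), "expression.txt")
write_expression_txt(counts, cells, genes, out_path)
round_trip = pd.read_csv(out_path, sep="\t", index_col=0)
```

For very large datasets the dense conversion may not fit in memory, so writing in chunks of cells would be a natural refinement.
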

We're grateful for your feedback! If you have any more suggestions or need further assistance, please don't hesitate to reach out.

dicklim commented 2 weeks ago

Hi, thanks for the reply.

  1. Using a tab-delimited (.txt) file as input is no problem. My only concern is that my scRNA-seq data is processed with Scanpy, so generating a txt file takes a long time when I process a large number of cells (LuCA, for example). Anyway, this is not a big problem.
  2. My numpy version is 1.26.4. I can't remember the exact error message now. My solution was to change that code to '[i for i in duplicate_genes if not pd.isna(i)]'.
  3. Yes, I think the R script causes the problem. My solution was to use a dict that maps the original cell names to the names in the CytoTRACE 2 output.
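
The dict-based workaround in point 3 could look roughly like the sketch below. The renaming rule (`dash_to_dot`), the cell names and the scores are illustrative assumptions; the actual rule applied inside the R scripts is not shown in this thread. The helper also refuses to continue when two original names collapse to the same renamed key.

```python
def build_name_map(original_names, sanitize):
    """Map each renamed cell ID back to its original ID."""
    name_map = {}
    for name in original_names:
        key = sanitize(name)
        if key in name_map and name_map[key] != name:
            # Two different originals would become indistinguishable.
            raise ValueError(
                f"{name!r} and {name_map[key]!r} both become {key!r}"
            )
        name_map[key] = name
    return name_map


def dash_to_dot(name):
    # Assumed renaming rule: every "-" becomes ".".
    return name.replace("-", ".")


# Hypothetical cell IDs containing both "-" and ".".
original_names = ["bcAAAA-S1.t1", "bcBBBB-S1.t2"]
name_map = build_name_map(original_names, dash_to_dot)

# Hypothetical tool output keyed by the renamed IDs.
tool_scores = {"bcAAAA.S1.t1": 0.42, "bcBBBB.S1.t2": 0.87}
restored = {name_map[key]: value for key, value in tool_scores.items()}
print(restored)
```

Because the lookup goes through the stored mapping rather than reversing the substitution, IDs that mix "-" and "." are restored correctly as long as no two of them collide.
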

The issues I mentioned above have all been solved now. Thanks again for providing such nice software!

savagyan00 commented 2 weeks ago

Hi and thanks for your response,

I am glad to hear all the issues are resolved, and hope our tool continues to be useful in your work. We appreciate your feedback which helps us continuously improve our tool.

Please feel free to reach out in case of any other questions!

Niubile001 commented 1 week ago

Hello,

Thank you for the great tool. I recently encountered an issue that returned an error, as shown below. I am not sure if this error was caused by manually modifying "-" and "." during data processing. I am concerned that it might cause an error when the original cell ID we provide contains both "-" and ".".

Best regards,

————————————————————————————————————————————————————————
cytotrace2: Input parameters
   Input file: Data/adata_epi_alveolar_expdata_sample.txt
   Species: human
   Full model: False
   Parallelization enabled: True
   User-provided limit for number of cores to use: None
   Batch size: 10000
   Smoothing batch size: 1000
   Max PCs: 200
   Seed: 14
   Output directory: cytotrace2_results
cytotrace2: Loading dataset
cytotrace2: Dataset characteristics
   Number of input genes: 47311
   Number of input cells: 5000
cytotrace2: The passed batch_size is greater than the number of cells in the subsample. Now setting batch_size to 5000.
cytotrace2: Preprocessing
cytotrace2: 36 cores detected
cytotrace2: Running 1 prediction batch(es) in parallel using 10 cores for smoothing per batch.
cytotrace2: Initiated processing batch 1/1 with 5000 cells
Mapped 14292 input gene names to mouse orthologs
14262 input genes are present in the model features.
|======================================================================| 100%

KeyError                                  Traceback (most recent call last)
Cell In[16], line 3
      1 from cytotrace2_py.cytotrace2_py import *
----> 3 results = cytotrace2("Data/adata_epi_alveolar_expdata_sample.txt",
      4                      annotation_path = "Data/adata_epi_alveolar_annotation_sample.txt",
      5                      species = "human"
      6                      )

File ~/mambaforge/envs/cytotrace2-py/lib/python3.9/site-packages/cytotrace2_py/cytotrace2_py.py:182, in cytotrace2(input_path, annotation_path, species, full_model, batch_size, smooth_batch_size, disable_parallelization, max_cores, max_pcs, seed, output_dir)
    179 os.remove(fin)
    181 predicted_df_final = pd.concat(predictions, ignore_index=False)
--> 182 predicted_df_final = predicted_df_final.loc[original_names]
    183 ranges = np.linspace(0, 1, 7)
    184 labels = [
    185 'Differentiated',
    186 'Unipotent',
   (...)
    189 'Pluripotent',
    190 'Totipotent']

File ~/mambaforge/envs/cytotrace2-py/lib/python3.9/site-packages/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
   1189 maybe_callable = com.apply_if_callable(key, self.obj)
   1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File ~/mambaforge/envs/cytotrace2-py/lib/python3.9/site-packages/pandas/core/indexing.py:1420, in _LocIndexer._getitem_axis(self, key, axis)
   1417 if hasattr(key, "ndim") and key.ndim > 1:
   1418     raise ValueError("Cannot index with multidimensional key")
-> 1420 return self._getitem_iterable(key, axis=axis)
   1422 # nested tuple slicing
   1423 if is_nested_tuple(key, labels):

File ~/mambaforge/envs/cytotrace2-py/lib/python3.9/site-packages/pandas/core/indexing.py:1360, in _LocIndexer._getitem_iterable(self, key, axis)
   1357 self._validate_key(key, axis)
   1359 # A collection of keys
-> 1360 keyarr, indexer = self._get_listlike_indexer(key, axis)
   1361 return self.obj._reindex_with_indexers(
   1362     {axis: [keyarr, indexer]}, copy=True, allow_dups=True
   1363 )

File ~/mambaforge/envs/cytotrace2-py/lib/python3.9/site-packages/pandas/core/indexing.py:1558, in _LocIndexer._get_listlike_indexer(self, key, axis)
   1555 ax = self.obj._get_axis(axis)
   1556 axis_name = self.obj._get_axis_name(axis)
-> 1558 keyarr, indexer = ax._get_indexer_strict(key, axis_name)
   1560 return keyarr, indexer

File ~/mambaforge/envs/cytotrace2-py/lib/python3.9/site-packages/pandas/core/indexes/base.py:6200, in Index._get_indexer_strict(self, key, axis_name)
   6197 else:
   6198     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6200 self._raise_if_missing(keyarr, indexer, axis_name)
   6202 keyarr = self.take(indexer)
   6203 if isinstance(key, Index):
   6204     # GH 42790 - Preserve name from an Index

File ~/mambaforge/envs/cytotrace2-py/lib/python3.9/site-packages/pandas/core/indexes/base.py:6252, in Index._raise_if_missing(self, key, indexer, axis_name)
   6249 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6251 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 6252 raise KeyError(f"{not_found} not in index")

KeyError: "['bcEOEW_NSC010.t2', 'bcGOJO_NSC010.t1', 'bcFDRC_NSC020.t2', 'bcEAHW_NSC010.t2', 'bcFSTV_NSC016.t3', 'bcDKOO_NSC010.t1', 'bcBOHF_NSC018.t1', 'bcFRSP_NSC019.t1', 'bcHKDE_NSC010.t2', 'bcAIVE_NSC010.t1', 'bcGXZH_NSC019.t1', 'bcFXYQ_NSC010.t2', 'bcBHQT_NSC021.t2', 'bcDTXJ_NSC016.t2', 'bcHYQD_NSC016.t3', 'bcFRUQ_NSC019.t2', 'bcAKUO_NSC010.t2', 'bcCMXV_NSC035.t1', 'bcCRZC_NSC010.t1', 'bcHZFD_NSC010.t1', 'bcGQAN_NSC020.t2', 'bcEFAV_NSC010.t1', 'bcGIYS_NSC010.t2', 'bcBKEG_NSC010.t2', 'bcHWLG_NSC010.t1', 'bcGOCG_NSC019.t2', 'bcIJHL_NSC010.t1', 'bcHNIK_NSC016.t2', 'bcAXYX_NSC016.t3', 'bcHBNH_NSC010.t2', 'bcEOMX_NSC010.t1', 'bcCOEW_NSC010.t1', 'bcDCAS_NSC020.t2', 'bcBAML_NSC018.t2', 'bcIHTH_NSC010.t2', 'bcFULN_NSC010.t1', 'bcCBYU_NSC010.t2', 'bcGMXP_NSC010.t1', 'bcCJLW_NSC010.t2', 'bcIAFX_NSC016.t2', 'bcDGBJ_NSC010.t2', 'bcBFPE_NSC019.t2', 'bcAPSH_NSC019.t1', 'bcEZNA_NSC010.t2', 'bcHTKK_NSC019.t2', 'bcEBEX_NSC019.t1', 'bcCCWF_NSC020.t2', 'bcGVRW_NSC010.t2', 'bcHIMV_NSC021.t1', 'bcIGRY_NSC010.t1', 'bcCFPL_NSC019.t2', 'bcIBWU_NSC010.t1', 'bcHEWC_NSC020.t1', 'bcAVAG_NSC010.t1', 'bcEFVU_NSC019.t1', 'bcHBGR_NSC010.t2', 'bcBKUO_NSC019.t1', 'bcHZOU_NSC021.t1', 'bcHRQA_NSC010.t1', 'bcFBHL_NSC021.t1', 'bcHEOO_NSC019.t1', 'bcDECB_NSC019.t1', 'bcBNHY_NSC021.t2', 'bcGTUC_NSC010.t2', 'bcICRW_NSC016.t3', 'bcAROJ_NSC021.t1', 'bcFEBX_NSC010.t1', 'bcGNJL_NSC016.t1', 'bcDQSI_NSC036.t1', 'bcHQMV_NSC010.t2', 'bcGKUL_NSC019.t1', 'bcDRDV_NSC010.t1', 'bcIFVH_NSC010.t1', 'bcEIUJ_NSC019.t2', 'bcGTIY_NSC010.t1', 'bcEPQW_NSC010.t2', 'bcGLJR_NSC016.t3', 'bcGOTX_NSC010.t2', 'bcCPYK_NSC010.t2', 'bcDGXC_NSC016.t2', 'bcIELM_NSC019.t1', 'bcAKAS_NSC019.t1', 'bcFHKF_NSC019.t2', 'bcFXNR_NSC019.t2', 'bcDDKJ_NSC010.t2', 'bcAKFM_NSC019.t1', 'bcBFOK_NSC010.t2', 'bcGWXH_NSC018.t2', 'bcHZAE_NSC010.t1', 'bcHAZX_NSC020.t1', 'bcHNRD_NSC019.t2', 'bcCDPU_NSC018.t3', 'bcHPKN_NSC016.t1', 'bcIGQK_NSC010.t2', 'bcFFNM_NSC010.t1', 'bcCZRO_NSC035.t1', 'bcBLTS_NSC010.t2', 'bcHTYZ_NSC019.t2', 'bcGUYL_NSC019.t1', 
'bcATBL_NSC016.t2', 'bcHAWL_NSC018.t3', 'bcHCAO_NSC010.t2', 'bcGXHJ_NSC010.t1', 'bcHXUM_NSC010.t1', 'bcGBHB_NSC016.t3', 'bcAXRN_NSC010.t1', 'bcDEHE_NSC019.t2', 'bcBPYZ_NSC010.t1', 'bcCVZC_NSC010.t1', 'bcCXTI_NSC021.t1', 'bcESBU_NSC016.t1', 'bcEXLK_NSC010.t1', 'bcGXSF_NSC010.t1', 'bcGCXJ_NSC021.t2', 'bcEOAX_NSC019.t2', 'bcHGIV_NSC018.t1', 'bcCRTO_NSC010.t2', 'bcHBBM_NSC010.t1', 'bcENUY_NSC010.t2', 'bcHGJR_NSC019.t1', 'bcBOHF_NSC018.t2', 'bcHTOP_NSC018.t2', 'bcGNHP_NSC010.t2', 'bcCVVE_NSC010.t2', 'bcGOAT_NSC010.t1', 'bcFDWC_NSC010.t1', 'bcAODT_NSC010.t2', 'bcGHTV_NSC016.t2', 'bcHESA_NSC010.t2', 'bcAAUF_NSC035.t1', 'bcCZVG_NSC010.t1', 'bcGZTG_NSC016.t2', 'bcHLTG_NSC010.t2', 'bcAOSZ_NSC010.t2', 'bcHJTY_NSC010.t2', 'bcBIWX_NSC018.t1', 'bcGFTV_NSC016.t2', 'bcHDSP_NSC010.t2', 'bcHVPN_NSC010.t1', 'bcCXWV_NSC010.t1', 'bcGNDM_NSC020.t1', 'bcGPPA_NSC016.t2', 'bcBFCS_NSC010.t1', 'bcGBAV_NSC019.t1', 'bcDUIY_NSC010.t2', 'bcHEKB_NSC010.t1', 'bcFCFY_NSC010.t1', 'bcCMCJ_NSC010.t2', 'bcGCHW_NSC010.t1', 'bcHLMR_NSC010.t2', 'bcEYZC_NSC010.t2', 'bcEBWS_NSC040.t1', 'bcGADN_NSC016.t2', 'bcGANQ_NSC010.t2', 'bcCFTX_NSC010.t1', 'bcDHEH_NSC018.t2', 'bcHJFV_NSC010.t2', 'bcAAZF_NSC019.t2', 'bcCVEY_NSC010.t1', 'bcHXNJ_NSC010.t2', 'bcEPVD_NSC010.t2', 'bcAFWJ_NSC021.t2', 'bcFRGM_NSC018.t3', 'bcEKST_NSC016.t3', 'bcFNSM_NSC016.t1', 'bcDGLQ_NSC010.t1', 'bcHPQR_NSC010.t2', 'bcBVEC_NSC010.t1', 'bcCEBE_NSC010.t1', 'bcBOGV_NSC010.t2', 'bcHRJJ_NSC010.t2', 'bcGZDK_NSC016.t3', 'bcGHGR_NSC010.t2', 'bcAFIR_NSC016.t1', 'bcHXPZ_NSC010.t1', 'bcIGOF_NSC018.t3', 'bcHNTK_NSC016.t3', 'bcBPOR_NSC019.t2', 'bcEOQG_NSC010.t1', 'bcDFOX_NSC016.t1', 'bcHDCX_NSC020.t2', 'bcBPQT_NSC019.t1', 'bcESLV_NSC016.t3', 'bcFNTX_NSC010.t1', 'bcAXYU_NSC010.t2', 'bcAZCM_NSC010.t2', 'bcHHDO_NSC010.t2', 'bcESGJ_NSC010.t1', 'bcFGES_NSC016.t1', 'bcDCMM_NSC040.t1', 'bcHLRQ_NSC010.t1', 'bcDEIK_NSC016.t2', 'bcAORX_NSC010.t2', 'bcHIYY_NSC010.t1', 'bcEFOK_NSC016.t1', 'bcFEWM_NSC010.t2', 'bcDDAB_NSC010.t2', 'bcCIIS_NSC019.t2', 'bcDXLV_NSC010.t1', 
'bcDAOD_NSC010.t2', 'bcEEZH_NSC020.t1', 'bcHPZT_NSC019.t2', 'bcGBLM_NSC010.t2', 'bcHPOC_NSC010.t1', 'bcHOZL_NSC010.t1', 'bcEMJE_NSC010.t2', 'bcIIKL_NSC010.t1', 'bcFZFM_NSC010.t2', 'bcCGCF_NSC010.t2', 'bcFUPD_NSC010.t1', 'bcGZVI_NSC016.t3', 'bcFEPP_NSC016.t2', 'bcESCR_NSC016.t3', 'bcEWEY_NSC019.t2', 'bcEYUP_NSC010.t2', 'bcCDOA_NSC010.t2', 'bcENPM_NSC021.t1', 'bcFQEK_NSC010.t2', 'bcHWMH_NSC010.t2', 'bcGKDB_NSC010.t1', 'bcBLAQ_NSC010.t2', 'bcELLS_NSC010.t1', 'bcFNDE_NSC016.t3', 'bcEADY_NSC010.t2', 'bcEQDD_NSC016.t3', 'bcGXFV_NSC018.t1', 'bcCLAZ_NSC016.t3', 'bcEPUB_NSC010.t2', 'bcHZDZ_NSC020.t1', 'bcGGZP_NSC010.t2', 'bcDJXQ_NSC021.t1', 'bcGOYZ_NSC016.t3', 'bcBCNJ_NSC010.t2', 'bcFLEV_NSC018.t1', 'bcAFAL_NSC019.t1', 'bcBONS_NSC016.t3', 'bcHQUE_NSC010.t2', 'bcICUF_NSC016.t3', 'bcEFKE_NSC010.t1', 'bcEURY_NSC010.t1', 'bcFKVP_NSC010.t1', 'bcDVIN_NSC010.t1', 'bcDNPZ_NSC019.t2', 'bcDCRB_NSC010.t1', 'bcFKQF_NSC010.t1', 'bcFVCF_NSC019.t2', 'bcFWUM_NSC010.t2', 'bcGIDS_NSC016.t2', 'bcHPNO_NSC010.t1', 'bcFKBB_NSC010.t1', 'bcEPDR_NSC010.t1', 'bcHYRJ_NSC019.t2', 'bcAVON_NSC010.t1', 'bcAVJR_NSC019.t2', 'bcFVDK_NSC010.t1', 'bcDMUV_NSC035.t1', 'bcGJJZ_NSC019.t1', 'bcHWLE_NSC010.t1', 'bcDCHJ_NSC010.t2', 'bcDLAZ_NSC010.t1', 'bcFXEG_NSC016.t3', 'bcFBSB_NSC010.t2', 'bcHUCJ_NSC010.t1', 'bcHBIJ_NSC019.t2', 'bcIIEW_NSC010.t1', 'bcDNJC_NSC010.t2', 'bcGSEM_NSC010.t2', 'bcCQPO_NSC010.t2', 'bcBQGD_NSC010.t1', 'bcEDJZ_NSC010.t2', 'bcDRBF_NSC010.t2', 'bcGYZY_NSC010.t1', 'bcFDWU_NSC019.t1'] not in index"
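
The KeyError above means that labels requested through `original_names` are absent from the index of `predicted_df_final`. A minimal sketch of how one might check for such a mismatch before calling `.loc` is given below. The frame, the cell IDs and the "-" versus "." discrepancy are hypothetical stand-ins chosen to mirror the renaming discussed earlier in the thread; the real index contents are not shown in this report.

```python
import pandas as pd

# Hypothetical predictions whose index uses "-" where the input used ".".
predicted = pd.DataFrame(
    {"score": [0.42, 0.87]},
    index=["bcAAAA_S1-t1", "bcBBBB_S1-t2"],
)
original_names = pd.Index(["bcAAAA_S1.t1", "bcBBBB_S1.t2"])

# Names that .loc would fail on: requested but not present in the index.
missing = original_names.difference(predicted.index)
if len(missing) > 0:
    print(f"{len(missing)} requested names are absent, e.g. {list(missing[:5])}")

# Assumed fix for this toy case only: undo the "-" substitution, then select.
realigned = predicted.copy()
realigned.index = realigned.index.str.replace("-", ".", regex=False)
ordered = realigned.loc[original_names]
print(ordered)
```

Comparing a few entries of `missing` with the actual index usually shows which character was altered, which is safer than guessing the substitution.
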