nh3 closed this 2 years ago.
Hi @zktuong @prete would you mind giving this a test?
Hi @nh3, I noticed that `cpdb_statistical_analysis_helper.py` now imports `numpy_groupies`. Does this import require the package `numpy-groupies`? If so, `requirements.txt`/`win-requirements.txt` need to be updated as well.
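For context, `numpy_groupies` provides a fast `aggregate()` function for group-wise reductions over numpy arrays. A minimal sketch of the kind of group-wise mean it accelerates, shown here in pure numpy so it runs without the extra dependency (toy values, not the actual helper code):

```python
import numpy as np

# Toy data: a cluster assignment per cell, and one value per cell.
group_idx = np.array([0, 0, 1, 1, 1])
values = np.array([1.0, 3.0, 2.0, 4.0, 6.0])

# Pure-numpy equivalent of numpy_groupies.aggregate(group_idx, values, func='mean'):
# per-group sums divided by per-group sizes.
sums = np.bincount(group_idx, weights=values)
sizes = np.bincount(group_idx)
group_means = sums / sizes
print(group_means)  # [2. 4.]
```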
Is it possible to incorporate the change from PR #275 to `cellphonedb/src/core/methods/cpdb_statistical_analysis_complex_method.py` to speed up the counts-validation step? Basically, it stops converting the numpy arrays to `np.float32` twice (which causes issue #282), and also removes the type conversion in `cellphonedb/src/core/methods/method_launcher.py`. I can just delete PR #275 after that.
I'm going to publish this as-is as a CellPhoneDB 2.1.8 beta version 1 so it can be passed around for beta testing before merging this to master.
Please feel free to share that with users who can test this for us.
EDIT: refer to the later comment
Thanks @prete and @zktuong !
`.to_numpy()` seems to cause `shuffle_meta()` to not actually shuffle in spawned processes; I have to revert back to using `.values`.
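For what it's worth, a hypothetical sketch of a `shuffle_meta`-style helper that sidesteps the view-vs-copy question entirely by permuting the labels explicitly and reassigning them (the function name mirrors the one in the thread, but the shape and column names are illustrative, not the actual CellPhoneDB code):

```python
import numpy as np
import pandas as pd

def shuffle_meta(meta: pd.DataFrame, rng: np.random.Generator) -> pd.DataFrame:
    # Return a copy of meta with the cell-type labels permuted across cells.
    shuffled = meta.copy()
    shuffled['cell_type'] = rng.permutation(meta['cell_type'].values)
    return shuffled

meta = pd.DataFrame({'cell_type': ['A', 'A', 'B', 'B']},
                    index=['c1', 'c2', 'c3', 'c4'])
out = shuffle_meta(meta, np.random.default_rng(0))
# The cell order is preserved; only the label assignment is permuted.
print(sorted(out['cell_type']))  # ['A', 'A', 'B', 'B']
```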
I'm going to publish this as-is as a CellPhoneDB 2.1.8 beta version 1 so it can be passed around for beta testing before merging this to master.
conda create -n cpdb-test python=3.8
conda activate cpdb-test
pip install -U CellPhoneDB==2.1.8b1
Please feel free to share that with users who can test this for us.
Hi @prete, could you make commit 73447f7 a beta release instead, please? It reverts to using `.values` for `shuffle_meta()` and includes a fix to make sure results are correct.
Many thanks!
`CellPhoneDB==2.1.8b1` was built based on d034551. So it does use `.values`, but it doesn't have `meta = meta.loc[counts.columns]`. You need both, right? If so, I'll delete release 2.1.8b1 and re-publish it.
Yes, please. Many thanks!
@nh3 I can't overwrite an already published version, so for testing purposes please refer to 2.1.8 beta version 2:
conda create -n cpdb-test python=3.8
conda activate cpdb-test
pip install -U CellPhoneDB==2.1.8b2
Hi,
I want to point out that when testing either 2.1.8b2 or 2.1.8b1, I get the following error when running the analysis with metadata and an h5ad file as input:
[ ][APP][20/03/21-17:54:19][WARNING] Latest local available version is v2.0.0, using it
[ ][APP][20/03/21-17:54:19][WARNING] User selected downloaded database v2.0.0 is available, using it
[ ][CORE][20/03/21-17:54:19][INFO] Initializing SqlAlchemy CellPhoneDB Core
[ ][CORE][20/03/21-17:54:19][INFO] Using custom database at /home/devel/tin/.cpdb/releases/v2.0.0/cellphone.db
[ ][APP][20/03/21-17:54:19][INFO] Launching Method cpdb_statistical_analysis_local_method_launcher
[ ][APP][20/03/21-17:54:19][INFO] Launching Method _set_paths
[ ][APP][20/03/21-17:54:19][INFO] Launching Method _load_meta_counts
[ ][CORE][20/03/21-17:56:15][INFO] Launching Method cpdb_statistical_analysis_launcher
[ ][CORE][20/03/21-17:56:15][INFO] Launching Method _counts_validations
[ ][APP][20/03/21-17:56:28][ERROR] Invalid Counts data: Some cells in meta didnt exist in counts columns. Maybe incorrect file format.
This does not happen on version 2.1.7 with the same input; the analysis runs perfectly on that version.
I can't reproduce this error.
The error message is coming from:
meta.index = meta.index.astype(str)
if np.any(~counts.columns.isin(meta.index)):
raise ParseCountsException('Some cells in meta didnt exist in counts columns',
'Maybe incorrect file format')
which means there's a mismatch between the rownames of your metadata file (which should be a two-column file with no row numbers, with the cell barcodes in the first column) and the cell barcodes in the columns of the expression matrix (or the indices of the .obs slot in your h5ad file).
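As a quick self-check before running, that validation can be reproduced in a couple of lines of pandas (a sketch with toy data, not CellPhoneDB's own API):

```python
import pandas as pd

# Toy stand-ins: a genes-x-cells counts matrix and a barcode-indexed metadata table.
counts = pd.DataFrame([[1, 2], [3, 4]],
                      index=['geneA', 'geneB'], columns=['cell1', 'cell2'])
meta = pd.DataFrame({'cell_type': ['T', 'B']}, index=['cell1', 'cell2'])

# Same condition as the validation above: every counts column must appear in meta's index.
missing = counts.columns[~counts.columns.isin(meta.index.astype(str))]
print(len(missing))  # 0 means these inputs would pass the check
```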
Thank you @Victor21v and @zktuong. There is an unintentional change of logic in that code that eluded my attention. It used to require that cells in `meta` exist in `counts`; however, it has now become a requirement that cells in `counts` exist in `meta`. I guess there is value in allowing `meta` to contain a subset of the cells in `counts`, i.e. testing on only a subset of the data, as it's much cheaper and more flexible to change `meta` than `counts`. I've restored the previous behaviour.
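Under the restored behaviour the check runs in the other direction: every cell named in `meta` must exist in `counts`, while `counts` may contain extra cells that are simply dropped. A sketch of the intent (toy data, not the actual source):

```python
import pandas as pd

counts = pd.DataFrame([[1, 2, 3]], index=['geneA'], columns=['c1', 'c2', 'c3'])
meta = pd.DataFrame({'cell_type': ['T', 'B']}, index=['c1', 'c3'])  # a subset of the cells

# Every meta cell must be present among the counts columns...
assert meta.index.isin(counts.columns).all()
# ...and counts is then restricted to the annotated cells, in meta's order.
counts = counts.loc[:, meta.index]
print(list(counts.columns))  # ['c1', 'c3']
```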
Thanks @nh3
Now I have another error when starting the analysis:
[ ][APP][22/03/21-14:31:41][WARNING] Latest local available version is `v2.0.0`, using it
[ ][APP][22/03/21-14:31:41][WARNING] User selected downloaded database `v2.0.0` is available, using it
[ ][CORE][22/03/21-14:31:41][INFO] Initializing SqlAlchemy CellPhoneDB Core
[ ][CORE][22/03/21-14:31:41][INFO] Using custom database at /home/devel/tin/.cpdb/releases/v2.0.0/cellphone.db
[ ][APP][22/03/21-14:31:41][INFO] Launching Method cpdb_statistical_analysis_local_method_launcher
[ ][APP][22/03/21-14:31:41][INFO] Launching Method _set_paths
[ ][APP][22/03/21-14:31:41][INFO] Launching Method _load_meta_counts
[ ][CORE][22/03/21-14:33:34][INFO] Launching Method cpdb_statistical_analysis_launcher
[ ][CORE][22/03/21-14:33:34][INFO] Launching Method _counts_validations
[ ][CORE][22/03/21-14:33:59][INFO] [Cluster Statistical Analysis] Threshold:0.1 Iterations:1000 Debug-seed:-1 Threads:24 Precision:3
[ ][APP][22/03/21-14:34:17][ERROR] Unexpected error
Traceback (most recent call last):
File "/home/devel/tin/miniconda3/envs/cellphone_v2_h5ad/lib/python3.8/site-packages/cellphonedb/src/api_endpoints/terminal_api/method_terminal_api_endpoints/method_terminal_commands.py", line 127, in statistical_analysis
LocalMethodLauncher(cpdb_app.create_app(verbose, database)). \
File "/home/devel/tin/miniconda3/envs/cellphone_v2_h5ad/lib/python3.8/site-packages/cellphonedb/src/local_launchers/local_method_launcher.py", line 54, in cpdb_statistical_analysis_local_method_launcher
self.cellphonedb_app.method.cpdb_statistical_analysis_launcher(
File "/home/devel/tin/miniconda3/envs/cellphone_v2_h5ad/lib/python3.8/site-packages/cellphonedb/src/core/methods/method_launcher.py", line 63, in cpdb_statistical_analysis_launcher
cpdb_statistical_analysis_method.call(meta,
File "/home/devel/tin/miniconda3/envs/cellphone_v2_h5ad/lib/python3.8/site-packages/cellphonedb/src/core/methods/cpdb_statistical_analysis_method.py", line 23, in call
cpdb_statistical_analysis_complex_method.call(meta.copy(),
File "/home/devel/tin/miniconda3/envs/cellphone_v2_h5ad/lib/python3.8/site-packages/cellphonedb/src/core/methods/cpdb_statistical_analysis_complex_method.py", line 67, in call
meta = meta.loc[counts.columns]
File "/home/devel/tin/miniconda3/envs/cellphone_v2_h5ad/lib/python3.8/site-packages/pandas/core/indexing.py", line 879, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/home/devel/tin/miniconda3/envs/cellphone_v2_h5ad/lib/python3.8/site-packages/pandas/core/indexing.py", line 1099, in _getitem_axis
return self._getitem_iterable(key, axis=axis)
File "/home/devel/tin/miniconda3/envs/cellphone_v2_h5ad/lib/python3.8/site-packages/pandas/core/indexing.py", line 1037, in _getitem_iterable
keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
File "/home/devel/tin/miniconda3/envs/cellphone_v2_h5ad/lib/python3.8/site-packages/pandas/core/indexing.py", line 1254, in _get_listlike_indexer
self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
File "/home/devel/tin/miniconda3/envs/cellphone_v2_h5ad/lib/python3.8/site-packages/pandas/core/indexing.py", line 1315, in _validate_read_indexer
raise KeyError(
KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Index(['izi9unx1_8qdzhivu_AAACCCACATGACGTT-1'], dtype='object'). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
Sorry, there's a bug in the previous change: I subsetted the rows rather than the columns of the counts. Now fixed. @Victor21v please try again.
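The distinction is easy to trip over in pandas: `df.loc[keys]` selects rows, while `df.loc[:, keys]` selects columns. A minimal illustration with toy data (not the actual fix):

```python
import pandas as pd

counts = pd.DataFrame([[1, 2], [3, 4]],
                      index=['geneA', 'geneB'], columns=['c1', 'c2'])
cells = ['c1', 'c2']

# counts.loc[cells] would look for 'c1'/'c2' among the gene *rows* and raise a
# KeyError (the error in the traceback above); subsetting cells means selecting
# on the column axis instead:
subset = counts.loc[:, cells]
print(subset.shape)  # (2, 2): all genes, the requested cells
```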
Now it runs perfectly, thanks!
I tried running without using the statistical method:
cellphonedb method analysis test_meta.txt test_counts.txt
and got
[ ][APP][23/03/21-11:01:28][WARNING] Latest local available version is `v2.0.0`, using it
[ ][APP][23/03/21-11:01:28][WARNING] User selected downloaded database `v2.0.0` is available, using it
[ ][CORE][23/03/21-11:01:28][INFO] Initializing SqlAlchemy CellPhoneDB Core
[ ][CORE][23/03/21-11:01:28][INFO] Using custom database at /home/jovyan/.cpdb/releases/v2.0.0/cellphone.db
[ ][APP][23/03/21-11:01:28][INFO] Launching Method cpdb_analysis_local_method_launcher
[ ][APP][23/03/21-11:01:28][INFO] Launching Method _set_paths
[ ][APP][23/03/21-11:01:28][WARNING] Output directory (/home/jovyan/projects/cellphonedb/broken/out) exist and is not empty. Result can overwrite old results
[ ][APP][23/03/21-11:01:28][INFO] Launching Method _load_meta_counts
[ ][CORE][23/03/21-11:01:28][INFO] Launching Method cpdb_method_analysis_launcher
[ ][CORE][23/03/21-11:01:28][INFO] Launching Method _counts_validations
[ ][CORE][23/03/21-11:01:28][INFO] [Non Statistical Method] Threshold:0.1 Precision:3
[ ][APP][23/03/21-11:01:28][ERROR] Unexpected error
Traceback (most recent call last):
File "/home/jovyan/projects/cellphonedb/core_method_refactor/cellphonedb/src/api_endpoints/terminal_api/method_terminal_api_endpoints/method_terminal_commands.py", line 195, in analysis
LocalMethodLauncher(cpdb_app.create_app(verbose,
File "/home/jovyan/projects/cellphonedb/core_method_refactor/cellphonedb/src/local_launchers/local_method_launcher.py", line 93, in cpdb_analysis_local_method_launcher
self.cellphonedb_app.method.cpdb_method_analysis_launcher(meta,
File "/home/jovyan/projects/cellphonedb/core_method_refactor/cellphonedb/src/core/methods/method_launcher.py", line 104, in cpdb_method_analysis_launcher
means, significant_means, deconvoluted = cpdb_analysis_method.call(
File "/home/jovyan/projects/cellphonedb/core_method_refactor/cellphonedb/src/core/methods/cpdb_analysis_method.py", line 17, in call
cpdb_analysis_complex_method.call(meta.copy(),
File "/home/jovyan/projects/cellphonedb/core_method_refactor/cellphonedb/src/core/methods/cpdb_analysis_complex_method.py", line 53, in call
clusters = cpdb_statistical_analysis_helper.build_clusters(meta, counts_filtered, complex_composition_filtered)
TypeError: build_clusters() missing 1 required positional argument: 'skip_percent'
Like the exception suggests, `build_clusters` is missing the `skip_percent` argument on line 53 of `src/core/methods/cpdb_analysis_complex_method.py`:
clusters = cpdb_statistical_analysis_helper.build_clusters(meta, counts_filtered, complex_composition_filtered)
core_logger.info('Running Real Analysis')
I'm not sure if it needs to be `skip_percent=True` or `skip_percent=False`...
I haven't looked at the non-statistical part at all. Apparently it shares some functions with the statistical analysis, which I totally overlooked. Thank you for bringing this up! I'll work on it.
Now fixed. @prete could you give it a test?
Hi @nh3, looks like it's working now 👍
Hello,
Apologies for the delay.
I've compared the `significant_means.txt` output between CellPhoneDB==2.1.7b1 and CellPhoneDB==2.1.8b3 using a toy example.
With CellPhoneDB==2.1.8b3 there are more NAs on average, although the increase is very subtle (0.3%). The mean values of the non-NA entries match 100%.
Could this be due to rounding/number of digits used in the comparison?
Can anyone compare the number of NAs in `significant_means.txt` using another example?
Thank you!
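One way to quantify the difference is to count NAs in the two result tables and express the gap as a fraction of all entries. A hedged sketch with in-memory toy tables standing in for the two versions' output files (not the real data):

```python
import pandas as pd

# Toy stand-ins for the significant-means tables from the two versions.
v217 = pd.DataFrame({'pair1': [0.5, None], 'pair2': [0.1, 0.2]})
v218 = pd.DataFrame({'pair1': [0.5, None], 'pair2': [None, 0.2]})

na_217 = v217.isna().sum().sum()
na_218 = v218.isna().sum().sum()
# NA difference as a percentage of all table entries.
frac_diff = (na_218 - na_217) / v217.size * 100
print(na_217, na_218, frac_diff)  # 1 2 25.0
```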
I took the test dataset, luzgaral's toy dataset (endometrium) and the dataset from issue #303, ran them (30, 15 and 15 times respectively), and this is what I got: benchmark_v2.1.7_vs_v.2.1.8b3.zip
| dataset | NA diff |
|---|---|
| test | 0.0605% |
| issue 303 | 0.0012% |
| endometrium | 0.3055% |
I don't know why this version keeps getting that tiny fraction of extra NAs, but for what it's worth I think it's mergeable.
All changes previously discussed are part of https://github.com/ventolab/CellphoneDB/pull/1
This PR speeds up the permutation analysis by refactoring the p-value calculations and re-implementing build_clusters(), mean_analysis(), percent_analysis(), build_percent_result(), get_significant_means() and other auxiliary functions. The idea is to 1) remove unnecessary slicing and copying of the counts matrix, 2) move part of the p-value calculation into each permutation, reducing the amount of data that needs to be returned to the main process by 64x, and 3) use fast vectorised calculations in place of for loops. In a local test, a ~30x speed-up and ~10x memory savings are observed for a dataset with 10k cells and 70+ cell types.
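The "fast vectorised calculation in place of for loops" idea can be sketched as computing all cluster means at once with a single matrix product instead of looping over clusters (a toy illustration of the technique, not the PR's actual code):

```python
import numpy as np

# Toy genes-x-cells matrix and a cluster label per cell.
counts = np.array([[1., 3., 2., 4.],
                   [0., 2., 4., 6.]])
cluster = np.array([0, 0, 1, 1])
n_clusters = 2

# A one-hot membership matrix (cells x clusters), normalised by cluster size,
# turns the per-cluster averaging loop into one matrix multiplication.
onehot = np.eye(n_clusters)[cluster]              # shape (4, 2)
means = counts @ (onehot / onehot.sum(axis=0))    # genes x clusters
print(means)  # per-cluster mean expression, shape (genes, clusters)
```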