Teichlab / cellphonedb

MIT License
338 stars 105 forks source link

Speed up permutation analysis #288

Closed nh3 closed 2 years ago

nh3 commented 3 years ago

This PR speeds up permutation analysis by re-factoring p value calculations and re-implementing build_clusters(), mean_analysis(), percent_analysis(), build_percent_result(), get_significant_means() and other auxiliary functions. The idea is to 1) remove unnecessary slicing and copying of the counts matrix, 2) move part of p value calculation into each permutation to reduce the mount of data needed to return to the main process by 64x. 3) use fast vectorised calculation in place of for loops. In a local test, ~30x speed up and ~10x memory savings is observed for a dataset with 10k cells and 70+ cell types.

nh3 commented 3 years ago

Hi @zktuong @prete would you mind giving this a test?

prete commented 3 years ago

Hi @nh3 I noticed that cpdb_statistical_analysis_helper.py now imports numpy_groupies. Does this import require package numpy-groupies? if so requirements.txt/win-requirements.txt needs to be updated as well

zktuong commented 3 years ago

Is it possible to incorporate some change on PR #275 cellphonedb/src/core/methods/cpdb_statistical_analysis_complex_method.py to speed up the counts_validation step?

Basically it stops converting the numpy arrays to np.float32 twice which causes issues #282.

I can just delete PR #275 after that.

zktuong commented 3 years ago

and also the type conversion in cellphonedb/src/core/methods/method_launcher.py

prete commented 3 years ago

I'm going to publish this as-is as a CellPhoneDB 2.1.8 beta version 1 so it can passed around for beta testing before merging this to master.

Please feel free to share that with users who can test this for us.

EDIT: refer to latter comment

nh3 commented 3 years ago

Thanks @prete and @zktuong !

nh3 commented 3 years ago

.to_numpy() seems to cause shuffle_meta() not actually shuffling in spawned processes. have to revert back to using .values.

nh3 commented 3 years ago

I'm going to publish this as-is as a CellPhoneDB 2.1.8 beta version 1 so it can passed around for beta testing before merging this to master.

conda create -n cpdb-test python=3.8
conda activate cpdb-test
pip install -U CellPhoneDB==2.1.8b1

Please feel free to share that with users who can test this for us.

Hi @prete , could you make commit 73447f7 as a beta release instead please? It reverts to use .values for shuffle_meta() and included a fix to make sure results are correct.

Many thanks!

prete commented 3 years ago

Hi @prete , could you make commit 73447f7 as a beta release instead please? It reverts to use .values for shuffle_meta() and included a fix to make sure results are correct.

CellPhoneDB==2.1.8b1 was build based on d034551. So it does use .values, but it doesn't have meta = meta.loc[counts.columns]. You need both, right? If so, I'll delete release 2.1.8b1 and re-publish it.

nh3 commented 3 years ago

CellPhoneDB==2.1.8b1 was build based on d034551. So it does use .values, but it doesn't have meta = meta.loc[counts.columns]. You need both, right? If so, I'll delete release 2.1.8b1 and re-publish it.

Yes, please. Many thanks!

prete commented 3 years ago

@nh3 can't overwrite an already published version, so for testing purposes please refer to 2.1.8 beta version 2

conda create -n cpdb-test python=3.8
conda activate cpdb-test
pip install -U CellPhoneDB==2.1.8b2
Victor21v commented 3 years ago

Hi,

I want to point out that when testing either 2.1.8b2 or 2.1.8b1, I get the following error when running analysis using as input meta data and h5ad file:

[ ][APP][20/03/21-17:54:19][WARNING] Latest local available version is v2.0.0, using it [ ][APP][20/03/21-17:54:19][WARNING] User selected downloaded database v2.0.0 is available, using it [ ][CORE][20/03/21-17:54:19][INFO] Initializing SqlAlchemy CellPhoneDB Core [ ][CORE][20/03/21-17:54:19][INFO] Using custom database at /home/devel/tin/.cpdb/releases/v2.0.0/cellphone.db [ ][APP][20/03/21-17:54:19][INFO] Launching Method cpdb_statistical_analysis_local_method_launcher [ ][APP][20/03/21-17:54:19][INFO] Launching Method _set_paths [ ][APP][20/03/21-17:54:19][INFO] Launching Method _load_meta_counts [ ][CORE][20/03/21-17:56:15][INFO] Launching Method cpdb_statistical_analysis_launcher [ ][CORE][20/03/21-17:56:15][INFO] Launching Method _counts_validations [ ][APP][20/03/21-17:56:28][ERROR] Invalid Counts data: Some cells in meta didnt exist in counts columns. Maybe incorrect file format.

This does not happen on the version 2.1.7 with the same input and the analysis runs perfect on that version

zktuong commented 3 years ago

Hi,

I want to point out that when testing either 2.1.8b2 or 2.1.8b1, I get the following error when running analysis using as input meta data and h5ad file:

[ ][APP][20/03/21-17:54:19][WARNING] Latest local available version is v2.0.0, using it [ ][APP][20/03/21-17:54:19][WARNING] User selected downloaded database v2.0.0 is available, using it [ ][CORE][20/03/21-17:54:19][INFO] Initializing SqlAlchemy CellPhoneDB Core [ ][CORE][20/03/21-17:54:19][INFO] Using custom database at /home/devel/tin/.cpdb/releases/v2.0.0/cellphone.db [ ][APP][20/03/21-17:54:19][INFO] Launching Method cpdb_statistical_analysis_local_method_launcher [ ][APP][20/03/21-17:54:19][INFO] Launching Method _set_paths [ ][APP][20/03/21-17:54:19][INFO] Launching Method _load_meta_counts [ ][CORE][20/03/21-17:56:15][INFO] Launching Method cpdb_statistical_analysis_launcher [ ][CORE][20/03/21-17:56:15][INFO] Launching Method _counts_validations [ ][APP][20/03/21-17:56:28][ERROR] Invalid Counts data: Some cells in meta didnt exist in counts columns. Maybe incorrect file format.

This does not happen on the version 2.1.7 with the same input and the analysis runs perfect on that version

I can't reproduce this error.

The error message is coming from:

        meta.index = meta.index.astype(str)
        if np.any(~counts.columns.isin(meta.index)):
            raise ParseCountsException('Some cells in meta didnt exist in counts columns',
                                       'Maybe incorrect file format')

which means there's a mismatch in the rownames for your metadata file (which should be a two column file with no row numbers; cell barcodes are in the first column), and the cell barcodes in the columns of the expression matrix (or indices in the .obs slot in your h5ad file).

nh3 commented 3 years ago

Thank you @Victor21v and @zktuong. There is an unintentional change of logic in that code that eluded my attention. It used to require cells in meta exist in counts, however it has now become requiring cells in counts exist in meta. I guess there is value in allowing meta to contain a subset of cells in counts, i.e. testing on only a subset of data, as it's much cheaper/more flexible to change meta than counts. I've restored the previous behaviour.

Victor21v commented 3 years ago

Thank you @Victor21v and @zktuong. There is an unintentional change of logic in that code that eluded my attention. It used to require cells in meta exist in counts, however it has now become requiring cells in counts exist in meta. I guess there is value in allowing meta to contain a subset of cells in counts, i.e. testing on only a subset of data, as it's much cheaper/more flexible to change meta than counts. I've restored the previous behaviour.

Thanks @nh3

Now I have another error when starting the analysis:

[ ][APP][22/03/21-14:31:41][WARNING] Latest local available version is `v2.0.0`, using it
[ ][APP][22/03/21-14:31:41][WARNING] User selected downloaded database `v2.0.0` is available, using it
[ ][CORE][22/03/21-14:31:41][INFO] Initializing SqlAlchemy CellPhoneDB Core
[ ][CORE][22/03/21-14:31:41][INFO] Using custom database at /home/devel/tin/.cpdb/releases/v2.0.0/cellphone.db
[ ][APP][22/03/21-14:31:41][INFO] Launching Method cpdb_statistical_analysis_local_method_launcher
[ ][APP][22/03/21-14:31:41][INFO] Launching Method _set_paths
[ ][APP][22/03/21-14:31:41][INFO] Launching Method _load_meta_counts
[ ][CORE][22/03/21-14:33:34][INFO] Launching Method cpdb_statistical_analysis_launcher
[ ][CORE][22/03/21-14:33:34][INFO] Launching Method _counts_validations
[ ][CORE][22/03/21-14:33:59][INFO] [Cluster Statistical Analysis] Threshold:0.1 Iterations:1000 Debug-seed:-1 Threads:24 Precision:3
[ ][APP][22/03/21-14:34:17][ERROR] Unexpected error
Traceback (most recent call last):
  File "/home/devel/tin/miniconda3/envs/cellphone_v2_h5ad/lib/python3.8/site-packages/cellphonedb/src/api_endpoints/terminal_api/method_terminal_api_endpoints/method_terminal_commands.py", line 127, in statistical_analysis
    LocalMethodLauncher(cpdb_app.create_app(verbose, database)). \
  File "/home/devel/tin/miniconda3/envs/cellphone_v2_h5ad/lib/python3.8/site-packages/cellphonedb/src/local_launchers/local_method_launcher.py", line 54, in cpdb_statistical_analysis_local_method_launcher
    self.cellphonedb_app.method.cpdb_statistical_analysis_launcher(
  File "/home/devel/tin/miniconda3/envs/cellphone_v2_h5ad/lib/python3.8/site-packages/cellphonedb/src/core/methods/method_launcher.py", line 63, in cpdb_statistical_analysis_launcher
    cpdb_statistical_analysis_method.call(meta,
  File "/home/devel/tin/miniconda3/envs/cellphone_v2_h5ad/lib/python3.8/site-packages/cellphonedb/src/core/methods/cpdb_statistical_analysis_method.py", line 23, in call
    cpdb_statistical_analysis_complex_method.call(meta.copy(),
  File "/home/devel/tin/miniconda3/envs/cellphone_v2_h5ad/lib/python3.8/site-packages/cellphonedb/src/core/methods/cpdb_statistical_analysis_complex_method.py", line 67, in call
    meta = meta.loc[counts.columns]
  File "/home/devel/tin/miniconda3/envs/cellphone_v2_h5ad/lib/python3.8/site-packages/pandas/core/indexing.py", line 879, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/home/devel/tin/miniconda3/envs/cellphone_v2_h5ad/lib/python3.8/site-packages/pandas/core/indexing.py", line 1099, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "/home/devel/tin/miniconda3/envs/cellphone_v2_h5ad/lib/python3.8/site-packages/pandas/core/indexing.py", line 1037, in _getitem_iterable
    keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
  File "/home/devel/tin/miniconda3/envs/cellphone_v2_h5ad/lib/python3.8/site-packages/pandas/core/indexing.py", line 1254, in _get_listlike_indexer
    self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
  File "/home/devel/tin/miniconda3/envs/cellphone_v2_h5ad/lib/python3.8/site-packages/pandas/core/indexing.py", line 1315, in _validate_read_indexer
    raise KeyError(
KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Index(['izi9unx1_8qdzhivu_AAACCCACATGACGTT-1'], dtype='object'). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
nh3 commented 3 years ago

Sorry there's bug in the previous change. I subsetted the rows rather than columns of the counts. Now fixed. @Victor21v please try again.

Victor21v commented 3 years ago

Sorry there's bug in the previous change. I subsetted the rows rather than columns of the counts. Now fixed. @Victor21v please try again.

Now it runs perfect, thanks!

prete commented 3 years ago

I tried running without using the statistical method:

cellphonedb method analysis test_meta.txt test_counts.txt 

and got

[ ][APP][23/03/21-11:01:28][WARNING] Latest local available version is `v2.0.0`, using it
[ ][APP][23/03/21-11:01:28][WARNING] User selected downloaded database `v2.0.0` is available, using it
[ ][CORE][23/03/21-11:01:28][INFO] Initializing SqlAlchemy CellPhoneDB Core
[ ][CORE][23/03/21-11:01:28][INFO] Using custom database at /home/jovyan/.cpdb/releases/v2.0.0/cellphone.db
[ ][APP][23/03/21-11:01:28][INFO] Launching Method cpdb_analysis_local_method_launcher
[ ][APP][23/03/21-11:01:28][INFO] Launching Method _set_paths
[ ][APP][23/03/21-11:01:28][WARNING] Output directory (/home/jovyan/projects/cellphonedb/broken/out) exist and is not empty. Result can overwrite old results
[ ][APP][23/03/21-11:01:28][INFO] Launching Method _load_meta_counts
[ ][CORE][23/03/21-11:01:28][INFO] Launching Method cpdb_method_analysis_launcher
[ ][CORE][23/03/21-11:01:28][INFO] Launching Method _counts_validations
[ ][CORE][23/03/21-11:01:28][INFO] [Non Statistical Method] Threshold:0.1 Precision:3
[ ][APP][23/03/21-11:01:28][ERROR] Unexpected error
Traceback (most recent call last):
  File "/home/jovyan/projects/cellphonedb/core_method_refactor/cellphonedb/src/api_endpoints/terminal_api/method_terminal_api_endpoints/method_terminal_commands.py", line 195, in analysis
    LocalMethodLauncher(cpdb_app.create_app(verbose,
  File "/home/jovyan/projects/cellphonedb/core_method_refactor/cellphonedb/src/local_launchers/local_method_launcher.py", line 93, in cpdb_analysis_local_method_launcher
    self.cellphonedb_app.method.cpdb_method_analysis_launcher(meta,
  File "/home/jovyan/projects/cellphonedb/core_method_refactor/cellphonedb/src/core/methods/method_launcher.py", line 104, in cpdb_method_analysis_launcher
    means, significant_means, deconvoluted = cpdb_analysis_method.call(
  File "/home/jovyan/projects/cellphonedb/core_method_refactor/cellphonedb/src/core/methods/cpdb_analysis_method.py", line 17, in call
    cpdb_analysis_complex_method.call(meta.copy(),
  File "/home/jovyan/projects/cellphonedb/core_method_refactor/cellphonedb/src/core/methods/cpdb_analysis_complex_method.py", line 53, in call
    clusters = cpdb_statistical_analysis_helper.build_clusters(meta, counts_filtered, complex_composition_filtered)
TypeError: build_clusters() missing 1 required positional argument: 'skip_percent'

Like the exception suggests, build_clusters is missing skip_percent argument on line 53 of src/core/methods/cpdb_analysis_complex_method.py:

    clusters = cpdb_statistical_analysis_helper.build_clusters(meta, counts_filtered, complex_composition_filtered)
    core_logger.info('Running Real Analysis')

I'm not sure if it needs to be skip_percent=True or skip_percent=False...

nh3 commented 3 years ago

I haven't looked at the non-statistical part at all. Apparently it shares some of the functions with statistical analysis which I totally overlooked. Thank you for bringing this up! I'll work on that.

nh3 commented 3 years ago

Now fixed. @prete could you give it a test?

prete commented 3 years ago

Now fixed. @prete could you give it a test?

Hi @nh3, looks like it's working now 👍

luzgaral commented 3 years ago

Hello,

Apologies for the delay.

I've compared the significan_means.txt between CellPhoneDB==2.1.7b1 and CellPhoneDB==2.1.8b3 using a toy example. With CellPhoneDB==2.1.8b3, there are more NAs on average, although the increase is very subtle (0.3%). The mean values of non-NAs match 100%.

Could this be due to rounding/number of digits used in the comparison? Can anyone compare the number of NAs in significan_means.txt using another example?

Thank you!

prete commented 3 years ago

I took the test dataset, luzgaral's toy dataset (endometrium) and the dataset from Issue 303 and run them (30, 15 and 15 times) and this is what I got benchmark_v2.1.7_vs_v.2.1.8b3.zip

dataset NA diff
test 0.0605%
issue 303 0.0012%
endometrium 0.3055%

I don't know why this version keeps getting that tiny fraction of NA higher, but for what it's worth I think its mergeable.

prete commented 2 years ago

All changes previously discussed are part of https://github.com/ventolab/CellphoneDB/pull/1