bachi55 opened 3 years ago
So far, we have achieved a speed-up by a factor of 4 with very simple modifications. There is still room for improvement.
Top list of the most run-time-consuming parts:
Function | Call Count | Time (ms) | Own Time (ms) | Comment
---|---|---|---|---
sqlite3.Connection (execute) | 261257 | 26353 | 26353 (29.6%) | Data loading from the SQLite
scipy.spatial._distance_wrap.cdist_cityblock | 87952 | 8358 | 8358 (9.4%) | Molecule kernel calculation
_get_molecule_feature_matrix | 85789 | 14610 | 8167 (9.2%) | Data loading from the SQLite
numpy.ufunc (reduce) | 363454 | 2834 | 2834 (3.2%) | |
generalized_tanimoto_kernel | 87952 | 30925 | 2597 (2.9%) | Molecule kernel calculation
str (split) | 16533569 | 2259 | 2259 (2.5%) | |
_get_lambda_delta | 84784 | 34755 | 2024 | |
listcomp | 89728 | 1815 | 1815 | |
sqlite3.Cursor (fetchall) | 171882 | 1741 | 1741 | |
check_array | 176366 | 10176 | 1652 | |
numpy.array | 923289 | 1531 | 1530 | |
_assert_all_finite | 176366 | 5104 | 1483 | |
inner_f | 440877 | 25802 | 1312 | |
cdist | 87952 | 11057 | 897 | |
get_molecule_features_by_molecule_id | 84928 | 41548 | 888 | |
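A large share of the total time goes into `sqlite3.Connection.execute`, which is called roughly 261k times, i.e. close to once per molecule. A common remedy is to batch many per-molecule lookups into a single query with an `IN` clause. A minimal sketch, assuming a hypothetical `fingerprints(molecule_id, features)` table (the actual schema and function names in the code base may differ):

```python
import sqlite3

def load_features_batched(conn, mol_ids, chunk_size=500):
    """Fetch the feature strings of many molecules with few queries,
    instead of issuing one SELECT per molecule (hypothetical schema)."""
    features = {}
    for i in range(0, len(mol_ids), chunk_size):
        chunk = mol_ids[i:i + chunk_size]
        # One placeholder per id in this chunk, e.g. "?,?,?"
        placeholders = ",".join("?" * len(chunk))
        rows = conn.execute(
            "SELECT molecule_id, features FROM fingerprints "
            "WHERE molecule_id IN (%s)" % placeholders, chunk).fetchall()
        for mol_id, feat in rows:
            features[mol_id] = feat
    return features
```

Chunking keeps the number of bound parameters below SQLite's per-statement limit while still reducing the call count by orders of magnitude.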
We achieved a further improvement by applying some optimizations on the kernel-computation side.
The biggest speed-up so far, though, was achieved by computing the preference values for all candidates associated with one sequence (i.e., of all its nodes) in one go.
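The batching idea can be sketched as follows: instead of one kernel call per node, stack the candidate feature matrices of all nodes and make a single rectangular kernel call, then split the columns back per node. All names here are illustrative, and `kernel` is assumed to be any function returning an `(n_A, n_B)` matrix:

```python
import numpy as np

def preference_values_batched(kernel, X_act, cand_feats):
    """Evaluate the kernel between the active candidates (X_act) and the
    candidates of ALL nodes of a sequence with one call (sketch)."""
    sizes = [C.shape[0] for C in cand_feats]
    # Stack all nodes' candidate features into one matrix ...
    X_all = np.vstack(cand_feats)
    # ... one rectangular kernel evaluation, shape (n_active, sum(sizes)) ...
    K = kernel(X_act, X_all)
    # ... and split the columns back into one block per node.
    splits = np.cumsum(sizes)[:-1]
    return np.split(K, splits, axis=1)
```

This moves the per-call overhead (input checks, Python dispatch) out of the inner loop and gives the kernel routine larger matrices to work on, which is where vectorized and parallel implementations shine.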
Function | n_A | n_B | d | time (4 cores) | time (1 core) |
---|---|---|---|---|---|
alg_1__numba | 100 | 5000 | 307 | 0.057894 | 0.262477 |
cdist__FAST | 100 | 5000 | 307 | 0.122171 | 0.122416 |
ufunc | 100 | 5000 | 307 | 0.137654 | 0.323284 |
--- | --- | --- | --- | --- | --- |
alg_1__numba | 1000 | 1000 | 307 | 0.313120 | 0.539815 |
cdist__FAST | 1000 | 1000 | 307 | 0.203876 | 0.193676 |
ufunc | 1000 | 1000 | 307 | 0.357854 | 0.593559 |
--- | --- | --- | --- | --- | --- |
alg_1__numba | 100 | 5000 | 7500 | 4.285039 | 9.505218 |
cdist__FAST | 100 | 5000 | 7500 | 3.065871 | 3.094869 |
ufunc | 100 | 5000 | 7500 | 2.244482 | 6.121194 |
--- | --- | --- | --- | --- | --- |
alg_1__numba | 1000 | 1000 | 7500 | 13.003921 | |
cdist__FAST | 1000 | 1000 | 7500 | 5.654260 | |
ufunc | 1000 | 1000 | 7500 | 4.224893 | |
We used 4 cores for the parallel computations ('alg_1__numba' and 'ufunc'). Depending on the input dimensions, different algorithms perform best.
For example, when the cost per kernel-matrix element increases (d = 7500), the 'ufunc' implementation performs best.
When the output matrix has a rectangular shape, i.e. n_A < n_B, then 'alg_1__numba' performs best. This case arises when the preference values are calculated, where n_A is the number of active candidates and n_B is the number of candidates for which the preference values should be computed.
Function | n_A | n_B | d | time |
---|---|---|---|---|
alg_1__numba | 100 | 100 | 307 | 0.001172 |
cdist__FAST | 100 | 100 | 307 | 0.002104 |
ufunc | 100 | 100 | 307 | 0.008020 |
--- | --- | --- | --- | --- |
alg_1__numba | 100 | 200 | 307 | 0.002162 |
cdist__FAST | 100 | 200 | 307 | 0.003946 |
ufunc | 100 | 200 | 307 | 0.009165 |
--- | --- | --- | --- | --- |
alg_1__numba | 100 | 500 | 307 | 0.005299 |
cdist__FAST | 100 | 500 | 307 | 0.009801 |
ufunc | 100 | 500 | 307 | 0.018691 |
--- | --- | --- | --- | --- |
alg_1__numba | 100 | 1000 | 307 | 0.010685 |
cdist__FAST | 100 | 1000 | 307 | 0.019640 |
ufunc | 100 | 1000 | 307 | 0.030049 |
--- | --- | --- | --- | --- |
alg_1__numba | 100 | 5000 | 307 | 0.052604 |
cdist__FAST | 100 | 5000 | 307 | 0.117722 |
ufunc | 100 | 5000 | 307 | 0.125857 |
--- | --- | --- | --- | --- |
alg_1__numba | 100 | 10000 | 307 | 0.103035 |
cdist__FAST | 100 | 10000 | 307 | 0.234538 |
ufunc | 100 | 10000 | 307 | 0.245509 |
--- | --- | --- | --- | --- |
alg_1__numba | 100 | 100000 | 307 | 1.234449 |
cdist__FAST | 100 | 100000 | 307 | 2.443562 |
ufunc | 100 | 100000 | 307 | 2.576871 |
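For reference, the 'cdist'-based variant presumably exploits the identity min(a, b) = (a + b − |a − b|) / 2 (and the analogous one for max), which lets the whole min/max (generalized Tanimoto) kernel be computed from a single cityblock `cdist` call instead of explicit elementwise min/max over all pairs. A sketch of that trick (function names are illustrative, not the actual implementation):

```python
import numpy as np
from scipy.spatial.distance import cdist

def tanimoto_cdist(X, Y):
    """Generalized Tanimoto kernel K[i, j] = sum(min(x_i, y_j)) / sum(max(x_i, y_j))
    via one cityblock cdist call. For non-negative features:
    sum(min) = (|x|_1 + |y|_1 - |x - y|_1) / 2,
    sum(max) = (|x|_1 + |y|_1 + |x - y|_1) / 2."""
    D = cdist(X, Y, metric="cityblock")          # pairwise L1 distances
    s = X.sum(axis=1)[:, None] + Y.sum(axis=1)[None, :]
    return (s - D) / (s + D)

def tanimoto_naive(X, Y):
    """Direct definition with explicit min/max, for comparison."""
    K = np.empty((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        K[i] = np.minimum(x, Y).sum(axis=1) / np.maximum(x, Y).sum(axis=1)
    return K
```

The `cdist` route does one highly optimized C pass over the data, which explains why it is competitive even single-threaded, while the numba and ufunc variants can amortize their overhead only on large or expensive (large d) problems.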
N | n_cand | L_max | Numpy (s) | Numba (s)
---|---|---|---|---
24 | 50 | 15 | 22.856 | 28.291
48 | 50 | 15 | 97.751 | 97.639
96 | 50 | 15 | 321.269 | 292.787
24 | 50 | 25 | 27.407 | 33.160
24 | 100 | 15 | 37.855 | 42.473
... where do we spend most of the run-time?
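One standard way to answer this question (and to produce call-count / own-time tables like the one at the top) is Python's built-in cProfile; a small helper sketch, independent of how the numbers above were actually collected:

```python
import cProfile
import io
import pstats

def profile_top(func, *args, n=15, **kwargs):
    """Run 'func' under cProfile and print the n most expensive calls,
    sorted by cumulative time; returns func's result unchanged."""
    pr = cProfile.Profile()
    pr.enable()
    result = func(*args, **kwargs)
    pr.disable()
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(n)
    print(buf.getvalue())
    return result
```

Sorting by `"tottime"` instead of `"cumulative"` highlights functions with large own time, which is usually the more actionable view when hunting for hot spots.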