bachi55 opened 3 years ago
So far, we have achieved a speed-up by a factor of 4 with very simple modifications. There is still room for improvement.
Top list of the most run-time-consuming parts:
Function | Call Count | Time (ms) | Own Time (ms) | Comment
---|---|---|---|---
sqlite3.Connection (execute) | 261257 | 26353 | 26353 (29.6%) | Data loading from the SQLite
scipy.spatial._distance_wrap.cdist_cityblock | 87952 | 8358 | 8358 (9.4%) | Molecule kernel calculation
_get_molecule_feature_matrix | 85789 | 14610 | 8167 (9.2%) | Data loading from the SQLite
numpy.ufunc (reduce) | 363454 | 2834 | 2834 (3.2%) | |
generalized_tanimoto_kernel | 87952 | 30925 | 2597 (2.9%) | Molecule kernel calculation
str (split) | 16533569 | 2259 | 2259 (2.5%) | |
_get_lambda_delta | 84784 | 34755 | 2024 | |
listcomp | 89728 | 1815 | 1815 | |
sqlite3.Cursor (fetchall) | 171882 | 1741 | 1741 | |
check_array | 176366 | 10176 | 1652 | |
numpy.array | 923289 | 1531 | 1530 | |
_assert_all_finite | 176366 | 5104 | 1483 | |
inner_f | 440877 | 25802 | 1312 | |
cdist | 87952 | 11057 | 897 | |
get_molecule_features_by_molecule_id | 84928 | 41548 | 888 | |
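A large share of the total time goes into `sqlite3.Connection.execute`, which is called roughly 261k times, i.e. close to once per molecule. A common remedy is to batch many per-molecule lookups into a single query with an `IN` clause. A minimal sketch, assuming a hypothetical `fingerprints(molecule_id, features)` table (the actual schema and function names in the code base may differ):

```python
import sqlite3

def load_features_batched(conn, mol_ids, chunk_size=500):
    """Fetch the feature strings of many molecules with few queries,
    instead of issuing one SELECT per molecule (hypothetical schema)."""
    features = {}
    for i in range(0, len(mol_ids), chunk_size):
        chunk = mol_ids[i:i + chunk_size]
        # One placeholder per id in this chunk, e.g. "?,?,?"
        placeholders = ",".join("?" * len(chunk))
        rows = conn.execute(
            "SELECT molecule_id, features FROM fingerprints "
            "WHERE molecule_id IN (%s)" % placeholders, chunk).fetchall()
        for mol_id, feat in rows:
            features[mol_id] = feat
    return features
```

Chunking keeps the number of bound parameters below SQLite's per-statement limit while still reducing the call count by orders of magnitude.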
We achieved a further improvement by applying some optimizations on the kernel-computation side.
The biggest speed-up so far, though, was achieved by computing the preference values for all candidates associated with one sequence (i.e., of all its nodes) in one go.
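The batching idea can be sketched as follows: instead of one kernel call per node, stack the candidate feature matrices of all nodes and make a single rectangular kernel call, then split the columns back per node. All names here are illustrative, and `kernel` is assumed to be any function returning an `(n_A, n_B)` matrix:

```python
import numpy as np

def preference_values_batched(kernel, X_act, cand_feats):
    """Evaluate the kernel between the active candidates (X_act) and the
    candidates of ALL nodes of a sequence with one call (sketch)."""
    sizes = [C.shape[0] for C in cand_feats]
    # Stack all nodes' candidate features into one matrix ...
    X_all = np.vstack(cand_feats)
    # ... one rectangular kernel evaluation, shape (n_active, sum(sizes)) ...
    K = kernel(X_act, X_all)
    # ... and split the columns back into one block per node.
    splits = np.cumsum(sizes)[:-1]
    return np.split(K, splits, axis=1)
```

This moves the per-call overhead (input checks, Python dispatch) out of the inner loop and gives the kernel routine larger matrices to work on, which is where vectorized and parallel implementations shine.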
Function | n_A | n_B | d | time (4 cores) | time (1 core) |
---|---|---|---|---|---|
alg_1__numba | 100 | 5000 | 307 | 0.057894 | 0.262477 |
cdist__FAST | 100 | 5000 | 307 | 0.122171 | 0.122416 |
ufunc | 100 | 5000 | 307 | 0.137654 | 0.323284 |
--- | --- | --- | --- | --- | --- |
alg_1__numba | 1000 | 1000 | 307 | 0.313120 | 0.539815 |
cdist__FAST | 1000 | 1000 | 307 | 0.203876 | 0.193676 |
ufunc | 1000 | 1000 | 307 | 0.357854 | 0.593559 |
--- | --- | --- | --- | --- | --- |
alg_1__numba | 100 | 5000 | 7500 | 4.285039 | 9.505218 |
cdist__FAST | 100 | 5000 | 7500 | 3.065871 | 3.094869 |
ufunc | 100 | 5000 | 7500 | 2.244482 | 6.121194 |
--- | --- | --- | --- | --- | --- |
alg_1__numba | 1000 | 1000 | 7500 | 13.003921 | |
cdist__FAST | 1000 | 1000 | 7500 | 5.654260 | |
ufunc | 1000 | 1000 | 7500 | 4.224893 | |
We used 4 cores for the parallel computations ('alg_1__numba' and 'ufunc'). Depending on the input dimensions, different algorithms perform best.
For example, when the cost per kernel-matrix element increases (d = 7500), the 'ufunc' implementation performs best.
When the output matrix has a rectangular shape, i.e. n_A < n_B, then 'alg_1__numba' performs best. This case arises when the preference values are calculated, where n_A is the number of active candidates and n_B is the number of candidates for which the preference values should be computed.
Function | n_A | n_B | d | time |
---|---|---|---|---|
alg_1__numba | 100 | 100 | 307 | 0.001172 |
cdist__FAST | 100 | 100 | 307 | 0.002104 |
ufunc | 100 | 100 | 307 | 0.008020 |
--- | --- | --- | --- | --- |
alg_1__numba | 100 | 200 | 307 | 0.002162 |
cdist__FAST | 100 | 200 | 307 | 0.003946 |
ufunc | 100 | 200 | 307 | 0.009165 |
--- | --- | --- | --- | --- |
alg_1__numba | 100 | 500 | 307 | 0.005299 |
cdist__FAST | 100 | 500 | 307 | 0.009801 |
ufunc | 100 | 500 | 307 | 0.018691 |
--- | --- | --- | --- | --- |
alg_1__numba | 100 | 1000 | 307 | 0.010685 |
cdist__FAST | 100 | 1000 | 307 | 0.019640 |
ufunc | 100 | 1000 | 307 | 0.030049 |
--- | --- | --- | --- | --- |
alg_1__numba | 100 | 5000 | 307 | 0.052604 |
cdist__FAST | 100 | 5000 | 307 | 0.117722 |
ufunc | 100 | 5000 | 307 | 0.125857 |
--- | --- | --- | --- | --- |
alg_1__numba | 100 | 10000 | 307 | 0.103035 |
cdist__FAST | 100 | 10000 | 307 | 0.234538 |
ufunc | 100 | 10000 | 307 | 0.245509 |
--- | --- | --- | --- | --- |
alg_1__numba | 100 | 100000 | 307 | 1.234449 |
cdist__FAST | 100 | 100000 | 307 | 2.443562 |
ufunc | 100 | 100000 | 307 | 2.576871 |
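For reference, the 'cdist'-based variant presumably exploits the identity min(a, b) = (a + b − |a − b|) / 2 (and the analogous one for max), which lets the whole min/max (generalized Tanimoto) kernel be computed from a single cityblock `cdist` call instead of explicit elementwise min/max over all pairs. A sketch of that trick (function names are illustrative, not the actual implementation):

```python
import numpy as np
from scipy.spatial.distance import cdist

def tanimoto_cdist(X, Y):
    """Generalized Tanimoto kernel K[i, j] = sum(min(x_i, y_j)) / sum(max(x_i, y_j))
    via one cityblock cdist call. For non-negative features:
    sum(min) = (|x|_1 + |y|_1 - |x - y|_1) / 2,
    sum(max) = (|x|_1 + |y|_1 + |x - y|_1) / 2."""
    D = cdist(X, Y, metric="cityblock")          # pairwise L1 distances
    s = X.sum(axis=1)[:, None] + Y.sum(axis=1)[None, :]
    return (s - D) / (s + D)

def tanimoto_naive(X, Y):
    """Direct definition with explicit min/max, for comparison."""
    K = np.empty((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        K[i] = np.minimum(x, Y).sum(axis=1) / np.maximum(x, Y).sum(axis=1)
    return K
```

The `cdist` route does one highly optimized C pass over the data, which explains why it is competitive even single-threaded, while the numba and ufunc variants can amortize their overhead only on large or expensive (large d) problems.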
N | n_cand | L_max | Numpy (s) | Numba (s)
---|---|---|---|---
24 | 50 | 15 | 22.856 | 28.291
48 | 50 | 15 | 97.751 | 97.639
96 | 50 | 15 | 321.269 | 292.787
24 | 50 | 25 | 27.407 | 33.160
24 | 100 | 15 | 37.855 | 42.473
... where do we spend most of the run-time?
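One standard way to answer this question (and to produce call-count / own-time tables like the one at the top) is Python's built-in cProfile; a small helper sketch, independent of how the numbers above were actually collected:

```python
import cProfile
import io
import pstats

def profile_top(func, *args, n=15, **kwargs):
    """Run 'func' under cProfile and print the n most expensive calls,
    sorted by cumulative time; returns func's result unchanged."""
    pr = cProfile.Profile()
    pr.enable()
    result = func(*args, **kwargs)
    pr.disable()
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(n)
    print(buf.getvalue())
    return result
```

Sorting by `"tottime"` instead of `"cumulative"` highlights functions with large own time, which is usually the more actionable view when hunting for hot spots.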