Closed yanzastro closed 1 year ago
I ran the demo notebook with the old and updated versions of the functions that you added. Strangely, the CPU time used looks to be about the same old vs updated, but the wall time is reduced for your new function methods by around a factor of four. I'm not sure why the old way had what looks like random hangs built in, but it does look like it's working and the code now runs faster. So, I'll approve once you add a note about the slicing in the one comment in the code, as things look good.
Hi @sschmidt23, thanks for reviewing the PR! Basically what I did is to add our own function to calculate the distance between data vector and SOM weight vector because the built-in function in somoclu
is slow. Here is the original function in somoclu (see line 556 in https://github.com/peterwittek/somoclu/blob/7c784488c74954d0e24915a00c83a5b2864ff1b8/src/Python/somoclu/train.py):
` def get_surface_state(self, data=None):
if data is None:
d = self._data
else:
d = data
codebookReshaped = self.codebook.reshape(self.codebook.shape[0] * self.codebook.shape[1], self.codebook.shape[2])
parts = np.array_split(d, 200, axis=0)
am = np.empty((0, (self._n_columns * self._n_rows)), dtype="float64")
for part in parts:
am = np.concatenate((am, (cdist((part), codebookReshaped, 'euclidean'))), axis=0)
if data is None:
self.activation_map = am
return am`
The distances are stored in the array am
which is initialized as a zero-size array. The distances are calculated chunk by chunk with the for loop. In each iteration, the code simply concatenates the distances to the old am
array. This makes the function run pretty slowly.
What I updated is just to define am
with the size of the data and assign the distances from each loop to the associated part of am
.
Now I also combine the get_surface_state
function with get_bmus
function to reduce redundancy
Hi @sschmidt23, thanks for reviewing the PR! Basically what I did is to add our own function to calculate the distance between data vector and SOM weight vector because the built-in function in somoclu
is slow. Here is the original function in somoclu (see line 556 in https://github.com/peterwittek/somoclu/blob/7c784488c74954d0e24915a00c83a5b2864ff1b8/src/Python/somoclu/train.py):
def get_surface_state(self, data=None):
if data is None:
d = self._data
else:
d = data
codebookReshaped = self.codebook.reshape(self.codebook.shape[0] * self.codebook.shape[1], self.codebook.shape[2])
parts = np.array_split(d, 200, axis=0)
am = np.empty((0, (self._n_columns * self._n_rows)), dtype="float64")
for part in parts:
am = np.concatenate((am, (cdist((part), codebookReshaped, 'euclidean'))), axis=0)
if data is None:
self.activation_map = am
return am
The distances are stored in the array am
which is initialized as a zero-size array. The distances are calculated chunk by chunk with the for loop. In each iteration, the code simply concatenates the distances to the old am
array. This makes the function run pretty slowly.
What I updated is just to define am
with the size of the data and assign the distances from each loop to the associated part of am
.
Now I also combine the get_surface_state
function with get_bmus
function to reduce redundancy
Codecov Report
Patch coverage:
100.00
% and no project coverage change.Additional details and impacted files
```diff @@ Coverage Diff @@ ## main #308 +/- ## ========================================= Coverage 100.00% 100.00% ========================================= Files 38 38 Lines 2507 2507 ========================================= Hits 2507 2507 ``` | Flag | Coverage Δ | | |---|---|---| | unittests | `100.00% <100.00%> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=LSSTDESC#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/LSSTDESC/RAIL/pull/308?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=LSSTDESC) | Coverage Δ | | |---|---|---| | [src/rail/estimation/algos/somocluSOM.py](https://codecov.io/gh/LSSTDESC/RAIL/pull/308?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=LSSTDESC#diff-c3JjL3JhaWwvZXN0aW1hdGlvbi9hbGdvcy9zb21vY2x1U09NLnB5) | `100.00% <100.00%> (ø)` | | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=LSSTDESC). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=LSSTDESC):umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.