LSSTDESC / rail_attic

Redshift Assessment Infrastructure Layers
MIT License

update files #308

Closed · yanzastro closed this 1 year ago

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (0228aa7) 100.00% compared to head (884acd9) 100.00%.

Additional details and impacted files

```diff
@@           Coverage Diff            @@
##              main      #308   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           38        38
  Lines         2507      2507
=========================================
  Hits          2507      2507
```

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `100.00% <100.00%> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=LSSTDESC#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Impacted Files](https://codecov.io/gh/LSSTDESC/RAIL/pull/308?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=LSSTDESC) | Coverage Δ | |
|---|---|---|
| [src/rail/estimation/algos/somocluSOM.py](https://codecov.io/gh/LSSTDESC/RAIL/pull/308?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=LSSTDESC#diff-c3JjL3JhaWwvZXN0aW1hdGlvbi9hbGdvcy9zb21vY2x1U09NLnB5) | `100.00% <100.00%> (ø)` | |


sschmidt23 commented 1 year ago

I ran the demo notebook with the old and updated versions of the functions that you added. Strangely, the CPU time used looks to be about the same for the old and updated versions, but the wall time is reduced by around a factor of four with your new function methods. I'm not sure why the old way had what look like random hangs built in, but it does look like it's working and the code now runs faster. Things look good, so I'll approve once you add a note about the slicing in the one comment in the code.

yanzastro commented 1 year ago

Hi @sschmidt23, thanks for reviewing the PR! Basically, what I did is add our own function to calculate the distances between the data vectors and the SOM weight vectors, because the built-in function in somoclu is slow. Here is the original function in somoclu (see line 556 in https://github.com/peterwittek/somoclu/blob/7c784488c74954d0e24915a00c83a5b2864ff1b8/src/Python/somoclu/train.py):

```python
def get_surface_state(self, data=None):

    if data is None:
        d = self._data
    else:
        d = data

    codebookReshaped = self.codebook.reshape(self.codebook.shape[0] * self.codebook.shape[1], self.codebook.shape[2])
    parts = np.array_split(d, 200, axis=0)
    am = np.empty((0, (self._n_columns * self._n_rows)), dtype="float64")
    for part in parts:
        am = np.concatenate((am, (cdist((part), codebookReshaped, 'euclidean'))), axis=0)

    if data is None:
        self.activation_map = am
    return am
```

The distances are stored in the array `am`, which is initialized as a zero-size array. The distances are calculated chunk by chunk in the for loop; in each iteration, the code simply concatenates the new distances onto the existing `am` array, which reallocates and copies it every time. This makes the function run quite slowly.

What I updated is just to preallocate `am` with the full size of the data and assign the distances from each loop iteration into the corresponding slice of `am`.
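Roughly, the idea looks like this (a minimal, self-contained sketch; the function name, signature, and chunk count are illustrative, and the actual code in `src/rail/estimation/algos/somocluSOM.py` may differ in detail):

```python
import numpy as np
from scipy.spatial.distance import cdist


def surface_state_preallocated(data, codebook, n_rows, n_columns, n_chunks=200):
    """Distances between every data vector and every SOM cell weight vector.

    `codebook` has shape (n_rows, n_columns, n_features);
    `data` has shape (n_objects, n_features).
    """
    # Flatten the SOM grid of weight vectors to shape (n_cells, n_features).
    codebook_reshaped = codebook.reshape(n_rows * n_columns, codebook.shape[-1])
    # Preallocate the full activation map instead of growing it by concatenation.
    am = np.empty((data.shape[0], n_rows * n_columns), dtype="float64")
    start = 0
    for part in np.array_split(data, n_chunks, axis=0):
        stop = start + part.shape[0]
        # Write each chunk of distances directly into its slice of `am`.
        am[start:stop] = cdist(part, codebook_reshaped, "euclidean")
        start = stop
    return am
```

This avoids the repeated reallocation and copying that `np.concatenate` does on every iteration, which is presumably where the wall-time savings come from.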

Now I have also combined the get_surface_state function with the get_bmus function to reduce redundancy.
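For context, once the activation map is in hand, the best-matching units can be read off with an argmin over the SOM cells, so a separate call is not needed. A sketch of that step (the (row, column) ordering here is an assumption and has to match the convention used when the codebook was reshaped; somoclu's own get_bmus may order the indices differently):

```python
import numpy as np


def best_matching_units(am, n_rows, n_columns):
    # `am` is the (n_objects, n_cells) activation map from the sketch above.
    bmu_flat = np.argmin(am, axis=1)  # index of the closest SOM cell per object
    rows, cols = np.unravel_index(bmu_flat, (n_rows, n_columns))
    return np.column_stack((rows, cols))  # shape (n_objects, 2)
```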
