JustGlowing / minisom

MiniSom is a minimalistic implementation of the Self-Organizing Maps
MIT License

Large Quantization Error with Spatio-Temporal Data? #187

Open vwgeiser opened 2 months ago

vwgeiser commented 2 months ago

Belongs under Question tag

I am working with Mean Sea Level Pressure data ('pmsl' in the original 4 km HRRR NetCDF) for a region over the Midwest, and I have a working SOM, but my Quantization Error (QE) is very large (~67.0), while my Topographic Error is lower, or at least more typical (~0.03). Additionally, my learning curve doesn't match the typical shape. Has anyone run into this before with SOMs on weather/climate data?

The best explanation I have relates to my sample size: only 56 days (and therefore 56 inputs into the SOM), each flattened from a 420x444 grid to 'input_len=186480', which is probably an oddly shaped input for a SOM. Increasing the SOM dimensions reduces QE as expected, but since our dataset is quite small this is not the most desirable solution.

I'm still relatively new to SOMs, so I was curious whether I'm missing some known SOM best practices or overlooking something.

import numpy as np
from minisom import MiniSom
from sklearn.preprocessing import MinMaxScaler

# Scale the data for input into the SOM
scaler = MinMaxScaler()
pmsl_data_scaled = scaler.fit_transform(pmsl_data)

# SOM hyperparameters
som_map = (3, 3)
x, y = som_map
sigma = np.sqrt(x**2 + y**2)
learning_rate = .3
ngb_function = 'gaussian'
decay_function = 'linear_decay_to_zero'
sigma_decay_function = 'linear_decay_to_one'
init = 'random'  # or 'pca'
train = 'batch'
iterations = 2000
topology = 'hexagonal'  # could also be rectangular
activation_distance = 'euclidean'

X = np.array(pmsl_data_scaled)
input_len = X.shape[1]

# create som
som = MiniSom(x=x, y=y, input_len=input_len,
              sigma=sigma, learning_rate=learning_rate,
              neighborhood_function=ngb_function,
              topology=topology,
              decay_function=decay_function,
              sigma_decay_function=sigma_decay_function,
              activation_distance=activation_distance,
              random_seed=64)

# train the som (train = 'batch' -> train_batch)
som.train_batch(X, iterations, verbose=True)

[Attached images: PMSL33, PMSL33LC]

JustGlowing commented 2 months ago

Hi, try using the StandardScaler instead of the MinMaxScaler before anything else.
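For example, a minimal sketch of the swap (assuming pmsl_data is the samples-by-features matrix from the snippet above):

from sklearn.preprocessing import StandardScaler

# Standardize each feature (grid cell) to zero mean and unit variance
# instead of squeezing it into [0, 1] with MinMaxScaler.
scaler = StandardScaler()
pmsl_data_scaled = scaler.fit_transform(pmsl_data)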

Other things that you can try:

vwgeiser commented 2 months ago

@JustGlowing The strange shape of the learning curve can be explained by an oversight in my own code: I was running the visualization process from the BasicUsage.ipynb example on an already trained SOM instead of on a new instance. However, the large QE question still remains. I noticed that the MovieCovers.ipynb example also has quite a large QE, so this could simply be the real error value for the problem I'm working with.

(The only aspect that has changed about the problem from the last post is I now have a few more samples to work with.)

Using StandardScaler increases QE substantially; is there an interpretive reason why StandardScaler might be preferred? Here is a side-by-side comparison with equal hyperparameters:

[Attached image: SSvsMM]

Learning curve visualization [with linear_decay_to_zero/one]: [Attached images: MSLP33LC, MSLP33TE]

JustGlowing commented 2 months ago

The StandardScaler usually works well because the distance from the codebooks is computed using the Euclidean distance by default. Other types of normalization also work, but standardization has worked in most cases for me.

Regarding the magnitude of the Quantization Error: the error reflects how close the codebooks are to the data, and it depends heavily on the scale of the data and the size of the map. I don't fully understand your data, but it seems that the topology makes sense. Unless you care about QE for a particular reason, I wouldn't worry too much about its magnitude.
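For reference, both errors can be checked on a trained map; a minimal sketch, reusing som and X from the first post:

# Quantization error: mean Euclidean distance between each sample
# and its best matching unit (scale-dependent by construction).
print('QE:', som.quantization_error(X))

# Topographic error: fraction of samples whose first and second
# best matching units are not adjacent on the map.
print('TE:', som.topographic_error(X))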


vwgeiser commented 2 months ago

Gotcha, thanks for the explanation! So if I am understanding correctly, the high quantization error might have to do with my input size (combined with the size of the SOM)? If I am working with a 420x444 lat/long grid of pixel values that I flatten into 'input_len=186480' for input to the SOM, then it follows that even small per-feature differences between samples are summed over the full vector length, leading to a higher QE?

Suppose the average squared difference per feature is around 0.01 (some small difference left over after scaling):

Total squared difference = 0.01 × 186480 = 1864.80
Quantization error = sqrt(1864.80) ≈ 43.2
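To illustrate that scaling effect, here is a toy sketch (random data and illustrative settings only, not my actual pipeline) showing how QE grows with the number of features even when every feature lies in [0, 1]:

import numpy as np
from minisom import MiniSom

rng = np.random.RandomState(0)
for n_features in (100, 10000):
    data = rng.rand(56, n_features)  # 56 random "samples" in [0, 1]
    som = MiniSom(3, 3, n_features, random_seed=1)
    som.train_random(data, 500)
    # QE roughly tracks sqrt(n_features) for a comparable per-feature error
    print(n_features, som.quantization_error(data))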

Here is a quick example of the SOM output when I don't scale the data: [Attached images: PMSL33, PMSL33LC]

[ 200 / 200 ] 100% - 0:00:00 left
quantization error: 157049.84784462157
SOM training took 4.58 seconds!
Begin Learning Curve Visualization
End Learning Curve Visualization
Q-error: 157887.617  T-error: 0.013
SOM LC visualization took 201.26 seconds!

An averaged composite of all samples included within each node (a sanity check): [Attached image: PMSL33AVE]

JustGlowing commented 2 months ago

You are indeed working with a high number of features, and that is likely causing your QE to be high. This is called "the curse of dimensionality".

If you have a way to select a subset of features, or to compress them with a dimensionality reduction technique, that can help reduce the error.
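For example, a minimal sketch of compressing the flattened fields with PCA before the SOM (scikit-learn assumed; the number of components and the hyperparameters are illustrative):

from sklearn.decomposition import PCA
from minisom import MiniSom

# Project the 186480-dimensional samples onto a few principal components;
# with ~149 samples, a few dozen components often captures most of the variance.
pca = PCA(n_components=30)
pmsl_data_reduced = pca.fit_transform(pmsl_data_scaled)

som = MiniSom(3, 3, pmsl_data_reduced.shape[1], sigma=1.5,
              learning_rate=.3, random_seed=64)
som.train(pmsl_data_reduced, 2000)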


vwgeiser commented 2 months ago

@JustGlowing Right now the input length is 186480, as I have one variable flattened into a numpy array (with 149 samples this becomes an array of shape 149 x 186480). What would be the best way to add another variable on top of this? The documentation for "train" states that the input data can be an np.array or list data matrix. How does this work when initializing the SOM with input_len if the data for one sample spans multiple rows? Is the way to incorporate another variable to flatten the new variable and append it onto the first, making the input length 186480 + 186480 = 372960? It would seem more logical to add it as another column of the input, but then my question is again how this would work with input_len, since one row would no longer correspond to one sample; 186480 rows (and 2 or 3 or more columns) would together correspond to one input into the SOM.

I am looking for functionality similar to the R "supersom" function in the kohonen package. Is that something MiniSom could support naturally?

JustGlowing commented 2 months ago

Hi, from what I understand you have an input matrix with 149 rows and 186480 columns. This means that you have 149 samples and 186480 variables. Even if you are reshaping objects that in other domains are considered variables, for a SOM the columns are considered variables and the rows are samples.

From what I understand, you want to add more variables to your input, and that's an easy task: just add more columns to your matrix and set input_len equal to the number of columns you have.
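A minimal sketch of that, assuming each field has already been flattened to shape (149, 186480) and the variable names are illustrative:

import numpy as np

# Stack the flattened fields side by side: each row stays one sample,
# each column is one variable (one grid cell of one field).
X = np.hstack([pmsl_flat, temp_flat, humidity_flat])  # shape (149, 559440)

input_len = X.shape[1]  # pass this value to MiniSom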

vwgeiser commented 2 months ago

@JustGlowing I worded this in a weird way, I apologize. Each variable of the data is spatial, with two grid dimensions (420x444). If I were to put it in the format from your previous comment, it would have 3 columns (so an input length of 3?). I've tried to implement this in MiniSom given my understanding of the problem and came up with the following:

| Pressure | Temperature | Humidity |
| [420x444] | [420x444] | [420x444] |  (sample 1)
| [420x444] | [420x444] | [420x444] |  (sample 2)
...
| [420x444] | [420x444] | [420x444] |  (sample 149)

This organization has a shape of (149, 3, 420, 444).

However, when this is input into MiniSom, it raises:

ValueError: could not broadcast input array from shape (3,420,444) into shape (3,)
Flattening each field instead:

| Pressure | Temperature | Humidity |
| [186480] | [186480] | [186480] |  (sample 1)
| [186480] | [186480] | [186480] |  (sample 2)
...
| [186480] | [186480] | [186480] |  (sample 149)

With all the data flattened, this organization has a shape of (149, 3, 186480).

It yields the same result when put into MiniSom:

ValueError: could not broadcast input array from shape (3,186480) into shape (3,)

When I do this I run into the errors above relating to the shape of the input data, which is why I was looking for a way to represent this in MiniSom. From the README it doesn't seem this is currently supported, but I was wondering if others have encountered this in the past?

If input_len could instead accept an input_shape, this sort of multivariable spatial data could be represented in a structure compatible with MiniSom; however, this wouldn't be a simple change for the rest of the package.

vwgeiser commented 2 months ago

The only way I can think of to represent this in MiniSom would be to append the pressure, temperature, and humidity values together into one vector of length 186480 + 186480 + 186480 = 559440. That way all three variables would be considered, and I could reshape portions of this vector to produce a final visualization similar to the one above?

I.e., the first 186480 values within the weights of a SOM node would correspond to pressure, the next 186480 to temperature, and the final 186480 to humidity?

JustGlowing commented 2 months ago

Hi again @vwgeiser, MiniSom only accepts 2D (samples x features) matrices as input, and your last intuition makes sense.

If you find that the dimensionality of the problem becomes an issue, you can train a separate SOM for each type of input and then find a way to aggregate the results.

vwgeiser commented 2 months ago


For anyone finding this thread later: vectorization is the approach I was looking for, and it is what I am currently working with as a solution!
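A minimal sketch of that reshaping step for visualization, assuming the 559440-long input vectors were concatenated as [pressure | temperature | humidity] and som is the trained map from the earlier snippets:

ny, nx = 420, 444
n = ny * nx  # 186480 values per field

# Weight vector of one SOM node, e.g. the node at map position (0, 0).
w = som.get_weights()[0, 0]  # shape (559440,)

# Undo the concatenation: slice the vector back into the three fields
# and restore the original 420x444 grids for plotting.
pressure_grid    = w[:n].reshape(ny, nx)
temperature_grid = w[n:2 * n].reshape(ny, nx)
humidity_grid    = w[2 * n:].reshape(ny, nx)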