VNNikolaidis / nnlib2Rcpp

An R package for Neural Nets created using nnlib2

Checking the codebook vectors in trained vector quantization NN #16

Closed drag05 closed 4 months ago

drag05 commented 4 months ago

Using version 0.2.4 of nnlib2Rcpp, I am trying to identify the codebook vectors after training the supervised LVQ (LVQs). Taking the LVQs example in the manual, the show function returns this portion:

ID: 14
Type: Connection_Set
Aux.Param: 0
SourceCom: 12
DestinCom: 13
ListSize(elements): 12
0: CON FR: 0 TO: 0 WGT: 0.18191
1: CON FR: 1 TO: 0 WGT: 0.582824
2: CON FR: 2 TO: 0 WGT: 0.0797798
3: CON FR: 3 TO: 0 WGT: 0.0616723
4: CON FR: 0 TO: 1 WGT: 0.408429
5: CON FR: 1 TO: 1 WGT: 0.381401
6: CON FR: 2 TO: 1 WGT: 0.44876
7: CON FR: 3 TO: 1 WGT: 0.41235
8: CON FR: 0 TO: 2 WGT: 0.572357
9: CON FR: 1 TO: 2 WGT: 0.433802
10: CON FR: 2 TO: 2 WGT: 0.727807
11: CON FR: 3 TO: 2 WGT: 0.826277
Component: Output

Question1: are rows 0:3 the components of the codebook vector with iris_desired_class_ids = 0? Question2: if yes on Q1, is there any way of inputting the desired number of codebook vectors?

Please advise, thank you!

VNNikolaidis commented 4 months ago

About Question1: "are rows 0:3 the components of the codebook vector with iris_desired_class_ids = 0?"

Yes, the shown connection set effectively maintains (as connection weights) vectors that will be used to assign data to a class. So, the weights in rows 0:3, 4:7 and 8:11 each correspond to a class and, during recall, the model effectively does 1-NNC, comparing the data against these vectors.
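To make this concrete, the weights shown can be read as a 3-row matrix (one codebook vector per class), and recall then reduces to nearest-prototype classification. Here is a minimal plain-R sketch (not package code; the values are copied from the listing above, and the function name is just illustrative):

```r
# codebook vectors as rows (weights copied from the connection listing above):
codebook <- matrix(c(0.18191,  0.582824, 0.0797798, 0.0616723,   # class 0
                     0.408429, 0.381401, 0.44876,   0.41235,     # class 1
                     0.572357, 0.433802, 0.727807,  0.826277),   # class 2
                   nrow = 3, byrow = TRUE)

# 1-NNC recall: assign a (0-1 scaled) data vector to its nearest prototype,
# returning a 0-based class id to match the TO indexes in the listing:
nearest_class <- function(x)
  which.min(apply(codebook, 1, function(w) sqrt(sum((x - w) ^ 2)))) - 1
```

For example, `nearest_class(c(0.2, 0.6, 0.08, 0.06))` returns 0, the class whose prototype (rows 0:3 of the listing) is closest.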

About Question2: if yes on Q1, is there any way of inputting the desired number of codebook vectors?

In terms of the currently provided “ready-to-use” sLVQ, the answer is no. Having several vectors correspond to each class is indeed useful, desirable, more general and applicable to many problems. However, the current sLVQ and other “ready-to-use” models in this package were meant to be simple versions of the corresponding family of NNs. I.e. they were created to show how the underlying C++ library can be used to define and create one’s own NN models. So, in the current sLVQ only a single vector is maintained for each class. I can provide suggestions for the implementation of the model you describe (via the underlying C++ library and/or employment of the package’s ‘NN’ module) but they would involve a little coding. Let me know of more specifics if you need to discuss this further.

drag05 commented 4 months ago

@VNNikolaidis I am far from adept at using C++ and I would appreciate you taking a little time to show this code to me at your convenience. Thank you!

VNNikolaidis commented 4 months ago

(This is a response to your comment and continuing on the original Question2: if yes on Q1, is there any way of inputting the desired number of codebook vectors?)

I must admit that some time has passed since I touched these example NNs (and a little more since I used any LVQ, SOM or similar NN on actual data problems). So I had to refresh my memory on how these were implemented; the C++ code for both LVQ NN types (supervised and unsupervised) is in file lvq_nn.cpp. C++ is the fastest way (in terms of runtime performance) to play with the underlying library.

I can help with that. But to do so, we must agree on the model that will be implemented, i.e. the topology and the functionality of the NN components (what they will do during encoding and recalling).

If you have a specific model in mind let me know. Otherwise, just throwing ideas out of my head, I would add redundant output nodes (for example 3 per class). This would have the effect of having 3 prototype vectors per class. The only problem to implement would be how to select the connections that need adjusting during encoding, but nothing new there. So the question is: would this be the model you have in mind, or something different?

VNNikolaidis commented 4 months ago

If (and only if) you really want to get deeper on this, read further. Below I attempt to show more on how the CURRENT implementation of the (simple) supervised LVQ operates.

When it recalls a data vector, the connections pass the difference between each input data vector coordinate value and the corresponding class-prototype coordinate value. These differences are fed to the output-layer nodes, and each output node uses them to produce (as output) the Euclidean distance of the vector to that class's prototype vector. It is effectively 1-NNC.

The winning node is picked as the one with smallest output (i.e. distance).

During encode, a data vector is recalled (as above) and the winner is picked. If it represents the desired class, a "reward" flag is raised for that node; otherwise a "punish" flag is placed there. These flags are then used to adjust connection weights.

(Unfortunately, reading the above description I just wrote makes the process look much more complicated than it really is, but it is the best I can do.)

To reveal a little more about the details (some are still hidden in the C++ code for LVQ nodes and LVQ connections (defined here)), I re-implemented a supervised LVQ for the same problem (“learning” iris_s), but this time I used the “NN” module. I had to use a very ugly (and not very R-style) nested loop to show how encoding is done:

# Example similar to that of help(LVQs) implemented via 'NN':

library(nnlib2Rcpp)
rm(list = ls())

# LVQ expects data in 0 to 1 range, so scale some numeric data...
iris_s <- as.matrix(iris[1:4])
c_min <- apply(iris_s, 2, FUN = "min")
c_max <- apply(iris_s, 2, FUN = "max")
c_rng <- c_max - c_min
iris_s <- sweep(iris_s, 2, FUN = "-", c_min)
iris_s <- sweep(iris_s, 2, FUN = "/", c_rng)

# create a vector of desired class ids:
iris_desired_class_ids <- as.integer(iris$Species)

input_length  <- 4
output_length <- 3

#-----------------------------------------------------------------------------
# implement a supervised LVQ using NN module:

LVQ_PUNISH_PE <- 10 # just a definition used in the C++ LVQ code
LVQ_DEACTI_PE <- 20 # just a definition used in the C++ LVQ code
LVQ_REWARD_PE <- 30 # just a definition used in the C++ LVQ code
LVQ_RND_MIN <-    0 # just a definition used in the C++ LVQ code
LVQ_RND_MAX <-   +1 # just a definition used in the C++ LVQ code

# create a typical LVQ topology for this problem:

n <- new('NN')
n$add_layer('pass-through', input_length)
n$add_connection_set('LVQ')
n$add_layer('LVQ-output', output_length)
n$create_connections_in_sets(LVQ_RND_MIN, LVQ_RND_MAX)

# the ugly (nested loop) encoding code:

for (epoch in 1:5)
    for (i in 1:nrow(iris_s))
    {
        # first recall a single data vector:

        n$input_at(1, iris_s[i, ])
        n$recall_all_fwd()

        # find which 'class' is recalled (the one with smallest distance)
        current_winner_pe <- which.min(n$get_output_at(3))

        # now check if the correct class was recalled (and reward)
        # or an incorrect one (and punish):

        new_output_flags <- rep(LVQ_DEACTI_PE, output_length)
        new_output_flags[current_winner_pe] <- LVQ_PUNISH_PE
        if (current_winner_pe == iris_desired_class_ids[i])
            new_output_flags[current_winner_pe] <- LVQ_REWARD_PE
        n$set_misc_values_at(3, new_output_flags)

        n$encode_at(2)
    }

# done encoding.

# recall all data:

lvq_recalled_class_ids <-
    apply(n$recall_dataset(iris_s, 1, 3, TRUE), 1, which.min)

plot(iris_s, pch = lvq_recalled_class_ids, main = "LVQ recalled clusters (module)")

correct <- lvq_recalled_class_ids == iris_desired_class_ids
cat("Correct:", sum(correct))

P.S. Of course, we are making things harder than they need to be. These simple LVQ models can easily be implemented using simpler approaches (such as a matrix to store the weights and some simple calculations to adjust them).
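To illustrate that simpler approach, here is a rough matrix-based sketch of a supervised LVQ (an LVQ1-style illustration only, independent of nnlib2Rcpp; the function name and parameters are hypothetical):

```r
# illustrative plain-R supervised LVQ: one prototype row per class, classes
# assumed coded 1..k; just the "matrix + simple calculations" idea, not
# the package's implementation:
lvq_train <- function(data, class_ids, epochs = 5, rate = 0.1) {
  k <- max(class_ids)
  w <- matrix(runif(k * ncol(data)), nrow = k)   # random initial prototypes
  for (e in 1:epochs)
    for (i in 1:nrow(data)) {
      x <- data[i, ]
      # winner = prototype row with the smallest squared distance to x:
      winner <- which.min(rowSums((w - matrix(x, k, length(x), byrow = TRUE))^2))
      # reward (move toward x) if correct class, punish (move away) if not:
      s <- if (winner == class_ids[i]) +1 else -1
      w[winner, ] <- w[winner, ] + s * rate * (x - w[winner, ])
    }
  w                                              # rows are the codebook vectors
}
```

Called as `lvq_train(iris_s, iris_desired_class_ids)`, the returned rows play the role of the codebook vectors.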

drag05 commented 4 months ago

@VNNikolaidis

A) The result lists 1 codebook vector per class.

Curiously, increasing the output_length by 1 unit created a new codebook vector for a nonexistent class - shown in the "Petal" plane only - while the output remained identical (accuracy and mapping).

output_length <- 4  # increase by one unit from default value

# the output layer

ID: 17
Type: Connection_Set
Aux.Param: 0
SourceCom: 16
DestinCom: 18
ListSize(elements): 16
0: CON FR: 0 TO: 0 WGT: 0.181909
1: CON FR: 1 TO: 0 WGT: 0.582822
2: CON FR: 2 TO: 0 WGT: 0.0797797
3: CON FR: 3 TO: 0 WGT: 0.0616732
4: CON FR: 0 TO: 1 WGT: 0.408428
5: CON FR: 1 TO: 1 WGT: 0.381394
6: CON FR: 2 TO: 1 WGT: 0.44877
7: CON FR: 3 TO: 1 WGT: 0.412359
8: CON FR: 0 TO: 2 WGT: 0.572363
9: CON FR: 1 TO: 2 WGT: 0.433801
10: CON FR: 2 TO: 2 WGT: 0.727809
11: CON FR: 3 TO: 2 WGT: 0.826278
12: CON FR: 0 TO: 3 WGT: 0.483242
13: CON FR: 1 TO: 3 WGT: 0.047457
14: CON FR: 2 TO: 3 WGT: 0.882042
15: CON FR: 3 TO: 3 WGT: -0.0840498
Component: LVQ-output

> correct <- lvq_recalled_class_ids == iris_desired_class_ids
Correct: 136

B) "If you have a specific model in mind let me know": Years ago I was using the kohonen package for GIS analyses and I remember the algorithm was a bit slow with large datasets. Thank you!

VNNikolaidis commented 4 months ago

@drag05

I think what you see is expected.

In the current model, each output layer node (I call them PE for Processing Element) implicitly represents a class. That is why the connections incoming to each output node are so important in assigning a data vector to the corresponding class. So only the weights linked to nodes corresponding to existing classes are usually adjusted.

Now, about the data vector assigned to the "nonexistent" class you added. I believe it just accidentally happened to be closer to the weights stored for that "nonexistent" class. These weights have probably undergone little change from their initial (random) values; they would only be adjusted if (during encoding) a data point was erroneously assigned to this "nonexistent" class, in which case the weights would be adjusted via the "punish" mechanism. But this probably stopped happening very early in the encoding process. However, these weight values may still be within the data's range and happen to be the closest ones to some of the data points, in which case those data points are assigned to that class, existing or "nonexistent".

I did a version of the code where I keep and display the weights at each iteration, and you get things like this (or other similar spaghetti-like images). I added one "nonexistent" class. It gets more colorful if you add several:


The red line at the bottom is the vector for the "nonexistent" class moving away from the data. Now if it had not moved far enough, some data points may be assigned to it in the final classification.

Back to the original topic now.

I quickly created a variation where several output nodes correspond to each training data class. I believe this may be close to what you originally suggested. We now have several groups of weights -aka codebook vectors (?)- employed for each class. Btw, this is controlled by the number_of_output_pes_per_class variable in the code below:

# Example similar to that of help(LVQs) implemented via 'NN':

library(nnlib2Rcpp)
rm(list = ls())

# LVQ expects data in 0 to 1 range, so scale some numeric data...

iris_s <- as.matrix(iris[1:4])
c_min <- apply(iris_s, 2, FUN = "min")
c_max <- apply(iris_s, 2, FUN = "max")
c_rng <- c_max - c_min
iris_s <- sweep(iris_s, 2, FUN = "-", c_min)
iris_s <- sweep(iris_s, 2, FUN = "/", c_rng)

# create a vector of desired class ids:
desired_class_ids <- as.integer(iris$Species)

# defined just to make names more general (independent from iris):

data              <- iris_s
input_length      <- ncol(data)
number_of_classes <- length(unique(desired_class_ids))

# how many nodes will be implicitly assigned for each class,i.e. how many groups
# of connections/prototype-vectors/codebook-vectors(?) will be used per class:

number_of_output_pes_per_class <- 4

# output layer will be expanded to accommodate multiple PEs per class:

output_layer_size <-
  number_of_classes * number_of_output_pes_per_class

#-----------------------------------------------------------------------------
# implement a supervised LVQ using NN module:

LVQ_PUNISH_PE <- 10   # just a definition used in the C++ LVQ code
LVQ_DEACTI_PE <- 20   # just a definition used in the C++ LVQ code
LVQ_REWARD_PE <- 30   # just a definition used in the C++ LVQ code
LVQ_RND_MIN <-    0   # just a definition used in the C++ LVQ code
LVQ_RND_MAX <-   +1   # just a definition used in the C++ LVQ code

# create a typical LVQ topology for this problem:

n <- new('NN')
n$add_layer('pass-through', input_length)
n$add_connection_set('LVQ')
n$add_layer('LVQ-output', output_layer_size)
n$create_connections_in_sets(LVQ_RND_MIN, LVQ_RND_MAX)

# the ugly (nested loop) encoding code:

for (epoch in 1:5)
  for (i in 1:nrow(data))
  {
    # recall a data vector:

    n$input_at(1, data[i, ])
    n$recall_all_fwd()

    # find which output node is best for input vector (has smallest distance)
    current_winner_pe <- which.min(n$get_output_at(3))

    # translate winning node to class id:
    returned_class <-
      ceiling(current_winner_pe / number_of_output_pes_per_class)

    # now check if the correct class was recalled (and reward)
    # or an incorrect (and punish):

    new_output_flags <- rep(LVQ_DEACTI_PE, output_layer_size)
    new_output_flags[current_winner_pe] <- LVQ_PUNISH_PE
    if (returned_class == desired_class_ids[i])
      new_output_flags[current_winner_pe] <- LVQ_REWARD_PE
    n$set_misc_values_at(3, new_output_flags)

    n$encode_at(2)
  }

# done encoding.

# recall all data:

lvq_recalled_winning_nodes <-
  apply(n$recall_dataset(data, 1, 3, TRUE), 1, which.min)

# translate winning node to class id:
lvq_recalled_class_ids <-
  ceiling(lvq_recalled_winning_nodes / number_of_output_pes_per_class)

plot(data, pch = lvq_recalled_class_ids, main = "LVQ recalled clusters (module)")

correct <- lvq_recalled_class_ids == desired_class_ids
cat("Correct:", sum(correct), "\n")
cat("Number of produced classes:", length(unique(lvq_recalled_class_ids)), "\n")

Depending on the random initial weights, this may learn the data with fewer data vectors misclassified. I ran it several times (with number_of_output_pes_per_class <- 4) and sometimes had 145/150 reported as correct (vs the usual 136/150 in the original version). But I also got a couple of runs where 133/150 correct were reported.

Now, the family of NNs that Kohonen talked about had many variations. In cases like these, one could also add a mechanism where a neighborhood of nodes is adjusted (instead of just one; I have implemented such a mechanism in the unsupervised LVQ version of this package).

Some closing notes:

I would not expect this package's NNs to be particularly fast on big data sets (especially if R for-loops are employed, as I did above). I always hoped that some user of the package (and underlying C++ library) would implement runtime optimizations, hardware parallelism and the like, but this has not happened yet. However, for a little runtime speed benefit, I may translate the above code into the supervised LVQ's C++ code in the package. It seems useful enough. But I will wait for your suggestions first, as other things are also pressuring me and time is a bit limited.

I enjoy this discussion since I used to use NN models in the past much more than I have a chance to do these days. However, if you don’t mind, let’s close this “issue” and continue the conversation via email (it is listed in my GitHub page). Btw, while there, please star the package repository and don’t forget to cite it if you use it in any work 😊.

VNNikolaidis commented 4 months ago

@drag05 On second thought, let's close this when there is a C++ implementation.

drag05 commented 4 months ago

@VNNikolaidis I leave the closing of this issue at your discretion. Now (and I don't mean to be presumptuous, even if I may seem so at times), if I set:

desired_class_ids <- as.integer(iris$Species) - 1L

then, this happens:

> cat("Correct:", sum(correct), "\n")
Correct: 50 
> cat("Number of produced classes:", length(unique(lvq_recalled_class_ids)), "\n")
Number of produced classes: 1 

with:

> desired_class_ids
  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [40] 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [79] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[118] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
VNNikolaidis commented 4 months ago

A quick reply for now:

(a) What you say about parallelization techniques may be right. But we never got to try parallelization for this package, although I have used it in other contexts. And since this is a package on CRAN, changes should be made with caution, maintain backwards compatibility etc. Any help is welcome. But I will look into your suggestions.

(b) Regarding the comment:

# create a vector of desired class ids (**starting from 0**):

this comment was a leftover (which I overlooked) from the original LVQs documentation example, which directly uses the C++ LVQ code. So it does not apply when we use the NN module. The NN module lives (mostly) in R (but uses C++ NN components) and thus indexes start at 1 (it "translates" them if needed). If you look at the documentation example for LVQs, ids are subtracted by 1. In the example discussed here, they are not. I just forgot to delete the comment. Ignore the "starting from 0" part of it. That is why things went wild when you subtracted 1 from the indexes.

[update: I removed the 'starting from 0' part of the comment in the code samples above]

(c) Regarding the question mark next to "codebook" vectors, ignore it as well. I just wanted to be cautious, having left the LVQ scene many years now, I was not sure about the terminology.

(d) Weight visualization. Well, the NN module allows you to get the weights (and display them if you like), set the weights and much more. It is pretty versatile for educational or prototyping purposes. If you want, I can paste the code I used for that plot, but you probably can do it yourself. The trick was a n$get_weights_at(2) , 2 indicating the 2nd NN component, i.e. the set of connections since this NN consists of a layer of input nodes (component 1) a connection set (component 2) and a layer of output nodes (component 3).
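For completeness, a sketch of that weight extraction (it assumes the trained n, input_length, iris_s and lvq_recalled_class_ids from the NN-module example earlier in this thread, and that get_weights_at returns the weights in the listing order shown above):

```r
# extract the connection weights of component 2 (the connection set) and view
# them as codebook vectors, one row per output node:
w <- matrix(n$get_weights_at(2), ncol = input_length, byrow = TRUE)

# overlay the prototype vectors on the first two data dimensions:
plot(iris_s[, 1:2], pch = lvq_recalled_class_ids,
     main = "data and codebook vectors")
points(w[, 1:2], pch = 4, cex = 2)
```

Each row of w is then one prototype; repeating this inside the encoding loop is what produces the spaghetti-like weight-trajectory plots mentioned earlier.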

What you did not tell me is about the multiple output nodes per class. Is that what you were trying to do?

drag05 commented 4 months ago

@VNNikolaidis A) I am not sure foreach would create backward-compatibility issues, other than the fact that, being explicit parallelization, it would require the data to be split into blocks and each block sent to a worker (CPU core). This would create a different situation compared to the sequential approach, since the codebook vectors created on any data block learn only about the data inside that block, with no knowledge of the data in other blocks.

I call these codebook vectors "local vectors", as they are specialized on a subset (block) of initial data.

Since the split of the initial data will be stratified by class id (so that each block receives data from all present classes), the end result of learning will be that each class will get at least as many local vectors as there are workers, or more (depending on setup).

Another option would be to use all obtained local vectors as a new, dimension-reduced input dataset and re-encode them (sequentially maybe) and create a new set of (fewer) codebook vectors for prediction purposes.

I would consider these codebook vectors "global" as they now contain information about the entire initial dataset.
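Assuming such "local" vectors were gathered, the re-encoding stage might be sketched like this (hypothetical names: local_vectors for the codebook vectors gathered from all workers, local_classes for their class ids starting at 0; the 100 epochs are arbitrary):

```r
library(nnlib2Rcpp)

# hypothetical second stage: treat the gathered "local vectors" as a new
# (smaller) training set and encode them into a single "global" LVQ:
lvq <- new("LVQs")
lvq$encode(local_vectors, local_classes, 100)

# the resulting global model can then be used for prediction:
global_ids <- lvq$recall(local_vectors)
```

This is just the idea in code form; whether the global vectors retain enough information from the full dataset is exactly the open question above.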

Frankly, I have no idea of what effect parallelization would have on prediction accuracy.

If I find the time, I'll try parallelization of current encode method on an external dataset and compare the sequential and parallel runs, just to get an idea about execution time and performance.

B) Yes, I am trying to gain some control on the number of codebook vectors. That seems to be a good approach.

Thank you!

VNNikolaidis commented 4 months ago

What you discuss in (A) above is an interesting problem. It reminds me of techniques used for handling big-data problems, where the dataset is split into smaller, more manageable subsets. Anyway, for any parallelization, the question is whether the time benefits will be greater than the overhead. This is probably why the trend is GPU processing.

(Of course, if you try any of these ideas I would be interested to learn the outcome.)

There is also the possibility of enabling some form of parallelism in the underlying C++ library, at NN component level (i.e. for layer and connection_set classes) as these (by design) process data in a way that is easily parallelized.

With all these comments, I may have missed something you asked, if so let me know.

drag05 commented 4 months ago

@VNNikolaidis

I don't think there is anything you might have missed. I have started on a script that tests foreach parallelization done outside nnlib2Rcpp, just to figure out performance and time. I am currently at the data-processing level.

If you are interested please find below an example of stratified data split for normalized iris data and the associated Species encoded (I have put all data normalization steps in a custom function for convenience).

There is, I think, a salient point here: the data can be normalized before or after the split, with obvious implications regarding maxima and minima. I will test both options, but I wonder which makes sense from an ML point of view - maybe both.

There are as many CPU cores (workers) as there are classes of Species, and as many data blocks. Each data block contains data from all three Species.

Here are the after-split normalized matrices in the norm_ll list (just the heads):

> lapply(norm_ll, head, 9)
$`1`
      Sepal.Length Sepal.Width Petal.Length Petal.Width
 [1,]      0.18750   0.6315789   0.03508772  0.04166667
 [2,]      0.06250   0.4736842   0.01754386  0.04166667
 [3,]      0.28125   0.8421053   0.08771930  0.12500000
 [4,]      0.15625   0.5789474   0.05263158  0.04166667
 [5,]      0.12500   0.4210526   0.05263158  0.00000000
 [6,]      0.18750   0.7894737   0.05263158  0.08333333
 [7,]      0.18750   0.5263158   0.08771930  0.16666667
 [8,]      0.21875   0.6315789   0.05263158  0.04166667
 [9,]      0.31250   1.0000000   0.03508772  0.04166667

$`2`
      Sepal.Length Sepal.Width Petal.Length Petal.Width
 [1,]   0.16666667   0.3636364   0.05357143  0.04166667
 [2,]   0.19444444   0.6363636   0.05357143  0.04166667
 [3,]   0.08333333   0.5454545   0.05357143  0.08333333
 [4,]   0.13888889   0.5454545   0.08928571  0.04166667
 [5,]   0.00000000   0.3636364   0.00000000  0.00000000
 [6,]   0.41666667   0.8181818   0.01785714  0.04166667
 [7,]   0.38888889   1.0000000   0.07142857  0.12500000
 [8,]   0.30555556   0.7727273   0.03571429  0.12500000
 [9,]   0.22222222   0.6818182   0.07142857  0.12500000

$`3`
      Sepal.Length Sepal.Width Petal.Length Petal.Width
 [1,]      0.06250   0.5238095   0.08928571  0.04347826
 [2,]      0.00000   0.4285714   0.07142857  0.04347826
 [3,]      0.31250   0.8095238   0.08928571  0.04347826
 [4,]      0.12500   0.4761905   0.07142857  0.00000000
 [5,]      0.21875   0.7142857   0.07142857  0.08695652
 [6,]      0.40625   0.8571429   0.12500000  0.08695652
 [7,]      0.31250   0.6666667   0.12500000  0.04347826
 [8,]      0.06250   0.7619048   0.00000000  0.04347826
 [9,]      0.18750   0.4761905   0.10714286  0.04347826

and here is the list of coded Species associated with each data block. As you can see, the lengths are not all equal and, for now, worker 1 receives less work:

> ids
$`1`
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

$`2`
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

$`3`
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3

Thank you!

drag05 commented 4 months ago

@VNNikolaidis

These are the first outputs: below are the recall indexes from the 3 parallel workers and the network structure. The only explanation I can think of regarding the lengths is that one worker had 54 rows of data and the outputs of the others may have been recycled to the longest length.

> recall(lvq) (transposed for readability)
      result.1 result.2 result.3
 [1,]        0        0        0
 [2,]        0        0        0
 [3,]        0        0        0
 [4,]        0        0        0
 [5,]        0        0        0
 [6,]        0        0        0
 [7,]        0        0        0
 [8,]        0        0        0
 [9,]        0        0        0
[10,]        0        0        0
[11,]        0        0        0
[12,]        0        0        0
[13,]        0        0        0
[14,]        0        1        0
[15,]        0        1        0
[16,]        0        1        0
[17,]        0        1        0
[18,]        2        2        0
[19,]        1        2        0
[20,]        1        2        0
[21,]        1        1        2
[22,]        1        1        1
[23,]        2        1        1
[24,]        1        2        1
[25,]        1        2        1
[26,]        1        2        1
[27,]        1        1        1
[28,]        1        1        1
[29,]        1        1        1
[30,]        1        1        1
[31,]        1        1        1
[32,]        1        2        1
[33,]        1        2        1
[34,]        1        2        1
[35,]        2        2        1
[36,]        2        2        2
[37,]        2        2        2
[38,]        2        2        2
[39,]        2        2        2
[40,]        2        2        1
[41,]        2        2        2
[42,]        2        2        2
[43,]        2        2        2
[44,]        1        2        2
[45,]        2        2        2
[46,]        2        2        2
[47,]        2        2        2
[48,]        2        0        2
[49,]        2        0        2
[50,]        0        0        2
[51,]        0        0        2
[52,]        0        0        2
[53,]        0        0        2
[54,]        0        0        2

Below is the network structure (a bit garbled) obtained from the log file. Please let me know if you can make sense of it and re-arrange it so that I can make sense of it. Is there a way to extract the codebook vectors from this? Also, it seems one can only save the trained network worker-by-worker and not as a whole. Probably having this parallelization implemented at the C++ level would offer better choices.

starting worker pid=21024 on localhost:11927 at 11:54:24.871
starting worker pid=3008 on localhost:11927 at 11:54:25.040
starting worker pid=11852 on localhost:11927 at 11:54:25.208
LVQ created, now encode data (or load NN from file).
LVQ created, now encode data (or load NN from file).
LVQ created, now encode data (or load NN from file).
Setting up LVQ for 0 to 2 ids (3 classes).
LVQ will be trained for 3 classes.
Setting up LVQ for 0 to 2 ids (3 classes).
LVQ will be trained for 3 classes.
Setting up LVQ for 0 to 2 ids (3 classes).
LVQ will be trained for 3 classes.
Training Finished.
Training Finished.
Training Finished.
Learning Vector Quantizer NN (Class LVQs):
------Network structure (BEGIN)--------
Component: Kohonen_LVQ
ID: 11
Type: Artificial_Neural_System
Aux.Param: 0
Input_Dim: 4
OutputDim: 3
NumCompon: 3
Component: Input
ID: 12
Type: Layer
Aux.Param: 0
VectSize(elements): 4
0: PE B: 0 M: 0
1: PE B: 0 M: 0
[... the remainder of the listing is hopelessly interleaved: all three workers wrote their network structures to the same log file simultaneously, so only fragments such as connection weights and "Lvq returned 3 clusters with ids ..." are recognizable ...]
This is the code chunk used for parallelization:

# set cluster
cl <- parallel::makeCluster(workers, outfile = "log.txt")
doParallel::registerDoParallel(cl)

tt <- foreach(i = 1:workers, .combine = 'rbind', .packages = 'nnlib2Rcpp') %dopar% {
    lvq <- new("LVQs")
    lvq$encode(normll[[i]], idds[[i]], 50)
    show(lvq)
    lvq$recall(normll[[i]])
}

stopCluster(cl)

where:

Thank you!

drag05 commented 4 months ago

The weird length of the recall indices was due to the .combine = 'rbind' used earlier in foreach; the lengths were recycled to fit the array dimensions. I have replaced it with 'c'. I have also increased the number of epochs to 500:

The input ids:

> in_ids
   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [47] 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
 [93] 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2
[139] 2 2 2 2 2 2 2 2 2 2 2 2

The output (recalled) ids:

> out_ids
  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [47] 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 2 2 2
 [93] 2 2 2 2 2 2 1 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 2 1 2 1 1 2 2 1 2 2 1 2 1 2 1 1 1 2 2 1 2 2 2
[139] 2 2 2 2 2 2 2 2 2 2 2 2

Now the lengths match.

The performance is:

> table(in_ids == out_ids)

FALSE  TRUE 
   13   137 
VNNikolaidis commented 4 months ago

@drag05 Lots of information in these last comments, I will try to contribute something to it. I find what you tried (i.e. parallelization based on foreach and splitting the data) very interesting and possibly useful for larger data problems (not iris).

About your comment "the data can be normalized before or after split with obvious implications regarding maxima and minima", I would probably select "before", since: (a) if your goal is to use the "local vectors" (as you call them) to synthesize something of more global meaning, then having a single set of min, max and (especially) standard deviation values would simplify the process, as all "local vectors" would be compatible. Furthermore, (b) if you plan to use these vectors to classify new data and had used multiple sets of min, max and standard deviation values, which set would you use to scale that data to make it compatible with your model? It is possible, but more complex as well, and why make things complex?
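To sketch the "normalize before splitting" option in base R (the norm_ll / ids names mirror those used earlier in the thread; workers = 3 and the seed are arbitrary choices):

```r
# scale the whole dataset once (global min/max), then split it into
# stratified blocks so every worker sees data from all classes:
iris_s <- as.matrix(iris[1:4])
rng    <- apply(iris_s, 2, range)
iris_s <- sweep(sweep(iris_s, 2, rng[1, ], "-"), 2, rng[2, ] - rng[1, ], "/")

workers <- 3
set.seed(42)

# assign each row a worker id, sampling evenly within each Species:
block <- ave(seq_len(nrow(iris_s)), iris$Species,
             FUN = function(i) sample(rep_len(1:workers, length(i))))

norm_ll <- split.data.frame(iris_s, block)          # one matrix per worker
ids     <- split(as.integer(iris$Species), block)   # matching class ids
```

With a single global scaling, the "local vectors" from different workers live in the same coordinate system and can be compared or merged directly.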

About your comment “Below is the network structure (a bit garbled) obtained from the log file. Please let me know if you can make sense of it and re-arrange it so that I can make sense of it.”: no, I think all parallel workers write text to the log file simultaneously, so the result is unreadable and of no use.

About your comment “Also - it seems - one can only save the trained network worker-by-worker and not as a whole”: I find this normal, since each worker has created its own new LVQ. And, unfortunately (although not surprisingly), I did try to collect them in a list (outside the foreach) but this failed due to the cursed ‘external pointer’ issue (I wonder if there is a way to overcome that).

I had some other ideas but, not having played with parallel and doParallel much, could you save me some time and tell me how you extracted the classification ids from the workers (for example, how you got the results you show for recall(lvq), transposed for readability)? I was not able to replicate that in my (I admit, very quick) tests.

P.S. If you don’t mind, at some point (not yet) I will close the "issue". When this is done, we can continue this discussion via e-mail. Regards.

drag05 commented 4 months ago

@VNNikolaidis

I do not have your email address and could not find one on your ORCID page either. Replying to GitHub posts directly by email could be a solution, although I would prefer using email directly.

Anyway, until we figure this out, some things have changed since my last post which I need to clarify:

network structure (the three workers' printouts, de-interleaved from the log):

--- worker 1 ---
LVQ created, now encode data (or load NN from file).
Setting up LVQ for 0 to 2 ids (3 classes). LVQ will be trained for 3 classes.
Training Finished.
Learning Vector Quantizer NN (Class LVQs):
------Network structure (BEGIN)--------
Component: Kohonen_LVQ  ID: 11  Type: Artificial_Neural_System  Aux.Param: 0  InputDim: 4  OutputDim: 3  NumCompon: 3
Component: Input  ID: 12  Type: Layer  Aux.Param: 0  VectSize(elements): 4
0: PE B: 0 M: 0  1: PE B: 0 M: 0  2: PE B: 0 M: 0  3: PE B: 0 M: 0
Component: (Fully_Connected)  ID: 14  Type: Connection_Set  Aux.Param: 0  SourceCom: 12  DestinCom: 13  ListSize(elements): 12
0: CON FR: 0 TO: 0 WGT: 0.18554
1: CON FR: 1 TO: 0 WGT: 0.711207
2: CON FR: 2 TO: 0 WGT: 0.0508146
3: CON FR: 3 TO: 0 WGT: 0.0178742
4: CON FR: 0 TO: 1 WGT: 0.39504
5: CON FR: 1 TO: 1 WGT: 0.397828
6: CON FR: 2 TO: 1 WGT: 0.346925
7: CON FR: 3 TO: 1 WGT: 0.327827
8: CON FR: 0 TO: 2 WGT: 0.700888
9: CON FR: 1 TO: 2 WGT: 0.534542
10: CON FR: 2 TO: 2 WGT: 0.804835
11: CON FR: 3 TO: 2 WGT: 0.875498
Component: Output  ID: 13  Type: Layer  Aux.Param: 0  VectSize(elements): 3
0: PE B: 0 M: 20  1: PE B: 0 M: 20  2: PE B: 0 M: 30
--------Network structure (END)--------
Lvq returned 3 clusters with ids: 0 2 1

--- worker 2 ---
LVQ created, now encode data (or load NN from file).
Setting up LVQ for 0 to 2 ids (3 classes). LVQ will be trained for 3 classes.
Training Finished.
Learning Vector Quantizer NN (Class LVQs):
------Network structure (BEGIN)--------
Component: Kohonen_LVQ  ID: 11  Type: Artificial_Neural_System  Aux.Param: 0  InputDim: 4  OutputDim: 3  NumCompon: 3
Component: Input  ID: 12  Type: Layer  Aux.Param: 0  VectSize(elements): 4
0: PE B: 0 M: 0  1: PE B: 0 M: 0  2: PE B: 0 M: 0  3: PE B: 0 M: 0
Component: (Fully_Connected)  ID: 14  Type: Connection_Set  Aux.Param: 0  SourceCom: 12  DestinCom: 13  ListSize(elements): 12
0: CON FR: 0 TO: 0 WGT: 0.179848
1: CON FR: 1 TO: 0 WGT: 0.595971
2: CON FR: 2 TO: 0 WGT: 0.0582949
3: CON FR: 3 TO: 0 WGT: 0.0832171
4: CON FR: 0 TO: 1 WGT: 0.40475
5: CON FR: 1 TO: 1 WGT: 0.334086
6: CON FR: 2 TO: 1 WGT: 0.524886
7: CON FR: 3 TO: 1 WGT: 0.489111
8: CON FR: 0 TO: 2 WGT: 0.639938
9: CON FR: 1 TO: 2 WGT: 0.398816
10: CON FR: 2 TO: 2 WGT: 0.748187
11: CON FR: 3 TO: 2 WGT: 0.843839
Component: Output  ID: 13  Type: Layer  Aux.Param: 0  VectSize(elements): 3
0: PE B: 0 M: 20  1: PE B: 0 M: 20  2: PE B: 0 M: 30
--------Network structure (END)--------
Lvq returned 3 clusters with ids: 0 2 1

--- worker 3 ---
LVQ created, now encode data (or load NN from file).
Setting up LVQ for 0 to 2 ids (3 classes). LVQ will be trained for 3 classes.
Training Finished.
Learning Vector Quantizer NN (Class LVQs):
------Network structure (BEGIN)--------
Component: Kohonen_LVQ  ID: 11  Type: Artificial_Neural_System  Aux.Param: 0  InputDim: 4  OutputDim: 3  NumCompon: 3
Component: Input  ID: 12  Type: Layer  Aux.Param: 0  VectSize(elements): 4
0: PE B: 0 M: 0  1: PE B: 0 M: 0  2: PE B: 0 M: 0  3: PE B: 0 M: 0
Component: (Fully_Connected)  ID: 14  Type: Connection_Set  Aux.Param: 0  SourceCom: 12  DestinCom: 13  ListSize(elements): 12
0: CON FR: 0 TO: 0 WGT: 0.187101
1: CON FR: 1 TO: 0 WGT: 0.641769
2: CON FR: 2 TO: 0 WGT: 0.082418
3: CON FR: 3 TO: 0 WGT: 0.0562861
4: CON FR: 0 TO: 1 WGT: 0.413665
5: CON FR: 1 TO: 1 WGT: 0.312246
6: CON FR: 2 TO: 1 WGT: 0.512741
7: CON FR: 3 TO: 1 WGT: 0.44086
8: CON FR: 0 TO: 2 WGT: 0.60331
9: CON FR: 1 TO: 2 WGT: 0.432903
10: CON FR: 2 TO: 2 WGT: 0.773317
11: CON FR: 3 TO: 2 WGT: 0.792565
Component: Output  ID: 13  Type: Layer  Aux.Param: 0  VectSize(elements): 3
0: PE B: 0 M: 20  1: PE B: 0 M: 20  2: PE B: 0 M: 30
--------Network structure (END)--------
Lvq returned 3 clusters with ids: 0 2 1

The network performance - with normalization _after_ split - on recall is:

FALSE  TRUE 
   13   137 

Where the recall classes are:

> table(tt)
tt
 0  1  2 
50 41 59 


I will implement normalization _before_ split.

Thank you!

drag05 commented 4 months ago

@VNNikolaidis

The following comparison implements data encoding before split.

I am now focusing on sequential vs. parallel execution time, disregarding prediction performance for reasons rooted in the structure of the data used (which, if you are interested, we can discuss later).

I have a data table of over 1 million cases containing a target variable with two labels, which I take as "ground truth" since they were derived deterministically. The data is quite unbalanced with respect to the labels, as shown below:

total cases
1048575

label distribution
label   rows
    0  856889
    1  191686

So, if I were to always guess label 0, the misclassification rate would be:

naive guess error rate
191686 / 1048575 ≈ 0.183

Any fitted predictive model should do better than this.
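As a quick sanity check in R, the baseline error of always predicting the majority label is just the minority-label fraction (counts taken from the distribution above):

```r
counts <- c(`0` = 856889, `1` = 191686)  # label distribution from above

# always guessing the majority label (0) misclassifies every '1' case
baseline_error <- as.numeric(counts["1"] / sum(counts))
round(baseline_error, 3)  # 0.183
```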

Execution time: I have trained the network for 1000 epochs in each case: 1 worker (equivalent to sequential), and 2 and 3 workers in parallel. I consider the load balance of the workers acceptable, as shown below for the parallel cases:

train the network
training epochs = 1000

1 worker (sequential)

worker load

     worker
label      1
    0       856889
    1       191686

time to run sequential

> system.time(source("script.R"))

 user  system elapsed 
 2.10    0.28    1055.30 

2 workers (parallel)

worker load

     worker
label      1      2
    0 428530 428359
    1  96041  95645

time to run parallel (2)

> system.time(source("script.R"))

 user  system elapsed 
 2.21    0.29    565.54 

3 workers (parallel)

worker load

     worker
label      1      2         3
    0 285554 285649 285686
    1  63664  63928  64094

time to run parallel (3)

> system.time(source("script.R"))

  user  system elapsed 
  1.95    0.24    407.20 

Parallelization overhead starts to be felt once more than two workers are used. Still, the parallel solution clearly helps with execution time on large datasets.
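The elapsed times reported above correspond to these speedups and per-worker efficiencies (a quick arithmetic check, nothing more):

```r
elapsed <- c(1055.30, 565.54, 407.20)   # seconds, for 1, 2 and 3 workers
workers <- c(1, 2, 3)

speedup    <- elapsed[1] / elapsed      # relative to the sequential run
efficiency <- speedup / workers        # fraction of the ideal linear speedup

round(speedup, 2)     # 1.00 1.87 2.59
round(efficiency, 2)  # 1.00 0.93 0.86
```

The falling efficiency (1.00 to 0.93 to 0.86) is the overhead showing up.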

Every recall in the above cases returned a single cluster, label 0 only (the network is underfit and just guessing the majority label). Nevertheless, the codebook vectors show non-zero weights in each case.

As I mentioned, the structure of the data may be the cause of the underfitting. Increasing the number of codebook vectors could help, I think.

Regarding the iris data, encoding before the split improved recall performance compared with the encoding-after-split approach used earlier. In this case:

tt
FALSE  TRUE 
    9   141 

but the performance is impacted by the worker load, which fluctuates from run to run due to the small size of the dataset (small sample --> high prediction variance).
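One way to stabilize the per-worker load on small datasets is a stratified split, dealing each label's rows out round-robin across workers; a sketch (the variable names and the use of iris here are my assumptions, not code from the thread):

```r
set.seed(42)
workers <- 3
ids <- as.integer(iris$Species) - 1L  # labels 0..2

# within each label, shuffle then deal rows out round-robin, so every
# worker sees (almost) exactly the same label mix on every run
assign_worker <- integer(length(ids))
for (lab in unique(ids)) {
  rows <- sample(which(ids == lab))
  assign_worker[rows] <- rep(1:workers, length.out = length(rows))
}

table(ids, assign_worker)  # near-equal counts in every cell
```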

Thank you!

VNNikolaidis commented 4 months ago

My turn for “tunnel vision”, not noticing that tt =… OK then, here is a quick and not particularly elegant idea, just to examine (view) the 3 LVQs:

# set up cluster
library(foreach)                 # provides foreach() and %dopar%
cl = parallel::makeCluster(workers, outfile = "log.txt")
doParallel::registerDoParallel(cl)

tt = foreach(i = 1:workers,
             .combine = 'rbind',
             .packages = 'nnlib2Rcpp') %dopar% {
               lvq = new("LVQs")
               lvq$encode(normll[[i]], idds[[i]], 50)
               # return, per worker, both the printed structure and the recall ids
               list(capture.output(show(lvq)),
                    lvq$recall(normll[[i]]))
             }
parallel::stopCluster(cl)

so then you can do things like:

for (i in 1:workers)
{
  # net printout (vector of character objects, one per line):
  cat( tt[[i, 1]], sep = "\n" )

  # print recall ids:
  print(tt[[i, 2]])
}

It would be easy, however, to add a method for directly getting/setting weights from/to LVQs objects (similar to that of NN objects), but I would need a little time, as other things are pressing me right now.

You are right, my email is not shown on GitHub as I thought it was. For later reference, some contact info, ORCID etc. is in the package DESCRIPTION file (and listed on CRAN, which requires it). I would use a different private email address for further correspondence (which I prefer not to mention publicly here), but you can start from there once this issue is closed.

drag05 commented 4 months ago

@VNNikolaidis OK, got it.

VNNikolaidis commented 4 months ago

@drag05 Sorry, I missed one of your comments from yesterday. I am reading it now and will comment when I get back to my computer.

I also need to write this note before I forget it: as I remember, the unsupervised LVQ has an internal iteration (epoch?) counter that reduces the allowable changes to the weights (codevectors) as training proceeds. This is by design (I was following a book's description of LVQ/SOM when implementing it). I believe the maximum number of epochs, at which the allowable changes reach 0, is pretty high (probably 10000). Remind me to double-check whether this mechanism is also supposed to affect LVQ in supervised mode.

drag05 commented 4 months ago

@VNNikolaidis It should have a decreasing learning rate, of the form

initial value * (1 - (epoch / number of epochs))
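That schedule can be written out directly (a sketch of the formula as stated above, not the package's internal code; the initial value of 0.1 is an arbitrary assumption):

```r
# linearly decaying learning rate: starts at `initial`,
# reaches 0 at the final epoch
lr <- function(epoch, n_epochs, initial = 0.1)
  initial * (1 - epoch / n_epochs)

lr(0, 1000)     # 0.1  (initial value)
lr(500, 1000)   # 0.05 (halfway)
lr(1000, 1000)  # 0    (final epoch)
```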

I will close this issue now and next reply will be to your email address. Thank you!