InTRA-USP / IntraSOM

Self-Organizing Maps library of the InTRA research center.

Dying kernel #13

Open akol67 opened 1 year ago

akol67 commented 1 year ago

I discovered a magic number by trial and error... Let mapsize = (ncol, nlin).

If (ncol x nlin)/64 > 8, the kernel dies...

64 = number of CPUs
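
For reference, a minimal sketch (not from the original report) of this rule of thumb in Python; ncol, nlin and the CPU count are whatever your setup uses:

def kernel_likely_to_die(ncol, nlin, n_cpus=64):
    # Rule of thumb from above: more than 8 neurons per CPU seems to kill the kernel
    return (ncol * nlin) / n_cpus > 8

print(kernel_likely_to_die(16, 10))  # 160 / 64 = 2.5  -> False (expected to be safe)
print(kernel_likely_to_die(27, 27))  # 729 / 64 = 11.4 -> True  (expected to die)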

rafaelgioria commented 1 year ago

"If (ncol x nlin)/64 > 8, the kernel dies... 64 = number of CPUs"

Hmmm, curious, let me try it.

@akol67 just to keep context, could you refer to the problem by linking to the message describing it in another issue, or describe it here?

akol67 commented 1 year ago

The Jupyter notebook kernel dies when you choose an inappropriate parameterization, in this case a map size that is too big.


rafaelgioria commented 1 year ago

I got a dying kernel for a huge 150 x 150 map in Colab. I tried 100 x 100 too, and it also died in Colab.

The error was running out of available RAM.

So I think it is not dependent on the number of cores; the limit is the RAM needed to initialize the SOM training.

akol67 commented 1 year ago

Good test, you reproduced the error. Here I have 256 GB in the VDI. However, I have a way to run on a bigger machine. By the way, if you don't specify the map size, do you think the internal calculation is good enough?


rafaelgioria commented 1 year ago

[image: RAM usage snapshot during SOM initialization]

This snapshot is for a 70 x 70 mapsize. Note the RAM peak at the initialization of SOM training. The dataset is the animals dataset from the tutorial.

akol67 commented 1 year ago

I ran without specifying the mapsize and it flew. The internally calculated size must be beyond RAM capacity.


rafaelgioria commented 1 year ago

The automatic map size is computed using the Vesanto et al. (2000) heuristic relation: the total number of neurons is 5*sqrt(N), with N being the number of samples.

This should not be huge. It is around a 27 x 27 mapsize for 20,000 samples.
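
A minimal sketch of that heuristic (illustrative only; the library's internal rounding and aspect-ratio choice may differ):

import math

def heuristic_mapsize(n_samples):
    n_neurons = 5 * math.sqrt(n_samples)  # Vesanto et al. (2000): total neurons = 5*sqrt(N)
    side = round(math.sqrt(n_neurons))    # assume a roughly square map
    return (side, side)

print(heuristic_mapsize(20000))  # (27, 27), matching the estimate above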

akol67 commented 1 year ago

According to my rule of thumb, 27 x 27 / 64 = 11 (>> 8). This is too high for my available RAM (256 GB); the kernel couldn't take it.


akol67 commented 1 year ago

Testing with a 1.4 TB machine, mapsize = (30, 24).

It's running...


rafaelgioria commented 1 year ago


That's really odd. We do run cases like this locally with less RAM. We will keep investigating it.

Edit: it's really odd to require so much RAM for such a small map. One can run a bigger map than this on free Colab machines (RAM is typically 16 GB or less).

rodiegeology commented 1 year ago

I'll investigate and return ASAP. Usually very large maps require more RAM because of the Euclidean distance matrix needed to calculate the distances between the input vectors and the training neurons. I have been able to run maps up to 70x70 on 300,000 samples (5 features) on my personal computer (12 GB RAM). I'll look and try to reproduce the error. Thanks for the heads up.
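
To make that scaling concrete, a back-of-the-envelope sketch (not the library's code; assumes a dense float64 matrix) of the distance-matrix footprint:

def dist_matrix_gib(n_samples, ncol, nlin, bytes_per_value=8):
    # Dense matrix of distances between N input vectors and ncol*nlin neurons
    return n_samples * ncol * nlin * bytes_per_value / 2**30

print(dist_matrix_gib(300_000, 70, 70))   # ~10.9 GiB, roughly the 12 GB case above
print(dist_matrix_gib(20_000, 150, 150))  # ~3.4 GiB for a 150 x 150 map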

rafaelgioria commented 1 year ago

@akol67 and @rodiegeology

You could profile the memory usage of each cell using a Jupyter cell magic.

You need to install a memory profiler, for example (the ! is needed to install from Colab cells):

!pip install memory-profiler

and load the memory profiler into the Jupyter session in a cell:

%reload_ext memory_profiler

It should report the peak memory usage and maybe help us address this issue.

To use it, add the %%memit cell magic on the first line of the Jupyter cells used for the setup:

%%memit 

mapsize = (80,80)
som_test = intrasom.SOMFactory.build(data,
                                     mask=-9999,
                                     mapsize=mapsize,
                                     mapshape='toroid',
                                     lattice='hexa',
                                     normalization='var',
                                     initialization='random',
                                     neighborhood='gaussian',
                                     training='batch',
                                     name='Example',
                                     component_names=None,
                                     unit_names = None,
                                     sample_names=None,
                                     missing=False,
                                     save_nan_hist = True,
                                     pred_size=0)

I quote my output here for 153021 samples and 23 features:

Loading dataframe...
Normalizing data...
Creating neighborhood...
Initializing map...
Creating Neuron Distance Rows: 100%
80/80 [00:09<00:00, 7.22rows/s]
peak memory: 645.91 MiB, increment: 367.13 MiB

For the training cell:

%%memit 
som_test.train(train_len_factor=5,
               previous_epoch = False,
               bootstrap=False,
               )

with the output here:

Starting Training...
Rough Training:
Epoch: 10. Radius:6.75. QE: 4.5431: 100%
10/10 [04:56<00:00, 28.29s/it]
Fine Tuning:
Epoch: 15. Radius:1.0. QE: 4.0931: 100%
15/15 [06:55<00:00, 27.86s/it]
Saving...
Training Report Created
Training completed successfully.
peak memory: 1626.34 MiB, increment: 980.27 MiB

akol67 commented 1 year ago

I'm filling out a table of the peak memory values reached as the map size increases: 16x10, 21x15, 26x20, ...
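
In case it helps, a sketch (not from this thread) of how that table could be collected programmatically with memory_profiler's memory_usage(); the build arguments mirror the setup cell quoted above, and data is a placeholder for the training dataframe:

from memory_profiler import memory_usage
import intrasom

def build_som(data, mapsize):
    return intrasom.SOMFactory.build(data,
                                     mapsize=mapsize,
                                     mapshape='toroid',
                                     lattice='hexa',
                                     normalization='var',
                                     initialization='random',
                                     neighborhood='gaussian',
                                     training='batch')

for mapsize in [(16, 10), (21, 15), (26, 20)]:
    # max_usage=True returns the peak RSS (in MiB) while build_som runs
    peak = memory_usage((build_som, (data, mapsize)), max_usage=True)
    print(mapsize, 'peak memory:', peak, 'MiB')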


akol67 commented 1 year ago

Note: the memory footprint seems low, even for large sizes. Even so, I had to change machines (VDI to Geocolab) to be able to run without the kernel crashing.

[image: image.png]

rafaelgioria commented 1 year ago

This still sounds like an issue with the VDI, and maybe the code.

Edit: @akol67, we didn't get the image you sent in the discussion.


akol67 commented 1 year ago


It seems to me that when using the default mapsize (or a little bigger), the kernel works fine without dying.

akol67 commented 12 months ago

Placing the SOM inside a for loop was creating problems for me. After a few attempts, the kernel died in the second or third iteration. The solution to run on the supercomputer (with 8 CPUs available) was to include the following lines in the code:

import os
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['GOTO_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'

I'm using n_job=8 in the som.train parameters.

Now the kernel is not dying anymore.
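
For completeness, a sketch of the workaround as a whole (data and the mapsize list are placeholders; n_job is the parameter name quoted above): pin the BLAS/OpenMP thread pools to 1 before any numpy-backed import, then let som.train parallelize:

import os
# These must be set before numpy/BLAS-backed libraries are imported to take effect
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['GOTO_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'

import intrasom

for mapsize in [(16, 10), (21, 15), (26, 20)]:
    som = intrasom.SOMFactory.build(data, mapsize=mapsize, mapshape='toroid',
                                    lattice='hexa', training='batch')
    som.train(train_len_factor=5, previous_epoch=False, bootstrap=False, n_job=8)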