apple / turicreate

Turi Create simplifies the development of custom machine learning models.
BSD 3-Clause "New" or "Revised" License

kernel: Out of memory: Kill process 27755 (python) score 614 or sacrifice child #3129

Open Lif0820 opened 4 years ago

Lif0820 commented 4 years ago

I train an image classification model on GPU. After the last iteration is done, the python process is killed because free system memory drops sharply. It seems like something goes wrong at the end of the create() method.

Now I will try the same training on CPU to see whether the problem still happens.

tc.config.set_num_gpus(-1)
model = tc.image_classifier.create(train_data, target='label', max_iterations=30)
print('training is done')
model.save('dsn.model')
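
For anyone following along, a minimal sketch of the CPU-only run mentioned above; it assumes the documented behavior of tc.config.set_num_gpus (0 disables GPU use, -1 uses all available GPUs) and reuses train_data and the column names from the snippet above:

import turicreate as tc

# Force CPU-only training; -1 (as above) would use every available GPU.
tc.config.set_num_gpus(0)
model = tc.image_classifier.create(train_data, target='label', max_iterations=30)
print('training is done')
model.save('dsn.model')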

System info:
TuriCreate: 6.2
Memory: 32 GB
GPU: 7 GB, CUDA Version: 10.2

Part 1: here is the log in /var/log/:
kernel: Out of memory: Kill process 27755 (python) score 614 or sacrifice child

Part 2: here is the training log:

Logistic regression:
--------------------------------------------------------
Number of examples          : 33945
Number of classes           : 647
Number of feature columns   : 1
Number of unpacked features : 2048
Number of coefficients      : 1323654
Starting L-BFGS
--------------------------------------------------------
+-----------+----------+-----------+--------------+-------------------+---------------------+
| Iteration | Passes   | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |
+-----------+----------+-----------+--------------+-------------------+---------------------+
| 0         | 2        | 1.000000  | 105.199852   | 0.022802          | 0.023503            |
| 1         | 4        | 1.000000  | 241.801404   | 0.053439          | 0.051483            |
| 2         | 9        | 0.773445  | 535.247982   | 0.088526          | 0.093453            |
| 3         | 10       | 0.966806  | 620.098639   | 0.201974          | 0.198657            |
| 4         | 11       | 1.000000  | 705.081587   | 0.410458          | 0.383324            |
| 6         | 17       | 0.727341  | 1084.986714  | 0.543526          | 0.532177            |
| 7         | 18       | 0.909176  | 1169.897273  | 0.589866          | 0.576385            |
| 8         | 19       | 1.000000  | 1255.140668  | 0.623155          | 0.609961            |
| 9         | 20       | 1.000000  | 1339.704462  | 0.632936          | 0.623951            |
| 10        | 21       | 1.000000  | 1424.285111  | 0.646605          | 0.641298            |
| 11        | 22       | 1.000000  | 1509.028645  | 0.657416          | 0.647454            |
| 12        | 23       | 1.000000  | 1593.798588  | 0.673207          | 0.665921            |
| 13        | 25       | 1.000000  | 1730.773736  | 0.677537          | 0.666480            |
| 14        | 26       | 1.000000  | 1815.356839  | 0.705465          | 0.692781            |
| 15        | 27       | 1.000000  | 1900.395668  | 0.713713          | 0.697258            |
| 16        | 28       | 1.000000  | 1985.397270  | 0.739579          | 0.719082            |
| 17        | 29       | 1.000000  | 2070.702618  | 0.749389          | 0.729715            |
| 18        | 30       | 1.000000  | 2155.380468  | 0.758403          | 0.734191            |
| 19        | 31       | 1.000000  | 2240.224900  | 0.765768          | 0.734751            |
| 20        | 32       | 1.000000  | 2324.879639  | 0.777994          | 0.743145            |
| 21        | 33       | 1.000000  | 2409.892064  | 0.790691          | 0.749860            |
| 22        | 34       | 1.000000  | 2495.024232  | 0.799322          | 0.752658            |
| 23        | 35       | 1.000000  | 2579.768032  | 0.810311          | 0.758254            |
| 24        | 36       | 1.000000  | 2665.828144  | 0.818147          | 0.759933            |
| 25        | 37       | 1.000000  | 2750.568400  | 0.828693          | 0.767767            |
| 26        | 38       | 1.000000  | 2835.482493  | 0.838769          | 0.766088            |
| 27        | 39       | 1.000000  | 2920.270576  | 0.846045          | 0.770565            |
| 28        | 40       | 1.000000  | 3005.355743  | 0.855796          | 0.767208            |
| 29        | 41       | 1.000000  | 3090.212993  | 0.862336          | 0.774482            |
+-----------+----------+-----------+--------------+-------------------+---------------------+
[1]+ Killed

syoutsey commented 4 years ago

Hi @Lif0820 what OS are you using? How big is the dataset you're training on?

TobyRoseman commented 4 years ago

In addition to the above questions, please also try adding validation_set=None to your tc.image_classifier.create call. Let us know if that works.
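
Spelled out against the snippet in the original report, the suggested call would look roughly like this (train_data and the 'label' column are taken from that snippet):

model = tc.image_classifier.create(
    train_data,
    target='label',
    max_iterations=30,
    validation_set=None  # skip the automatic validation split, as suggested above
)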

Lif0820 commented 4 years ago

Hi @Lif0820 what OS are you using? How big is the dataset you're training on?

CentOS Linux release 7.3.1611 (Core)
The image dataset is about 430 MB, with 44,000+ images.

Lif0820 commented 4 years ago

In addition to the above questions, please also try adding validation_set=None to your tc.image_classifier.create call. Let us know if that works.

Thanks for your advice, I will try it tonight. I retried with 1/3 of the sample images last night and it finished successfully. But at the same step mentioned above (after iteration No. 30 finishes, but before the model is saved), only 600+ MB of system memory is left. I had set the runtime config TURI_FILEIO_MAXIMUM_CACHE_CAPACITY = 2 * 1024 * 1024 * 1024 (it defaults to about 8 GB on my server), but this change does not seem to help with my problem.
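
For reference, a sketch of how that runtime setting is usually applied in Turi Create; it assumes tc.config.set_runtime_config accepts this key on your build (the current value can be checked with tc.config.get_runtime_config()):

import turicreate as tc

# Cap the file I/O cache at 2 GiB instead of the ~8 GB default mentioned above.
tc.config.set_runtime_config('TURI_FILEIO_MAXIMUM_CACHE_CAPACITY',
                             2 * 1024 * 1024 * 1024)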

Lif0820 commented 4 years ago

Last night I tried adding validation_set=None to tc.image_classifier.create(), but it does not work either.

Lif0820 commented 4 years ago

I have a guess. I see code like this in the source: input_image_shape=(3, 224, 224) and dtype=np.float32. I have 44,000 images, so the memory needed to load them all is 44000 * 3 * 224 * 224 * 32 / 8 / 1024 / 1024 / 1024 ≈ 25 GB. I am not sure whether this guess is right.
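
Repeating that arithmetic as a quick sanity check (plain Python, nothing Turi-specific):

num_images = 44000
channels, height, width = 3, 224, 224   # input_image_shape
bytes_per_value = 4                     # np.float32 is 32 bits = 4 bytes
total_bytes = num_images * channels * height * width * bytes_per_value
print(total_bytes / 1024 ** 3)          # ~24.7, i.e. roughly 25 GB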

cd-slash commented 4 years ago

I'm experiencing the same problem, which occurs at exactly the same point in my GPU training loop on each attempt (initiated by model = tc.one_shot_object_detector.create(starter_images, 'label')). I am trying to train on 50 initial images, each a .png of about 48 KB.

I'm using Ubuntu 20.04, Python 3.7.7, Cuda 10.0, tensorflow-gpu 2.0.1. Running on a GTX 1080 Ti with ~90GB RAM available.

The script creates the training images as expected, then starts the GPU-based training. At some point after iteration 1,540 (of 35,000 max iterations as determined by turicreate) the script hangs for several minutes, then simply returns killed exactly as @Lif0820 has shown above:

Augmenting input images using 951 background images.
+------------------+--------------+------------------+
| Images Augmented | Elapsed Time | Percent Complete |
+------------------+--------------+------------------+
| 100              | 14.87s       | 1%               |
| 200              | 19.87s       | 2%               |
| 300              | 27.63s       | 3%               |
[truncated - completes successfully]

Using 'image' as feature column
Using 'annotation' as annotations column
Using a GPU to create model.
Setting 'batch_size' to 32
WARNING:tensorflow:From /home/<redacted>/.pyenv/versions/3.7.7/lib/python3.7/site-packages/tensorflow_core/python/compat/v2_compat.py:65: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Setting 'max_iterations' to 15000
+--------------+--------------+--------------+
| Iteration    | Loss         | Elapsed Time |
+--------------+--------------+--------------+
| 1            | 21.0111      | 12.28s       |
| 2            | 20.1127      | 13.54s       |
| 3            | 19.6274      | 14.67s       |

[truncated]

| 1540         | 1.1872       | 32m 15s      |
killed

Is this an out-of-memory issue, as OP has suggested? That seems plausible, since every attempt dies at exactly the same point in the loop (after iteration 1,540), but it also seems odd: I would have expected all images to be loaded into memory at the start of the training cycle, and with ~90GB of system memory I would have thought that would be plenty for around 50 training images.
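
One way to confirm the out-of-memory theory (a sketch, not something from this thread; it assumes the third-party psutil package is installed) is to log memory usage from the training process and then check dmesg for an "Out of memory: Kill process" line after the crash:

import os
import threading
import time

import psutil  # assumed installed: pip install psutil

def log_memory(interval_sec=30):
    # Periodically print this process's resident memory and the system's free memory.
    proc = psutil.Process(os.getpid())
    while True:
        rss_gb = proc.memory_info().rss / 1024 ** 3
        avail_gb = psutil.virtual_memory().available / 1024 ** 3
        print('[mem] rss=%.2f GB, available=%.2f GB' % (rss_gb, avail_gb))
        time.sleep(interval_sec)

# Start the logger as a daemon thread before calling create().
threading.Thread(target=log_memory, daemon=True).start()

If the available figure collapses right before the process dies, the OOM killer is the likely culprit.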

Any ideas?

TobyRoseman commented 4 years ago

@BoolHandLuke - I don't believe this is the same issue. @Lif0820 is doing image classification. You're doing one shot object detection. Please create a separate issue.

laisangbum commented 3 years ago

I got the same problem with a dataset of 700 images. The program gets stuck at:
Using 'image' as feature column
Using 'annotation' as annotations column
It seems the whole machine hangs; I cannot even ssh to the gcs instance.

TobyRoseman commented 3 years ago

@laisangbum - this GitHub issue is about the image classifier. It looks like you're doing one shot object detection. I recommend creating a new issue.