dicarlolab / archconvnets

Architecturally optimized neural networks trained with regularized backpropagation

Out of memory error for large feature layers #24

Closed. ardila closed this issue 10 years ago.

ardila commented 10 years ago

@yamins81 @cadieu This is the issue Charles mentioned. Here is an example (it crashes after the !!!! memory free error):

Running Command:

python extractnet.py --gpu=0 --mini=128 --test-range=6-8 --train-range=0 --data-provider=general-cropped --checkpoint-fs-port=29101 --checkpoint-fs-name=models --checkpoint-db=reference_models --load-query='{"experiment_data.experiment_id": "nyu_model"}' --feature-layer=conv3 --data-path=/export/storage/cadieu/issa_batches --dp-params='{"crop_border": 16, "meta_attribute": "id", "preproc": {"normalize": false, "dtype": "float32", "resize_to": [672, 672], "mode": "RGB", "mask": null, "crop":[264, 152, 520, 408]}, "batch_size": 128, "dataset_name": ["dldata.stimulus_sets.issa", "IssaPLStims"]}' --write-db=1 --write-disk=1

Loading checkpoint from database.
Loading checkpoint from regular storage.
([6, 7, 8], [0])
Can't import separate mcc package
Can't import asgd.
/home/cadieu/tabular/tabular/io.py:13: DeprecationWarning: The compiler package is deprecated and removed in Python 3.x.
  import compiler
('Found batches: ', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36])
('Batches needed: ', [])
16 3 256 224
('Found batches: ', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36])
('Batches needed: ', [])
16 3 256 224

loading layers from checkpoint

Importing _ConvNet C++ module
534c1ec6c626a8063eae8acd
Adaptive Drop Training : False [DEFAULT]
Check gradients and quit? : 0 [DEFAULT]
Compress checkpoints? : 0 [DEFAULT]
Conserve GPU memory (slower)? : 1
Convert given conv layers to unshared local :
Cropped DP: crop border size : 16
Cropped DP: logreg layer name (for --multiview-test) : [DEFAULT]
Cropped DP: test on multiple patches? : 0 [DEFAULT]
Data batch range: testing : 6-8
Data batch range: training : 0-0
Data for grouping results in database : {u'experiment_id': u'nyu_model'}
Data path : /export/storage/cadieu/issa_batches
Data provider : general-cropped
Data provider paramers : OrderedDict([(u'crop_border', 16), (u'meta_attribute', u'id'), (u'preproc', OrderedDict([(u'crop', [264, 152, 520, 408]), (u'dtype', u'float32'), (u'mask', None), (u'mode', u'RGB'), (u'normalize', False), (u'resize_to', [672, 672])])), (u'batch_size', 128), (u'dataset_name', [u'dldata.stimulus_sets.issa', u'IssaPLStims'])])
Enable image rotation and scaling transformation : False [DEFAULT]
Frequency for saving filters to db filesystem, as a multiple of testing-freq: 5
GPU override : 0
Host for Saving Checkpoints to DB : localhost [DEFAULT]
Image Size : 256
Layer definition file : /home/ardila/src/archconvnets/archconvnets/convnet/nyu_model/Zeiler2013.cfg
Layer parameter file : /home/ardila/src/archconvnets/archconvnets/convnet/nyu_model/layer-params.cfg
Learning Rate Scale Factor : 0.01
Load file : [DEFAULT]
Maximum save file size (MB) : 99999999
Minibatch size : 128
Name for gridfs FS for saved checkpoints : models
Name for mongodb database for saved checkpoints : reference_models
Number of GPUs : 1 [DEFAULT]
Number of channels in image : 3 [DEFAULT]
Number of epochs : 50000 [DEFAULT]
Port for Saving Checkpoints to DB : 29101
Query for loading checkpoint from database : OrderedDict([(u'experiment_data.experiment_id', u'nyu_model'), ('saved_filters', True)])
Random Seed : 0 [DEFAULT]
Reset layer momentum : False [DEFAULT]
Save checkpoints to mongo database? : 1
Test and quit? : 0 [DEFAULT]
Test on one batch at a time? : 1 [DEFAULT]
Testing frequency : 200
Unshare weight matrices in given layers :
Whether filp training image : 1
Write all data features from the dataset to mongodb in standard format : 1
Write features from given layer(s) : conv3
Write features to this path (to be used with --write-disk) : [DEFAULT]
Write test data features from --layer to --feature-path) : 1
Uploading batch 6 to database
Wrote feature file _conv3/data_batch_6
Uploading batch 7 to database
Wrote feature file _conv3/data_batch_7
!!!! memory free error

ardila commented 10 years ago

@cadieu I'm noticing you did not set the crop_border option in dp_params back to 0, which you should do. The model only accepts 224 by 224 images, which you are currently achieving by resizing, then cropping deterministically to 256, and then taking a random 224x224 crop from that.

Sorry, this should be more clear in the wiki, which I am updating now.
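
For reference, here is a minimal sketch of the size arithmetic (the variable names are mine, not the actual convdata.py API): with a crop border of 16, the provider takes the 256x256 batch images down to the 224x224 input that the checkpointed model expects.

    # Illustrative sketch only; these names are not the convdata.py API.
    image_size = 256              # size of the images stored in the data batches
    crop_border = 16              # "crop_border" in --dp-params
    inner_size = image_size - 2 * crop_border
    assert inner_size == 224      # the only input size the model accepts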

ardila commented 10 years ago

@cadieu I can try to look into this error and fix it, but I need to give myself access to the dataset on AWS. Is that alright?

yamins81 commented 10 years ago

It looks like the first batches were successfully computed. Is that correct? Does it crash at the same place every time?

cadieu commented 10 years ago

@ardila This line seems to indicate that under the testing condition, when trimming borders, the center crop is taken, not a random one: https://github.com/dicarlolab/archconvnets/blob/master/archconvnets/convnet/convdata.py#L342
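
For anyone following along, a hedged sketch of what that behavior amounts to (this is not the actual convdata.py code, just an illustration): at test time the provider takes the deterministic center crop inside the border, while during training it samples a random offset.

    import numpy as np

    def crop_image(img, crop_border, test=True, rng=np.random):
        """Illustrative only: img is H x W x C; returns an (H-2b) x (W-2b) x C crop."""
        h, w, _ = img.shape
        if test:
            # deterministic center crop under the test condition
            y = x = crop_border
        else:
            # random offset within the border during training
            y = rng.randint(0, 2 * crop_border + 1)
            x = rng.randint(0, 2 * crop_border + 1)
        return img[y:y + h - 2 * crop_border, x:x + w - 2 * crop_border, :]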

cadieu commented 10 years ago

I have not checked whether the crash is deterministic.
I just checked in some minor edits so that you can run this code.

The dataset is on S3 under the usual dicarlocox AWS credentials. It should just work with an updated dldata repo. I think ;)

ardila commented 10 years ago

I've now run into this issue as well with a different extraction, on a different machine (kraken4).

~/src/archconvnets/archconvnets/convnet$ python extractnet.py --test-range=0-112 --train-range=0 --data-provider=general-cropped --checkpoint-fs-port=29101 --checkpoint-fs-name=models --checkpoint-db=reference_models --load-query='{"experiment_data.experiment_id": "nyu_model"}' --feature-layer=pool5,fc6 --data-path=/export/storage2/ardila/SUN_attribute_batches --dp-params='{"crop_border": 16, "meta_attribute": "filename", "preproc": {"normalize": false, "dtype": "float32", "resize_to": [256, 256], "mode": "RGB", "mask": null, "crop":null}, "batch_size": 128, "dataset_name": ["dldata.stimulus_sets.SUNAttribute", "SUNAttribute"]}' --write-db=1 --write-disk=0

Loading checkpoint from database.
Loading checkpoint from regular storage.
([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112], [0])
Can't import separate mcc package
Can't import asgd.
Can't import bangmetric
/home/ardila/src/tabular/tabular/io.py:13: DeprecationWarning: The compiler package is deprecated and removed in Python 3.x.
  import compiler
('Found batches: ', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112])
('Batches needed: ', [])
16 3 256 224
('Found batches: ', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112])
('Batches needed: ', [])
16 3 256 224

loading layers from checkpoint

Importing _ConvNet C++ module
534c1ec6c626a8063eae8acd
Adaptive Drop Training : False [DEFAULT]
Check gradients and quit? : 0 [DEFAULT]
Compress checkpoints? : 0 [DEFAULT]
Conserve GPU memory (slower)? : 1
Convert given conv layers to unshared local :
Cropped DP: crop border size : 16
Cropped DP: logreg layer name (for --multiview-test) : [DEFAULT]
Cropped DP: test on multiple patches? : 0 [DEFAULT]
Data batch range: testing : 0-112
Data batch range: training : 0-0
Data for grouping results in database : {u'experiment_id': u'nyu_model'}
Data path : /export/storage2/ardila/SUN_attribute_batches
Data provider : general-cropped
Data provider paramers : OrderedDict([(u'crop_border', 16), (u'meta_attribute', u'filename'), (u'preproc', OrderedDict([(u'crop', None), (u'dtype', u'float32'), (u'mask', None), (u'mode', u'RGB'), (u'normalize', False), (u'resize_to', [256, 256])])), (u'batch_size', 128), (u'dataset_name', [u'dldata.stimulus_sets.SUNAttribute', u'SUNAttribute'])])
Enable image rotation and scaling transformation : False [DEFAULT]
Frequency for saving filters to db filesystem, as a multiple of testing-freq: 5
GPU override : 0
Host for Saving Checkpoints to DB : localhost [DEFAULT]
Image Size : 256
Layer definition file : /home/ardila/src/archconvnets/archconvnets/convnet/nyu_model/Zeiler2013.cfg
Layer parameter file : /home/ardila/src/archconvnets/archconvnets/convnet/nyu_model/layer-params.cfg
Learning Rate Scale Factor : 0.01
Load file : [DEFAULT]
Maximum save file size (MB) : 99999999
Minibatch size : 256
Name for gridfs FS for saved checkpoints : models
Name for mongodb database for saved checkpoints : reference_models
Number of GPUs : 1 [DEFAULT]
Number of channels in image : 3 [DEFAULT]
Number of epochs : 50000 [DEFAULT]
Port for Saving Checkpoints to DB : 29101
Query for loading checkpoint from database : OrderedDict([(u'experiment_data.experiment_id', u'nyu_model'), ('saved_filters', True)])
Random Seed : 0 [DEFAULT]
Reset layer momentum : False [DEFAULT]
Save checkpoints to mongo database? : 1
Test and quit? : 0 [DEFAULT]
Test on one batch at a time? : 1 [DEFAULT]
Testing frequency : 200
Unshare weight matrices in given layers :
Whether filp training image : 1
Write all data features from the dataset to mongodb in standard format : 1
Write features from given layer(s) : pool5,fc6
Write features to this path (to be used with --write-disk) : [DEFAULT]
Write test data features from --layer to --feature-path) : 0
Uploading batch 0 to database
Uploading batch 1 to database
Uploading batch 2 to database
Uploading batch 3 to database
Uploading batch 4 to database
Uploading batch 5 to database
Uploading batch 6 to database
Uploading batch 7 to database
!!!! memory free error

ardila commented 10 years ago

My only hunch so far is that this has something to do with the fact that this model was trained with large batches (256), and somehow having to load the smaller batches separately is leading to problems. I'm going to try rewriting my batches with a batch size of 256 and see if that fixes the issue.

ardila commented 10 years ago

I can confirm that the crash happens at the same point every time (2 times now). Also, I can't see any memory problems with nvidia-smi or htop.
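
For anyone who wants to watch GPU memory alongside the run, something like this works (the --query-gpu flags are the newer nvidia-smi query interface and may not exist on older drivers, in which case parse plain nvidia-smi output instead):

    import subprocess
    import time

    # Print GPU memory usage every few seconds while extractnet.py runs.
    while True:
        out = subprocess.check_output(
            ['nvidia-smi', '--query-gpu=memory.used,memory.total',
             '--format=csv,noheader'])
        print time.strftime('%H:%M:%S'), out.strip()
        time.sleep(5)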

yamins81 commented 10 years ago

Maybe load the image data for that batch by hand and see if something about the images or labels jumps out. If nothing jumps out, try extracting just the data layer itself, as opposed to conv3, and see if that works. If it works, keep going up the layers until it breaks. Hopefully this will be diagnostic.
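
Something along these lines would do the by-hand inspection (assuming the batches are cuda-convnet-style pickled dicts with 'data' and 'labels' keys; adjust the path and keys to whatever the general-cropped provider actually writes):

    import cPickle
    import numpy as np

    # Hypothetical path: point this at the batch where the crash happens.
    with open('/export/storage2/ardila/SUN_attribute_batches/data_batch_7', 'rb') as f:
        batch = cPickle.load(f)

    data = np.asarray(batch['data'])              # 'data' key is an assumption
    labels = np.asarray(batch['labels']).ravel()  # 'labels' key is an assumption
    print data.shape, data.dtype, data.mean(), data.var()
    print 'NaNs in data:', np.isnan(data).sum()
    print 'label range:', labels.min(), labels.max()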

ardila commented 10 years ago

Note that these are different datasets and different layers (I'm trying pool4) that I am attempting to extract.

yamins81 commented 10 years ago

Sorry, I'm on a mobile device. They both crashed right after batch 7, so it was natural to think it was the same dataset.

Still, I would start by looking at the data batch itself. Then, if you don't see anything obvious, try just "extracting" the data layer itself in that batch. If that doesn't flush out the problem, I'd change the batch size. If that fails to solve things, restart the machines.

ardila commented 10 years ago

I agree, this is a very weird bug; I wasn't even trying to recreate it when I ran into it again.

I will try to take a deeper look at this tomorrow, but for now I'm just going to increase the batch size, since that has worked for me for various datasets already. Note: pool4 doesn't exist (oops); I meant pool5.

ardila commented 10 years ago

Confirming that changing to a batch size of 256 did not fix this issue; now trying to write out just the data layer.

ardila commented 10 years ago

I have the same problem when extracting the data layer. With 256, the error happens after roughly the same number of images (batch 3 instead of batch 7). I looked into the batch where the error happens and nothing sticks out: same mean and variance for the data, and the labels are exactly what you would expect. It seems that some slightly more recent commit has somehow broken extractnet.py, since this is the same model and the same parameters that I had already used to extract HvM and ImageNet. I am double-checking this now by trying to re-extract just the first few batches of ImageNet.

@yamins81 @daseibert Any ideas what could be causing this?

yamins81 commented 10 years ago

Can you write out other sets of (more) batches? Does it always break after the same number of images/batches? Or is it something about that specific batch?

ardila commented 10 years ago

It doesn't seem to be specific to this batch, since Charles was running a different dataset. Charles also told me that he had success by just looping and doing a small number of batches at a time, but I can try to confirm this. It seems to break after the same number of images (batch 7 when the batch size was 128, batch 3 when the batch size was 256).
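
For reference, a sketch of that looping workaround (same flags as the command earlier in this issue; only --test-range changes per call, and the chunk size is arbitrary):

    import subprocess

    # Extract a few batches per extractnet.py invocation instead of all at once.
    n_batches = 113   # SUN batches 0-112
    step = 4          # arbitrary small chunk size
    for start in xrange(0, n_batches, step):
        stop = min(start + step - 1, n_batches - 1)
        subprocess.check_call([
            'python', 'extractnet.py',
            '--test-range=%d-%d' % (start, stop),
            '--train-range=0',
            '--feature-layer=pool5,fc6',
            # ... plus the same --data-provider / --dp-params / --checkpoint-*
            # flags used in the full command above ...
        ])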

ardila commented 10 years ago

I also don't seem to be having this problem when re-extracting the first 8 batches of ImageNet.

yamins81 commented 10 years ago

I mean, if you start somewhere else in the dataset, e.g. some later range of batches, does the error happen at all? Does it happen at the same relative point (e.g. after 3 batches)? Or is it random?

ardila commented 10 years ago

Charles has tried starting at a later point and has not had a problem. I will try to confirm this as a next step. It happens at the same relative point with respect to images (batch 3 when the batch size was 256, batch 7 when the batch size was 128).

ardila commented 10 years ago

OK, so I checked, and found that I can't extract batch 3 (with 256-image batches) even if it is processed first, and even if I only ask for the data layer. I have loaded batch_0 to try and compare, but I still see nothing obvious: mean, variance, and shape are what you would expect.

ardila commented 10 years ago

It looks like using a meta field that has fewer unique labels solves this issue, though it is not clear why. In order to run my extraction, I added a field called "dummy" that is just the string 'dummy' for every image.

@yamins81 What would be a good solution to this more generally? Often when writing batches for extraction, the meta field is actually irrelevant and shouldn't need to be specified; it only seems useful for training or testing models with convnet.py.
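
For reference, a minimal sketch of the dummy-field workaround described above, assuming a dldata-style metadata recarray (the exact dataset API may differ); after adding the column, point "meta_attribute" in --dp-params at "dummy":

    import numpy as np
    from numpy.lib import recfunctions

    # Toy stand-in for the dataset's metadata recarray (the real one comes from dldata).
    meta = np.array([('img_000.png',), ('img_001.png',)],
                    dtype=[('filename', 'S32')])
    # Append a constant-valued column so the provider sees a single unique label.
    dummy_col = np.array(['dummy'] * len(meta))
    meta = recfunctions.append_fields(meta, 'dummy', dummy_col, usemask=False)
    print meta['dummy']   # -> ['dummy' 'dummy']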

ardila commented 10 years ago

Looks like this is solved by just creating a dummy meta field that does not have too many unique entries.