Open den-run-ai opened 4 years ago
My following comments assume you're NOT using the Python front end (I can't answer whether or not it's deprecated).
None of the following depends on the mini-batch size.
First, without using the data_store (this is the default, which you get if you don't include one of the command-line flags --preload_data_store, --use_data_store, or --data_store_cache): each image is read from file when it's requested, so each image is read from file once per epoch. You should probably use the data_store for better performance.
Using the data_store:
--use_data_store: each image is read from file the first time it's requested (during the 1st epoch) and is then cached in memory in the data_store.
--preload_data_store: all images are read from file into memory before the 1st epoch begins, then cached in memory in the data_store.
With both of the preceding flags, each rank 'owns' a portion of the data, and there is an MPI-based data-exchange function call at the beginning of each epoch (except the first, if --use_data_store).
--data_store_cache: each node keeps a copy of the entire data set in a shared memory segment. In this scenario, as with --preload_data_store, all data is read prior to the 1st epoch; however, there are no subsequent data-exchange calls.
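To make the differences between the modes concrete, here is a small toy model in Python. It is not LBANN code: the function name simulate, the mode strings, and the bookkeeping are all made up for illustration. It only counts when file reads and data-exchange calls would happen according to the description above, ignoring ranks, mini-batches, and shuffling.

```python
# Toy model of the loading modes described above. NOT LBANN code:
# `simulate` and the mode strings are hypothetical names for illustration.

def simulate(mode, num_images, num_epochs):
    """Return (reads before epoch 1, reads per epoch, exchange calls).

    mode is one of: "none" (default, no data_store), "use_data_store",
    "preload_data_store", "data_store_cache".
    """
    cached = set()
    preload_reads = 0
    if mode in ("preload_data_store", "data_store_cache"):
        # Everything is read from file before the first epoch starts.
        cached = set(range(num_images))
        preload_reads = num_images

    reads_per_epoch = []
    exchanges = 0
    for epoch in range(num_epochs):
        # Exchange at the start of each epoch, except the first epoch
        # when using use_data_store; never for data_store_cache or "none".
        if mode == "preload_data_store":
            exchanges += 1
        elif mode == "use_data_store" and epoch > 0:
            exchanges += 1

        reads = 0
        for img in range(num_images):
            if img not in cached:
                reads += 1              # read from file on request
                if mode != "none":
                    cached.add(img)     # keep it in memory for later epochs
        reads_per_epoch.append(reads)

    return preload_reads, reads_per_epoch, exchanges


if __name__ == "__main__":
    for m in ("none", "use_data_store", "preload_data_store", "data_store_cache"):
        print(m, simulate(m, num_images=4, num_epochs=3))
```

In this model the default mode re-reads every image each epoch, while the three data_store modes touch each file only once; they differ in when that read happens and in whether an exchange step follows.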
Which gives the best performance (meaning the shortest execution time)? We haven't really benchmarked, but here are some guidelines (note: there's some speculation here on my part):
--data_store_cache should be fastest, IF you have sufficient memory.
I suspect that --use_data_store may be a bit faster than --preload_data_store, but if you're running a large number of epochs they should perform about equally.
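The guidelines above could be written down as a rough rule of thumb, sketched below in Python. This is only an illustration: suggest_flag is a hypothetical helper, not part of LBANN, and it assumes that the in-memory size of the data set is about dataset_bytes, that all of a node's memory is available for data, and that the partitioned modes need the data set to fit in the combined memory of all nodes. None of this has been benchmarked.

```python
# Hypothetical helper, NOT part of LBANN. Encodes the unbenchmarked
# guidelines above under simplifying assumptions (no memory headroom,
# in-memory size equals dataset_bytes).

def suggest_flag(dataset_bytes, mem_per_node_bytes, num_nodes):
    """Return a suggested command-line flag, or None for the default."""
    if dataset_bytes <= mem_per_node_bytes:
        # Every node can hold a full copy in shared memory.
        return "--data_store_cache"
    if dataset_bytes <= mem_per_node_bytes * num_nodes:
        # Each rank owns only a portion, so the data is spread out.
        # --preload_data_store should perform about the same over many epochs.
        return "--use_data_store"
    # Assumption: if the data does not fit even when partitioned,
    # fall back to the default (read from file every epoch).
    return None


if __name__ == "__main__":
    GiB = 1024 ** 3
    print(suggest_flag(150 * GiB, 256 * GiB, 4))
    print(suggest_flag(150 * GiB, 64 * GiB, 4))
    print(suggest_flag(150 * GiB, 16 * GiB, 4))
```

In practice you would want to leave headroom for the model and framework, so treat the thresholds as optimistic.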
Re "hooks for parallel file systems" (e.g., Lustre): there's nothing that I can think of.
What is the data loading strategy in the data reader? For example, do you preload some input images from ImageNet, or is all data loaded on demand depending on the batch size? Are there any configuration parameters to tweak this behavior? Are there any hooks for parallel file systems? Also, is the prototext frontend deprecated for vision models?
https://github.com/LLNL/lbann/blob/develop/model_zoo/data_readers/data_reader_imagenet.prototext
https://github.com/LLNL/lbann/blob/develop/python/lbann/models/resnet.py