PurvangL opened this issue 2 years ago
Hey @PurvangL,
could you elaborate on how you want to crop those images/labels?
Cropping without normalizing can be achieved with the fn.crop operator. It has the same set of crop-related arguments as fn.crop_mirror_normalize.
I guess you want to set the crop argument to [args.crop_height, args.crop_width]. If you want to randomize the position of the crop, you can use fn.random.uniform operators and pass their outputs to the crop_pos_x and crop_pos_y arguments of fn.crop_mirror_normalize and fn.crop.
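For illustration, a minimal sketch of that approach (assuming a directory readable by fn.readers.file; the 512x512 crop size stands in for args.crop_height/args.crop_width):

```python
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def

@pipeline_def(batch_size=8, num_threads=4, device_id=0)
def random_crop_pipeline(image_dir):
    encoded, _ = fn.readers.file(file_root=image_dir, random_shuffle=True)
    images = fn.decoders.image(encoded, device='mixed', output_type=types.RGB)
    # Random window position in [0, 1]; feeding the same values to fn.crop on
    # both images and labels keeps the two aligned.
    pos_x = fn.random.uniform(range=(0.0, 1.0))
    pos_y = fn.random.uniform(range=(0.0, 1.0))
    images = fn.crop(images, crop=[512, 512],
                     crop_pos_x=pos_x, crop_pos_y=pos_y)
    return images
```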
Hi @banasraf, thank you for your reply. I have modified the function as below to apply cropping to both images and labels.
@pipeline_def(batch_size=args.batch_size)
def data_pipeline(shard_id):
    pngs = fn.readers.caffe2(
        path=os.path.join(args.dataset, "train", "images"), random_shuffle=True, shard_id=shard_id, num_shards=args.num_gpus)
    labels = fn.readers.caffe2(
        path=os.path.join(args.dataset, "train", "labels"), random_shuffle=True, shard_id=shard_id, num_shards=args.num_gpus)
    images = fn.decoders.image(pngs, device='mixed', output_type=types.RGB)
    labels = fn.decoders.image(labels, device='mixed', output_type=types.RGB)
    # images = fn.crop_mirror_normalize(
    #     images, dtype=types.FLOAT, std=[255.], output_layout="CHW")
    images = fn.crop(images, crop=[args.crop_height, args.crop_width])
    labels = fn.crop(labels, crop=[args.crop_height, args.crop_width])
    # return images, labels.gpu()
    return images, labels
I am getting the following error when I run the program.
Traceback (most recent call last):
  File "distributed_training/distributed_training_dali.py", line 344, in <module>
    train_dataset = strategy.distribute_datasets_from_function(dataset_fn, input_options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1160, in distribute_datasets_from_function
    dataset_fn, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 597, in _distribute_datasets_from_function
    options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 168, in get_distributed_datasets_from_function
    input_contexts, dataset_fn, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 1566, in __init__
    input_contexts, self._input_workers, dataset_fn))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 2301, in _create_datasets_from_function_with_input_context
    dataset = dataset_fn(ctx)
  File "distributed_training/distributed_training_dali.py", line 328, in dataset_fn
    device_id=device_id)
  File "/usr/local/lib/python3.6/dist-packages/nvidia/dali/plugin/tf.py", line 768, in __init__
    if _has_external_source(pipeline):
  File "/usr/local/lib/python3.6/dist-packages/nvidia/dali/external_source.py", line 787, in _has_external_source
    pipeline._build_graph()
  File "/usr/local/lib/python3.6/dist-packages/nvidia/dali/pipeline.py", line 615, in _build_graph
    raise TypeError(f"Illegal pipeline output type. The output {i} contains a nested "
TypeError: Illegal pipeline output type. The output 0 contains a nested `DataNode`. Missing list/tuple expansion (*) is the likely cause.
Could you please let me know how I can fix this? Thank you
@banasraf Is there any additional information I need to provide? Please let me know. Thank you
@PurvangL
Okay, I managed to find the issue. The caffe2 reader returns files and labels by default, so the output of the operator is a pair (files, labels). This means that in your case images is a pair of DataNodes.
I guess that in your case you don't have any numeric labels for your images, so you might pass label_type=4 to the reader (see the documentation of the operator). This way the operator will return the images only.
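As a sketch of that change (keeping the names from the snippet above; per the note, with label_type=4 the reader is expected to return only the image data, i.e. a single DataNode instead of a (data, label) pair):

```python
pngs = fn.readers.caffe2(
    path=os.path.join(args.dataset, "train", "images"),
    label_type=4,                    # no numeric labels; return image data only
    random_shuffle=True,
    shard_id=shard_id,
    num_shards=args.num_gpus)
images = fn.decoders.image(pngs, device='mixed', output_type=types.RGB)
```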
@PurvangL
Also, please make sure that the order of labels and images is the same between those two LMDB files. The two readers have the same random seed in your case, so they should generate the same order with random_shuffle, but only assuming the same source order in the datasets.
Hi @banasraf, thank you for your reply. This is the simplest example I am trying to get working. After adding your suggestions, I still cannot run it without errors. Below is the link to the example I am trying to run; the tf_dali.py file contains all the information, including the dependencies to install and the command to run.
https://drive.google.com/drive/folders/1iYk1wIRupy34z6bixHdErMX8859lF-UD?usp=sharing
Please let me know if any additional information is needed.
Regards, Purvang
Hi @PurvangL,
I think the file reader operator is the one you are looking for:
@pipeline_def(batch_size=args.batch_size)
def data_pipeline(shard_id):
    images_path = os.path.join(args.dataset, "train", "images")
    labels_path = os.path.join(args.dataset, "train", "labels")
    images_files = os.listdir(images_path)
    labels_files = os.listdir(labels_path)
    images_files = [os.path.join(images_path, f) for f in images_files]
    labels_files = [os.path.join(labels_path, f) for f in labels_files]
    pngs, _ = fn.readers.file(
        files=images_files,
        random_shuffle=True,
        shard_id=shard_id,
        num_shards=args.num_gpus,
        seed=SEED)
    labels, _ = fn.readers.file(
        files=labels_files,
        random_shuffle=True,
        shard_id=shard_id,
        num_shards=args.num_gpus,
        seed=SEED)
    images = fn.decoders.image(pngs, device='mixed', output_type=types.RGB)
    labels = fn.decoders.image(labels, device='mixed', output_type=types.RGB)
    images = fn.crop(images, crop=[args.crop_height, args.crop_width])
    labels = fn.crop(labels, crop=[args.crop_height, args.crop_width])
    return images, labels
Just make sure that the order of files inside images_files and labels_files matches.
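One simple way to guarantee that order (a sketch, assuming image and label files share the same base names, e.g. 0001.png in both directories):

```python
images_files = sorted(os.listdir(images_path))
labels_files = sorted(os.listdir(labels_path))
images_files = [os.path.join(images_path, f) for f in images_files]
labels_files = [os.path.join(labels_path, f) for f in labels_files]
```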
Thank you @JanuszL. I was able to start training with the changes, but there are two scenarios I am having difficulties with.
1) Training time: after adding the DALI pipeline and training on multiple GPUs, each iteration and the overall training time increased 5x compared to the tf.data.Dataset input pipeline. What could be wrong here?
2) Batch size: with the tf.data.Dataset input pipeline, I was able to feed a global batch size of 162 for 2 GPUs. Now I can only use a batch size of 6, and if I increase it, it throws OOM. Is the batch size of 6 local, i.e. per GPU, making the global batch size 12?
Thank you, Purvang
Hi @PurvangL,
1) I would check the epoch length. Maybe there is some issue with setting num_shards and shard_id in DALI, so each GPU goes over the whole dataset instead of only the shard that should be assigned to it.
2) The batch size you set in DALI is per GPU. It is expected that moving data processing to the GPU will limit the memory available for the model, hence you may want to reduce the batch size; however, the overall throughput may still increase in some cases.
If you suspect something is wrong with the overall performance, you may check GPU utilization first and then capture a profile.
@JanuszL Thanks for your reply. Using DALI, there are 59 iterations to complete each epoch when the batch size is 50 (per GPU, based on your last message), so the global batch size is 400 and 8 GPUs are used. Using tf.data.Dataset with a global batch size of 400 (per-GPU batch size 50), only 7 iterations were needed to complete one epoch. How can I correct this?
@PurvangL,
Please make sure that shard_id and num_shards are set accordingly for each GPU. If you can share your script or some minimal repro that just iterates over the dataset, I can check it.
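For reference, a sketch of deriving the shard from the tf.distribute input context inside the dataset function (data_pipeline and args come from the snippets above; the per-replica input options follow the pattern shown in the DALI TF plugin documentation, so treat the exact option names as an assumption for your TF version):

```python
import tensorflow as tf
import nvidia.dali.plugin.tf as dali_tf

def dataset_fn(input_context):
    # With per-replica input replication this function is called once per GPU,
    # so input_pipeline_id can serve as both shard_id and device_id.
    device_id = input_context.input_pipeline_id
    with tf.device("/gpu:{}".format(device_id)):
        pipe = data_pipeline(shard_id=device_id,
                             num_threads=4, device_id=device_id)
        return dali_tf.DALIDataset(
            pipeline=pipe,
            batch_size=args.batch_size,   # per-GPU batch size
            output_shapes=((None, None, None, 3), (None, None, None, 3)),
            output_dtypes=(tf.uint8, tf.uint8),
            device_id=device_id)

input_options = tf.distribute.InputOptions(
    experimental_place_dataset_on_device=True,
    experimental_fetch_to_device=False,
    experimental_replication_mode=tf.distribute.InputReplicationMode.PER_REPLICA)

train_dataset = strategy.distribute_datasets_from_function(dataset_fn, input_options)
```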
Sure @JanuszL
https://drive.google.com/drive/folders/1iYk1wIRupy34z6bixHdErMX8859lF-UD?usp=sharing
This link has both the script and the dataset. Please let me know if anything is missing. Thank you
@PurvangL,
I think you need to adjust steps_per_epoch based on the batch size per GPU and the number of GPUs.
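In other words (a back-of-the-envelope sketch with a hypothetical dataset size):

```python
import math

dataset_size = 2800          # hypothetical number of training images
per_gpu_batch_size = 50
num_gpus = 8

# Each GPU only sees its own shard, so one epoch is the dataset size divided
# by the global batch size (per-GPU batch size times the number of GPUs).
steps_per_epoch = math.ceil(dataset_size / (per_gpu_batch_size * num_gpus))
print(steps_per_epoch)       # 7, not dataset_size / per_gpu_batch_size = 56
```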
Thank you for your reply @JanuszL. With this change, I was able to get the correct number of iterations per epoch. I am using 2 x A100 40GB NVIDIA GPUs, and with the DALI pipeline I am only able to get a per-GPU batch size of 4 (global batch size 8), whereas with tf.data I can use a global batch size of 421, and if the dataset were bigger I could go up to 2600 with an image resolution of (128, 128).
My question is: why is there such a big difference in the maximum batch size when using DALI?
In the Google Drive link below I have uploaded my test results for the CamVid dataset comparing the DALI and tf.data pipelines. DALI still seems a bit slower, and I see a difference in val_mean_iou as well. What could be the reason?
https://drive.google.com/drive/folders/1iYk1wIRupy34z6bixHdErMX8859lF-UD?usp=sharing
Please let me know if any additional information is needed.
Thank you
Maybe TensorFlow competes with DALI for the GPU memory. Can you try limiting TF to use only as much memory as it needs?
@JanuszL Thank you for your reply.
After some trial and error, I limited TensorFlow to 4GB of memory on each GPU (out of 40GB) when using the DALI data loading pipeline. The maximum batch size I am able to run with the tf.data pipeline is 650 (global) on 4 x A100 40GB GPUs, and the maximum with the DALI pipeline is 64 (global) on the same hardware.
Also, when I check GPU memory usage with "nvidia-smi", the DALI pipeline uses 6867 MiB out of 40GB on each GPU, while the tf.data pipeline uses 39739 MiB out of 40GB. I have attached my test results below.
Is there anything I am missing, or any suggestions for improving training time?
Thank you
Hi @PurvangL,
I added:
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
at the very beginning of the training script. I can run batch_size 128 with 12GB of GPU memory. Can you try this out on your side?
Thank you @JanuszL. First of all, based on your previous reply about TensorFlow competing for GPU memory: tf.config.experimental.set_memory_growth(gpu, True) was already present, and I changed it to limit the GPU memory for each GPU using
tf.config.set_logical_device_configuration(gpu, [tf.config.LogicalDeviceConfiguration(memory_limit=4000)])
I was also able to fit a global batch size of 300 using DALI (150 per GPU, 2 GPUs in total) and trained on the Cityscapes dataset; the training time for 5 epochs is 2m6.087s. On the other hand, I was able to use a global batch size of 700 with tf.data, with a training time of 1m31.264s. I am trying to understand why training with DALI data loading is slower. What improvements do I need to make to train faster?
Thank you
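For completeness, the two ways of keeping TensorFlow from grabbing the whole GPU that are discussed above, as a sketch (the 4000 MB cap is just the value used here; the configuration has to run before TensorFlow initializes the GPUs):

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')

# Option 1: grow TF's allocation on demand, leaving the rest of the memory for DALI.
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# Option 2 (instead of option 1): hard-cap TF to a fixed budget per GPU.
# for gpu in gpus:
#     tf.config.set_logical_device_configuration(
#         gpu, [tf.config.LogicalDeviceConfiguration(memory_limit=4000)])
```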
Hi @PurvangL,
Based on https://github.com/NVIDIA/DALI/issues/4179#issuecomment-1238613846 I thought that you could not use a batch size greater than 16. One thing you may want to check is this blog post, so you can learn whether your training is really limited by the data processing. Also, your datasets consist of PNG images, and DALI doesn't provide GPU acceleration for them (we use nvJPEG and nvJPEG2000 to GPU-accelerate JPEG and JPEG2000 decoding, but there is no GPU-accelerated PNG decoder). So maybe in your case the GPU is already loaded with work, and even moving the crop operation there doesn't provide much value. Also, did you use the same number of CPU threads in DALI and tf.data (so we know that we are comparing apples to apples)?
@JanuszL, thank you for the answer. Point noted regarding the image format.
https://drive.google.com/drive/folders/1iYk1wIRupy34z6bixHdErMX8859lF-UD?usp=sharing
This Google Drive location contains both scripts, one with DALI and the other with tf.data.
My goal is to see whether using DALI can reduce training time or not.
Note: I have duplicated the images to test with a higher batch size, as my main goal is to reduce training time. I have not updated the Google Drive dataset with the duplicated images, and it still contains ~421 training images.
Please let me know if anything is needed from my end.
Hi @PurvangL,
Can you capture the profile according to this guide and see what the timeline looks like?
Hi @JanuszL, I have rerun both scripts with profiling and added the profiling data to the Google Drive link mentioned above. Could you please take a look and share your analysis and recommended next steps?
Thank you
@PurvangL,
The main difference I see is that TF uses all available CPU cores while DALI sticks to the default value in your case. You can use num_threads in the DALIDataset to assign more threads to DALI and speed up the PNG decoding (which is CPU-bound). The rest of the processing is almost negligible, and there is little speed-up from moving it to the GPU (in your case it is only the crop).
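A sketch of passing a larger num_threads through the DALIDataset (names follow the earlier snippets; the right thread count is workload-dependent, as the rest of this thread shows):

```python
import multiprocessing
import tensorflow as tf
import nvidia.dali.plugin.tf as dali_tf

num_threads = min(16, multiprocessing.cpu_count())   # hypothetical starting point
pipe = data_pipeline(shard_id=device_id,
                     num_threads=num_threads, device_id=device_id)
train_dataset = dali_tf.DALIDataset(
    pipeline=pipe,
    batch_size=args.batch_size,
    num_threads=num_threads,         # CPU threads available for PNG decoding
    output_shapes=((None, None, None, 3), (None, None, None, 3)),
    output_dtypes=(tf.uint8, tf.uint8),
    device_id=device_id)
```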
@JanuszL, indeed that solved the problem. So in summary, I was trying to compare 1) a 4 x A100 (80GB each) PCIe machine and 2) a 4 x A100 (80GB each) SXM4 machine in terms of training and inference time. Using the MLPerf submission by NVIDIA, the SXM machine was definitely faster, but when I used other algorithms not submitted to MLPerf, the result was the other way around. Analyzing the profiling results, the SXM machine was taking more time for kernel launches (TF profiling) and memcpy operations (ncu profiling) (any idea why?), which I felt could be the reason. Integrating DALI into the algorithm did help and made training on the SXM machine faster. Are there any other tricks or features you would recommend to use the SXM machine to its full potential?
Related issue:
https://forums.developer.nvidia.com/t/file-not-found-cuvectoraddmulti-exe/227440/2
Thank you
Hi @PurvangL,
Can you check if the CPU and GPU clocks are set the same way on both machines? For example, this GPU talk can provide some guidance.
@JanuszL Thank you for the reply, I will. Meanwhile, I am facing a new problem after integrating DALI: I am getting poor results (both training time and mean_iou) when increasing the batch size or the number of GPUs used for training. My test results are below. I am linearly scaling the learning rate with the number of GPUs.
What could be the reason?
Thanks
Hi @PurvangL,
Can you confirm that the number of iterations decreases with the increase of the batch size or the number of GPUs? Can you also do a dry run to make sure that all dataset samples are returned during one epoch?
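A possible dry run, as a sketch (iterating the per-GPU DALIDataset of one shard, with train_dataset and steps_per_epoch as defined earlier):

```python
import tensorflow as tf

# Count how many samples one GPU's shard actually yields in one epoch.
samples_seen = 0
for images, labels in train_dataset.take(steps_per_epoch):
    samples_seen += int(tf.shape(images)[0])
print("samples seen on this shard:", samples_seen)
# Summed over all shards, this should match the dataset size
# (up to padding of the last, partial batch).
```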
Hi @JanuszL, yes, I confirm that with increasing batch size and number of GPUs the number of iterations does reduce linearly. I have two questions here, for the case where I use num_threads = cpu_count(): 1) on the same machine, increasing the batch size together with the number of GPUs takes more time for the same dataset and the same number of epochs (and I confirm the iteration count reduces linearly with the increase in batch size and GPUs); 2) a machine with 2x the GPU memory and double the batch size takes more time.
I have attached my results below.
When I use num_threads = 64 the results do change, but 8 GPUs are still a concern.
Do I need to find an optimal num_threads?
Thank you
Hi @PurvangL,
Can you check what the iteration time is? How does it change with the number of GPUs (I would expect it to remain similar)?
Can you also tell me what is in the Total_Epochs column? I would expect the number of epochs (the number of passes over the whole dataset) to remain the same regardless of the number of GPUs used and the batch size.
Hi @JanuszL, thank you for your reply. I am currently facing the issue of longer run times when increasing the number of GPUs and the batch size on the 8 x A100 80GB server; apart from this, all servers (PCIe, SXM) give results as expected.
Iteration time: 2 GPUs: 7s, 4 GPUs: 5s, 8 GPUs: 5s.
Shouldn't the 8-GPU iteration time be lower than the 4-GPU one?
Hi @PurvangL,
I would expect that the iteration time would slightly increase when you increase the number of GPUs (synchronization and data exchange overhead), but the total number of iterations decreases and that is the gain. I would capture the profiles from both cases and compare them.
Hi @JanuszL,
I have captured the profiling data for both 4-GPU and 8-GPU training and attached the results here:
https://drive.google.com/drive/folders/18ZU6xqZP-zMG4QyvDkNo69Jntygo__gh?usp=sharing
The 8-GPU run seems highly input-bound compared to the 4-GPU one. Why, and how can this be reduced?
Note: both results were taken on the same machine, with the same code, dataset, and hyperparameters (except lr), and with num_threads = 32 (cpu_cores: 256).
Thank you
I see that the iteration time is roughly the same, ~260ms. The test phase (I think this is the validation) differs significantly: 1s with 8 GPUs vs 250ms with 4 GPUs. Also, the time between the end of training in an epoch and the validation phase is proportional to the number of GPUs, and in the profile I see that it is spent instantiating the DALI tf.data iterator. Have you tried using Horovod to run the multi-GPU training (so there is no unnecessary serial load proportional to the number of GPUs, since each process needs to serve only one GPU)?
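A rough sketch of the Horovod layout suggested here (one process per GPU, each building a single DALI pipeline; with Horovod, num_shards inside the pipeline should be hvd.size() rather than args.num_gpus):

```python
import horovod.tensorflow.keras as hvd
import tensorflow as tf
import nvidia.dali.plugin.tf as dali_tf

hvd.init()

# Pin each process to exactly one GPU.
gpus = tf.config.list_physical_devices('GPU')
tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

pipe = data_pipeline(shard_id=hvd.rank(),        # one shard per process
                     num_threads=num_threads,
                     device_id=hvd.local_rank())
dataset = dali_tf.DALIDataset(
    pipeline=pipe,
    batch_size=args.batch_size,                  # per-process batch size
    output_shapes=((None, None, None, 3), (None, None, None, 3)),
    output_dtypes=(tf.uint8, tf.uint8),
    device_id=hvd.local_rank())
```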
Thank you @JanuszL. But I still don't understand: even though the number of iterations is half of the 4-GPU run, what could be the reason for the higher total training time? Also, the profiling data shows high input and preprocessing time for the 8-GPU training.
1) fn.decoders.image_crop(pngs, device='mixed', output_type=types.RGB, crop=[args.crop_height, args.crop_width])
2) images = fn.normalize(images)
3) labels = fn.one_hot(labels, axis=-1, num_classes=args.num_classes)
On which device is each of the above three operations placed?
Thank you
@PurvangL,
Regarding "what could be the reason for the high total training time?": I think this is the time needed to instantiate all the data loader instances, and it takes a lot of time because this process is serialized.
Regarding the high input and preprocessing time shown for the 8-GPU training: do you mean the processing itself or the instantiation? I didn't see that in the profile.
I see, got it. Here is the input time for 8 GPUs vs 4 GPUs:
[screenshot: 8-GPU input time]
[screenshot: 4-GPU input time]
I think this analysis accounts not only for the processing but also for the iterator creation time. I recommend checking the timeline view.
Regarding the earlier suggestion to try Horovod for the multi-GPU training:
With Horovod, I followed #1852 and am getting the following error. When I run with np=1, it runs without any error; using np > 1 causes this error. Below is the output with np=2 (shapes are (BS, H, W, C)):
[1,0]<stderr>: TypeError: Inputs to a layer should be tensors. Got: PerReplica:{
[1,0]<stderr>: 0: Tensor("IteratorGetNext:0", shape=(5, 256, 256, 3), dtype=float32),
[1,0]<stderr>: 1: Tensor("IteratorGetNext_1:0", shape=(5, 256, 256, 3), dtype=float32)
[1,0]<stderr>: }
Using strategy.scope() instead of "with tf.device('/gpu:0'):" from #1852 made it work. Below is the working example:
https://docs.google.com/document/d/1t8SFnSg8Bg7Ze7sJ8d6-VdigI8wtK02cKKYLCYB1Gnw/edit?usp=sharing
However, the scripts run even slower now. I am not sure why I am still getting this error, but the script doesn't fail.
Hi @PurvangL,
To be honest, I'm not a TensorFlow expert, so I cannot tell what the source of this error/warning is. Regarding the performance, have you tried capturing a profile?
@JanuszL Thank you. That has improved the training time; I still have to test it on the servers. I still have 2 open questions:
1) In DALIDataset, the result for num_threads = 32 seems to be the best compared to num_threads = 64, 112, or 128 (out of 256 CPU cores). How does NVIDIA set the num_threads parameter in its MLPerf submissions?
2) With Horovod, I am sharding the training data by the number of processes, so that each shard has its own set of images and labels from the full training data. But for the validation data I am confused: if I shard it and give each worker its own data, how would that work? For 4 processes, I see 4 different results (loss and IoUs) at the end of an iteration, not similar ones.
Thank you
Hi @PurvangL,
Regarding num_threads: it is mostly chosen based on the number of CPU cores available per GPU and the requirements from the deep learning framework side (how much CPU power it needs). On top of that, it very much depends on the pipeline itself and the batch size (in most cases DALI doesn't parallelize within samples).
Regarding validation: you can leave it to only one worker or accumulate the results across the shards. I think you will find a couple of good examples on the internet, e.g. in the NVIDIA Deep Learning Examples repository.
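For the accumulation option, a minimal sketch using Horovod's allreduce (evaluate_on_local_shard is a hypothetical helper that computes the metric on one worker's validation shard; hvd.allreduce averages across workers by default):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

local_mean_iou = evaluate_on_local_shard()          # hypothetical per-worker metric
global_mean_iou = hvd.allreduce(
    tf.constant(local_mean_iou, dtype=tf.float32))  # averaged over all workers
if hvd.rank() == 0:
    print("validation mean IoU:", float(global_mean_iou))
```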
Thanks @JanuszL. My project uses Horovod, DALI for data preprocessing, and TensorFlow's fit for running the training loop. I divide the validation data across the workers so each gets its own shard; for each, a DALIDataset object is created and passed to the fit function, assuming fit will accumulate the results across the shards. I also use MetricAverageCallback from Horovod. For me, the validation accuracy on the same dataset comes out about 10% lower compared to the other modes of training:
1) (hvd + Dali + tf.fit) = val_meaniou 20%
2) (hvd + tfdataset + tf.fit) = val_meaniou 28%
3) (Dali + tf.fit) = val_meaniou 29%
4) (tfdataset + tf.fit) = val_meaniou 29%
Any example for the 1st case, or any suggestion to improve the result? Thank you
@PurvangL - thank you for the detailed analysis. Do you run full training each time, or just validation on a pre-trained model? Also, how does the result change with the number of GPUs? Maybe there is a mistake connected with the multi-GPU setup.
@JanuszL I run full training each time with the same seed. Below is a snapshot of my run results. Note: in both of these results, the DALIDataset object is the same; the only part that changes is whether I use the TensorFlow API (strategy and strategy.distribute_datasets_from_function) or the Horovod API (sharding the dataset and hvd-related callbacks with MPI).
Hi @PurvangL,
Thank you for the additional data points. What I was asking is whether the validation part is always the same; otherwise the problem could be either in training or in validation. Also, can you verify that the samples returned by both approaches are the same? I don't expect any difference between the hvd_dali and tf_dali cases.
Hello, I am training an image semantic segmentation network on multiple GPUs (4 GPUs). Currently my data pipeline uses tf.data.experimental_distribute_dataset from TensorFlow and a mirrored strategy. I want to use the DALI library for better performance. I have checked out some examples, but none of them helped me. Below is the code that I have added, but it is not working. Any suggestion would be helpful.
Directory structure:
I have added the code below in my example.
Please let me know if any additional information is needed. Thank you