PurvangL opened this issue 2 years ago
Hey @PurvangL,
could you elaborate on how you want to crop those images/labels?
Cropping without normalizing can be achieved with the fn.crop operator. It has the same set of crop-related arguments as fn.crop_mirror_normalize.
I guess you want to set the crop argument to [args.crop_height, args.crop_width]. If you want to randomize the position of the crop, you can use fn.random.uniform operators and pass their outputs to the crop_pos_x and crop_pos_y arguments of fn.crop_mirror_normalize and fn.crop.
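For illustration, a minimal sketch of that approach (assuming a directory readable by fn.readers.file; the 512x512 crop size stands in for args.crop_height/args.crop_width):

```python
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def

@pipeline_def(batch_size=8, num_threads=4, device_id=0)
def random_crop_pipeline(image_dir):
    encoded, _ = fn.readers.file(file_root=image_dir, random_shuffle=True)
    images = fn.decoders.image(encoded, device='mixed', output_type=types.RGB)
    # Random window position in [0, 1]; feeding the same values to fn.crop on
    # both images and labels keeps the two aligned.
    pos_x = fn.random.uniform(range=(0.0, 1.0))
    pos_y = fn.random.uniform(range=(0.0, 1.0))
    images = fn.crop(images, crop=[512, 512],
                     crop_pos_x=pos_x, crop_pos_y=pos_y)
    return images
```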
Hi @banasraf, thank you for your reply. I have modified the function as below to apply cropping to both images and labels.
@pipeline_def(batch_size=args.batch_size)
def data_pipeline(shard_id):
    pngs = fn.readers.caffe2(
        path=os.path.join(args.dataset, "train", "images"), random_shuffle=True, shard_id=shard_id, num_shards=args.num_gpus)
    labels = fn.readers.caffe2(
        path=os.path.join(args.dataset, "train", "labels"), random_shuffle=True, shard_id=shard_id, num_shards=args.num_gpus)
    images = fn.decoders.image(pngs, device='mixed', output_type=types.RGB)
    labels = fn.decoders.image(labels, device='mixed', output_type=types.RGB)
    # images = fn.crop_mirror_normalize(
    #     images, dtype=types.FLOAT, std=[255.], output_layout="CHW")
    images = fn.crop(images, crop=[args.crop_height, args.crop_width])
    labels = fn.crop(labels, crop=[args.crop_height, args.crop_width])
    # return images, labels.gpu()
    return images, labels
I am getting the following error when I run the program.
Traceback (most recent call last):
  File "distributed_training/distributed_training_dali.py", line 344, in <module>
    train_dataset = strategy.distribute_datasets_from_function(dataset_fn, input_options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1160, in distribute_datasets_from_function
    dataset_fn, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 597, in _distribute_datasets_from_function
    options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 168, in get_distributed_datasets_from_function
    input_contexts, dataset_fn, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 1566, in __init__
    input_contexts, self._input_workers, dataset_fn))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 2301, in _create_datasets_from_function_with_input_context
    dataset = dataset_fn(ctx)
  File "distributed_training/distributed_training_dali.py", line 328, in dataset_fn
    device_id=device_id)
  File "/usr/local/lib/python3.6/dist-packages/nvidia/dali/plugin/tf.py", line 768, in __init__
    if _has_external_source(pipeline):
  File "/usr/local/lib/python3.6/dist-packages/nvidia/dali/external_source.py", line 787, in _has_external_source
    pipeline._build_graph()
  File "/usr/local/lib/python3.6/dist-packages/nvidia/dali/pipeline.py", line 615, in _build_graph
    raise TypeError(f"Illegal pipeline output type. The output {i} contains a nested "
TypeError: Illegal pipeline output type. The output 0 contains a nested `DataNode`. Missing list/tuple expansion (*) is the likely cause.
Could you please let me know how I can fix this? Thank you
@banasraf Is there any additional information I need to provide? Please let me know. Thank you
@PurvangL
Okay, I managed to find the issue. The caffe2 reader returns files and labels by default, so the output of the operator is a pair (files, labels). This means that in your case images is a pair of DataNodes.
I guess that in your case you don't have any numeric labels for your images, so you might pass label_type=4 to the reader (see the documentation of the operator). This way the operator will return the images only.
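As a sketch of that change (keeping the names from the snippet above; per the note, with label_type=4 the reader is expected to return only the image data, i.e. a single DataNode instead of a (data, label) pair):

```python
pngs = fn.readers.caffe2(
    path=os.path.join(args.dataset, "train", "images"),
    label_type=4,                    # no numeric labels; return image data only
    random_shuffle=True,
    shard_id=shard_id,
    num_shards=args.num_gpus)
images = fn.decoders.image(pngs, device='mixed', output_type=types.RGB)
```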
@PurvangL
Also, please make sure that the order of labels and images is the same between those two LMDB files. The two readers have the same random seed in your case, so they should generate the same order with random_shuffle, but only assuming the same source order in the datasets.
Hi @banasraf, thank you for your reply. This is the simplest example I am trying to get working. After adding your suggestions, I still cannot run it without errors. Below is the link to the example I am trying to run; the tf_dali.py file contains all the information, including the dependencies to install and the command to run.
https://drive.google.com/drive/folders/1iYk1wIRupy34z6bixHdErMX8859lF-UD?usp=sharing
Please let me know if any additional information is needed.
Regards, Purvang
Hi @PurvangL,
I think the file reader operator is the one you are looking for:
@pipeline_def(batch_size=args.batch_size)
def data_pipeline(shard_id):
    images_path = os.path.join(args.dataset, "train", "images")
    labels_path = os.path.join(args.dataset, "train", "labels")
    images_files = os.listdir(images_path)
    labels_files = os.listdir(labels_path)
    images_files = [os.path.join(images_path, f) for f in images_files]
    labels_files = [os.path.join(labels_path, f) for f in labels_files]
    pngs, _ = fn.readers.file(
        files=images_files,
        random_shuffle=True,
        shard_id=shard_id,
        num_shards=args.num_gpus,
        seed=SEED)
    labels, _ = fn.readers.file(
        files=labels_files,
        random_shuffle=True,
        shard_id=shard_id,
        num_shards=args.num_gpus,
        seed=SEED)
    images = fn.decoders.image(pngs, device='mixed', output_type=types.RGB)
    labels = fn.decoders.image(labels, device='mixed', output_type=types.RGB)
    images = fn.crop(images, crop=[args.crop_height, args.crop_width])
    labels = fn.crop(labels, crop=[args.crop_height, args.crop_width])
    return images, labels
Just make sure that the order of files inside images_files and labels_files matches.
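One simple way to guarantee that order (a sketch, assuming image and label files share the same base names, e.g. 0001.png in both directories):

```python
images_files = sorted(os.listdir(images_path))
labels_files = sorted(os.listdir(labels_path))
images_files = [os.path.join(images_path, f) for f in images_files]
labels_files = [os.path.join(labels_path, f) for f in labels_files]
```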
Thank you @JanuszL. I was able to start training with the changes, but there are two scenarios I am having difficulties with.
1) Training time: after adding the DALI pipeline and training on multiple GPUs, each iteration and the overall training time increased 5x compared to the tf.data.Dataset input pipeline. What could be wrong here?
2) Batch size: with the tf.data.Dataset input pipeline, I was able to feed a global batch size of 162 for 2 GPUs. Now I can only use a batch size of 6, and if I increase it, it throws OOM. Is the batch size of 6 local, i.e. per GPU, making the global batch size 12?
Thank you, Purvang
Hi @PurvangL,
1) I would check the epoch length. Maybe there is some issue with setting num_shards and shard_id in DALI, so each GPU goes over the whole dataset instead of only the shard that should be assigned to it.
2) The batch size you set in DALI is per GPU. It is expected that moving data processing to the GPU will limit the memory available for the model, hence you may want to reduce the batch size; however, the overall throughput may still increase in some cases.
If you suspect something is wrong with the overall performance, you may check GPU utilization first and then capture a profile.
@JanuszL Thanks for your reply. Using DALI, there are 59 iterations to complete each epoch when the batch size is 50 (per GPU, based on your last message), so the global batch size is 400 and 8 GPUs are used. Using tf.data.Dataset with a global batch size of 400 (per-GPU batch size 50), only 7 iterations were needed to complete one epoch. How can I correct this?
@PurvangL,
Please make sure that shard_id and num_shards are set accordingly for each GPU. If you can share your script or some minimal repro that just iterates over the dataset, I can check it.
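For reference, a sketch of deriving the shard from the tf.distribute input context inside the dataset function (data_pipeline and args come from the snippets above; the per-replica input options follow the pattern shown in the DALI TF plugin documentation, so treat the exact option names as an assumption for your TF version):

```python
import tensorflow as tf
import nvidia.dali.plugin.tf as dali_tf

def dataset_fn(input_context):
    # With per-replica input replication this function is called once per GPU,
    # so input_pipeline_id can serve as both shard_id and device_id.
    device_id = input_context.input_pipeline_id
    with tf.device("/gpu:{}".format(device_id)):
        pipe = data_pipeline(shard_id=device_id,
                             num_threads=4, device_id=device_id)
        return dali_tf.DALIDataset(
            pipeline=pipe,
            batch_size=args.batch_size,   # per-GPU batch size
            output_shapes=((None, None, None, 3), (None, None, None, 3)),
            output_dtypes=(tf.uint8, tf.uint8),
            device_id=device_id)

input_options = tf.distribute.InputOptions(
    experimental_place_dataset_on_device=True,
    experimental_fetch_to_device=False,
    experimental_replication_mode=tf.distribute.InputReplicationMode.PER_REPLICA)

train_dataset = strategy.distribute_datasets_from_function(dataset_fn, input_options)
```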
Sure @JanuszL
https://drive.google.com/drive/folders/1iYk1wIRupy34z6bixHdErMX8859lF-UD?usp=sharing
This link has both the script and the dataset. Please let me know if anything is missing. Thank you
@PurvangL,
I think you need to adjust steps_per_epoch based on the batch size per GPU and the number of GPUs.
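In other words (a back-of-the-envelope sketch with a hypothetical dataset size):

```python
import math

dataset_size = 2800          # hypothetical number of training images
per_gpu_batch_size = 50
num_gpus = 8

# Each GPU only sees its own shard, so one epoch is the dataset size divided
# by the global batch size (per-GPU batch size times the number of GPUs).
steps_per_epoch = math.ceil(dataset_size / (per_gpu_batch_size * num_gpus))
print(steps_per_epoch)       # 7, not dataset_size / per_gpu_batch_size = 56
```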
Thank you for your reply @JanuszL. With this change, I was able to get the correct number of iterations per epoch. I am using 2 x A100 40GB NVIDIA GPUs, and with the DALI pipeline I am only able to get a per-GPU batch size of 4 (global batch size 8), whereas with tf.data I can use a global batch size of 421, and if the dataset were bigger I could go up to 2600 with an image resolution of (128, 128).
My question is: why is there such a big difference in the maximum batch size when using DALI?
In the Google Drive link below I have uploaded my test results for the CamVid dataset comparing the DALI and tf.data pipelines. DALI still seems a bit slower, and I see a difference in val_mean_iou as well. What could be the reason?
https://drive.google.com/drive/folders/1iYk1wIRupy34z6bixHdErMX8859lF-UD?usp=sharing
Please let me know if any additional information is needed.
Thank you
Maybe TensorFlow competes with DALI for the GPU memory. Can you try limiting TF to use only as much memory as it needs?
@JanuszL Thank you for your reply.
After some trial and error, I limited TensorFlow to 4GB of memory on each GPU (out of 40GB) when using the DALI data loading pipeline. The maximum batch size I am able to run with the tf.data pipeline is 650 (global) on 4 x A100 40GB GPUs, and the maximum with the DALI pipeline is 64 (global) on the same hardware.
Also, when I check GPU memory usage with "nvidia-smi", the DALI pipeline uses 6867 MiB out of 40GB on each GPU, while the tf.data pipeline uses 39739 MiB out of 40GB. I have attached my test results below.
Is there anything I am missing, or any suggestions for improving training time?
Thank you
Hi @PurvangL,
I added:
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
at the very beginning of the training script. I can run batch_size 128 with 12GB of GPU memory. Can you try this out on your side?
Thank you @JanuszL. First of all, based on your previous reply about TensorFlow competing for GPU memory: tf.config.experimental.set_memory_growth(gpu, True) was already present, and I changed it to limit the GPU memory for each GPU using
tf.config.set_logical_device_configuration(gpu, [tf.config.LogicalDeviceConfiguration(memory_limit=4000)])
I was also able to fit a global batch size of 300 using DALI (150 per GPU, 2 GPUs in total) and trained on the Cityscapes dataset; the training time for 5 epochs is 2m6.087s. On the other hand, I was able to use a global batch size of 700 with tf.data, with a training time of 1m31.264s. I am trying to understand why training with DALI data loading is slower. What improvements do I need to make to train faster?
Thank you
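For completeness, the two ways of keeping TensorFlow from grabbing the whole GPU that are discussed above, as a sketch (the 4000 MB cap is just the value used here; the configuration has to run before TensorFlow initializes the GPUs):

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')

# Option 1: grow TF's allocation on demand, leaving the rest of the memory for DALI.
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# Option 2 (instead of option 1): hard-cap TF to a fixed budget per GPU.
# for gpu in gpus:
#     tf.config.set_logical_device_configuration(
#         gpu, [tf.config.LogicalDeviceConfiguration(memory_limit=4000)])
```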
Hi @PurvangL,
Based on https://github.com/NVIDIA/DALI/issues/4179#issuecomment-1238613846 I thought that you could not use a batch size greater than 16. One thing you may want to check is this blog post, so you can learn whether your training is really limited by the data processing. Also, your datasets consist of PNG images, and DALI doesn't provide GPU acceleration for them (we use nvJPEG and nvJPEG2000 to GPU-accelerate JPEG and JPEG2000 decoding, but there is no GPU-accelerated PNG decoder). So maybe in your case the GPU is already loaded with work, and even moving the crop operation there doesn't provide much value. Also, did you use the same number of CPU threads in DALI and tf.data (so we know that we are comparing apples to apples)?
@JanuszL, thank you for the answer. Point noted regarding the image format.
https://drive.google.com/drive/folders/1iYk1wIRupy34z6bixHdErMX8859lF-UD?usp=sharing
This Google Drive location contains both scripts, one with DALI and the other with tf.data.
My goal is to see whether using DALI can reduce training time or not.
Note: I have duplicated the images to test with a higher batch size, as my main goal is to reduce training time. I have not updated the Google Drive dataset with the duplicated images, and it still contains ~421 training images.
Please let me know if anything is needed from my end.
Hi @PurvangL,
Can you capture the profile according to this guide and see what the timeline looks like?
Hi @JanuszL, I have rerun both scripts with profiling and added the profiling data to the Google Drive link mentioned above. Could you please take a look and share your analysis and recommended next steps?
Thank you
@PurvangL,
The main difference I see is that TF uses all available CPU cores while DALI sticks to the default value in your case. You can use num_threads in the DALIDataset to assign more threads to DALI and speed up the PNG decoding (which is CPU-bound). The rest of the processing is almost negligible, and there is little speed-up from moving it to the GPU (in your case it is only the crop).
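A sketch of passing a larger num_threads through the DALIDataset (names follow the earlier snippets; the right thread count is workload-dependent, as the rest of this thread shows):

```python
import multiprocessing
import tensorflow as tf
import nvidia.dali.plugin.tf as dali_tf

num_threads = min(16, multiprocessing.cpu_count())   # hypothetical starting point
pipe = data_pipeline(shard_id=device_id,
                     num_threads=num_threads, device_id=device_id)
train_dataset = dali_tf.DALIDataset(
    pipeline=pipe,
    batch_size=args.batch_size,
    num_threads=num_threads,         # CPU threads available for PNG decoding
    output_shapes=((None, None, None, 3), (None, None, None, 3)),
    output_dtypes=(tf.uint8, tf.uint8),
    device_id=device_id)
```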
@JanuszL, indeed that solved the problem. So in summary, I was trying to compare 1) a 4 x A100 (80GB each) PCIe machine and 2) a 4 x A100 (80GB each) SXM4 machine in terms of training and inference time. Using the MLPerf submission by NVIDIA, the SXM machine was definitely faster, but when I used other algorithms not submitted to MLPerf, the result was the other way around. Analyzing the profiling results, the SXM machine was taking more time for kernel launches (TF profiling) and memcpy operations (ncu profiling) (any idea why?), which I felt could be the reason. Integrating DALI into the algorithm did help and made training on the SXM machine faster. Are there any other tricks or features you would recommend to use the SXM machine to its full potential?
Related issue:
https://forums.developer.nvidia.com/t/file-not-found-cuvectoraddmulti-exe/227440/2
Thank you
Hi @PurvangL,
Can you check if the CPU and GPU clocks are set the same way on both machines? For example, this GPU talk can provide some guidance.
@JanuszL Thank you for the reply, I will. Meanwhile, I am facing a new problem after integrating DALI: I am getting poor results (both training time and mean_iou) when increasing the batch size or the number of GPUs used for training. My test results are below. I am linearly scaling the learning rate with the number of GPUs.
What could be the reason?
Thanks
Hi @PurvangL,
Can you confirm that the number of iterations decreases with the increase of the batch size or the number of GPUs? Can you also do a dry run to make sure that all dataset samples are returned during one epoch?
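A possible dry run, as a sketch (iterating the per-GPU DALIDataset of one shard, with train_dataset and steps_per_epoch as defined earlier):

```python
import tensorflow as tf

# Count how many samples one GPU's shard actually yields in one epoch.
samples_seen = 0
for images, labels in train_dataset.take(steps_per_epoch):
    samples_seen += int(tf.shape(images)[0])
print("samples seen on this shard:", samples_seen)
# Summed over all shards, this should match the dataset size
# (up to padding of the last, partial batch).
```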
Hi @JanuszL, yes, I confirm that with increasing batch size and number of GPUs the number of iterations does reduce linearly. I have two questions here, for the case where I use num_threads = cpu_count(): 1) on the same machine, increasing the batch size together with the number of GPUs takes more time for the same dataset and the same number of epochs (and I confirm the iteration count reduces linearly with the increase in batch size and GPUs); 2) a machine with 2x the GPU memory and double the batch size takes more time.
I have attached my results below.
When I use num_threads = 64 the results do change, but 8 GPUs are still a concern.
Do I need to find an optimal num_threads?
Thank you
Hi @PurvangL,
Can you check what the iteration time is? How does it change with the number of GPUs (I would expect it to remain similar)?
Can you also tell me what is in the Total_Epochs column? I would expect the number of epochs (the number of passes over the whole dataset) to remain the same regardless of the number of GPUs used and the batch size.
Hi @JanuszL, thank you for your reply. I am currently facing the issue of longer run times when increasing the number of GPUs and the batch size on the 8 x A100 80GB server; apart from this, all servers (PCIe, SXM) give results as expected.
Iteration time: 2 GPUs: 7s, 4 GPUs: 5s, 8 GPUs: 5s.
Shouldn't the 8-GPU iteration time be lower than the 4-GPU one?
Hi @PurvangL,
I would expect that the iteration time would slightly increase when you increase the number of GPUs (synchronization and data exchange overhead), but the total number of iterations decreases and that is the gain. I would capture the profiles from both cases and compare them.
Hi @JanuszL,
I have captured the profiling data for both 4-GPU and 8-GPU training and attached the results here:
https://drive.google.com/drive/folders/18ZU6xqZP-zMG4QyvDkNo69Jntygo__gh?usp=sharing
The 8-GPU run seems highly input-bound compared to the 4-GPU one. Why, and how can this be reduced?
Note: both results were taken on the same machine, with the same code, dataset, and hyperparameters (except lr), and with num_threads = 32 (cpu_cores: 256).
Thank you
I see that the iteration time is roughly the same, ~260ms. The test phase (I think this is the validation) differs significantly: 1s with 8 GPUs vs 250ms with 4 GPUs. Also, the time between the end of training in an epoch and the validation phase is proportional to the number of GPUs, and in the profile I see that it is spent instantiating the DALI tf.data iterator. Have you tried using Horovod to run the multi-GPU training (so there is no unnecessary serial load proportional to the number of GPUs, since each process needs to serve only one GPU)?
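A rough sketch of the Horovod layout suggested here (one process per GPU, each building a single DALI pipeline; with Horovod, num_shards inside the pipeline should be hvd.size() rather than args.num_gpus):

```python
import horovod.tensorflow.keras as hvd
import tensorflow as tf
import nvidia.dali.plugin.tf as dali_tf

hvd.init()

# Pin each process to exactly one GPU.
gpus = tf.config.list_physical_devices('GPU')
tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

pipe = data_pipeline(shard_id=hvd.rank(),        # one shard per process
                     num_threads=num_threads,
                     device_id=hvd.local_rank())
dataset = dali_tf.DALIDataset(
    pipeline=pipe,
    batch_size=args.batch_size,                  # per-process batch size
    output_shapes=((None, None, None, 3), (None, None, None, 3)),
    output_dtypes=(tf.uint8, tf.uint8),
    device_id=hvd.local_rank())
```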
Thank you @JanuszL. But I still don't understand: even though the number of iterations is half of the 4-GPU run, what could be the reason for the higher total training time? Also, the profiling data shows high input and preprocessing time for the 8-GPU training.
1) fn.decoders.image_crop(pngs, device='mixed', output_type=types.RGB, crop=[args.crop_height, args.crop_width])
2) images = fn.normalize(images)
3) labels = fn.one_hot(labels, axis=-1, num_classes=args.num_classes)
On which device is each of the above three operations placed?
Thank you
@PurvangL,
Regarding "what could be the reason for the high total training time?": I think this is the time needed to instantiate all the data loader instances, and it takes a lot of time because this process is serialized.
Regarding the high input and preprocessing time shown for the 8-GPU training: do you mean the processing itself or the instantiation? I didn't see that in the profile.
I see, got it. Here is the input time for 8 GPUs vs 4 GPUs:
[screenshot: 8-GPU input time]
[screenshot: 4-GPU input time]
I think this analysis accounts not only for the processing but also for the iterator creation time. I recommend checking the timeline view.
Regarding the earlier suggestion to try Horovod for the multi-GPU training:
With Horovod, I followed #1852 and am getting the following error. When I run with np=1, it runs without any error; using np > 1 causes this error. Below is the output with np=2 (shapes are (BS, H, W, C)):
[1,0]<stderr>: TypeError: Inputs to a layer should be tensors. Got: PerReplica:{
[1,0]<stderr>: 0: Tensor("IteratorGetNext:0", shape=(5, 256, 256, 3), dtype=float32),
[1,0]<stderr>: 1: Tensor("IteratorGetNext_1:0", shape=(5, 256, 256, 3), dtype=float32)
[1,0]<stderr>: }
Using strategy.scope() instead of "with tf.device('/gpu:0'):" from #1852 made it work. Below is the working example:
https://docs.google.com/document/d/1t8SFnSg8Bg7Ze7sJ8d6-VdigI8wtK02cKKYLCYB1Gnw/edit?usp=sharing
However, the scripts run even slower now. I am not sure why I am still getting this error, but the script doesn't fail.
Hi @PurvangL,
To be honest, I'm not a TensorFlow expert, so I cannot tell what the source of this error/warning is. Regarding the performance, have you tried capturing a profile?
@JanuszL Thank you. That has improved the training time; I still have to test it on the servers. I still have 2 open questions:
1) In DALIDataset, the result for num_threads = 32 seems to be the best compared to num_threads = 64, 112, or 128 (out of 256 CPU cores). How does NVIDIA set the num_threads parameter in its MLPerf submissions?
2) With Horovod, I am sharding the training data by the number of processes, so that each shard has its own set of images and labels from the full training data. But for the validation data I am confused: if I shard it and give each worker its own data, how would that work? For 4 processes, I see 4 different results (loss and IoUs) at the end of an iteration, not similar ones.
Thank you
Hi @PurvangL,
Regarding num_threads: it is mostly chosen based on the number of CPU cores available per GPU and the requirements from the deep learning framework side (how much CPU power it needs). On top of that, it very much depends on the pipeline itself and the batch size (in most cases DALI doesn't parallelize within samples).
Regarding validation: you can leave it to only one worker or accumulate the results across the shards. I think you will find a couple of good examples on the internet, e.g. in the NVIDIA Deep Learning Examples repository.
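For the accumulation option, a minimal sketch using Horovod's allreduce (evaluate_on_local_shard is a hypothetical helper that computes the metric on one worker's validation shard; hvd.allreduce averages across workers by default):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

local_mean_iou = evaluate_on_local_shard()          # hypothetical per-worker metric
global_mean_iou = hvd.allreduce(
    tf.constant(local_mean_iou, dtype=tf.float32))  # averaged over all workers
if hvd.rank() == 0:
    print("validation mean IoU:", float(global_mean_iou))
```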
Thanks @JanuszL. My project uses Horovod, DALI for data preprocessing, and TensorFlow's fit for running the training loop. I divide the validation data across the workers so each gets its own shard; for each, a DALIDataset object is created and passed to the fit function, assuming fit will accumulate the results across the shards. I also use MetricAverageCallback from Horovod. For me, the validation accuracy on the same dataset comes out about 10% lower compared to the other modes of training:
1) (hvd + Dali + tf.fit) = val_meaniou 20%
2) (hvd + tfdataset + tf.fit) = val_meaniou 28%
3) (Dali + tf.fit) = val_meaniou 29%
4) (tfdataset + tf.fit) = val_meaniou 29%
Any example for the 1st case, or any suggestion to improve the result? Thank you
@PurvangL - thank you for the detailed analysis. Do you run full training each time, or just validation on a pre-trained model? Also, how does the result change with the number of GPUs? Maybe there is a mistake connected with the multi-GPU setup.
@JanuszL I run full training each time with the same seed. Below is a snapshot of my run results. Note: in both of these results, the DALIDataset object is the same; the only part that changes is whether I use the TensorFlow API (strategy and strategy.distribute_datasets_from_function) or the Horovod API (sharding the dataset and hvd-related callbacks with MPI).
Hi @PurvangL,
Thank you for the additional data points. What I was asking is whether the validation part is always the same; otherwise the problem could be either in training or in validation. Also, can you verify that the samples returned by both approaches are the same? I don't expect any difference between the hvd_dali and tf_dali cases.
Hello, I am training an image semantic segmentation network on multiple GPUs (4 GPUs). Currently my data pipeline uses tf.data.experimental_distribute_dataset from TensorFlow and a mirrored strategy. I want to use the DALI library for better performance. I have checked out some examples, but none of them helped me. Below is the code that I have added, but it is not working. Any suggestion would be helpful.
Directory structure:
I have added the code below in my example.
Please let me know if any additional information is needed. Thank you