NVIDIA-Merlin / HugeCTR

HugeCTR is a high-efficiency GPU framework designed for Click-Through-Rate (CTR) estimation training
Apache License 2.0

[BUG] cannot run sok demo with official image #383

Closed ZhuYuJin closed 1 year ago

ZhuYuJin commented 1 year ago

Describe the bug
I tried to run the SOK demo https://github.com/NVIDIA-Merlin/HugeCTR/tree/main/sparse_operation_kit/documents/tutorials/DenseDemo in the Docker image provided in the documentation https://nvidia-merlin.github.io/HugeCTR/sparse_operation_kit/master/intro_link.html#installation, and encountered the following error. (screenshot attached)

To Reproduce

docker run nvcr.io/nvidia/merlin/merlin-tensorflow:22.09

git clone https://github.com/NVIDIA-Merlin/HugeCTR.git

cd HugeCTR/sparse_operation_kit/documents/tutorials/DenseDemo

python3 gen_data.py \
    --global_batch_size=65536 \
    --slot_num=100 \
    --nnz_per_slot=10 \
    --iter_num=30 

python3 split_data.py \
    --filename="./data.file" \
    --split_num=8 \
    --save_prefix="./data_"

horovodrun -np 8 -H localhost:8 \
    python3 run_sok_horovod.py \
    --data_filename_prefix="./data_" \
    --global_batch_size=65536 \
    --max_vocabulary_size_per_gpu=1024 \
    --slot_num=100 \
    --nnz_per_slot=10 \
    --num_dense_layers=6 \
    --embedding_vec_size=4 \
    --optimizer="adam"
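For what it's worth, the `_cardinality: 1966080` in the log further down is exactly 65536 × 30, i.e. `global_batch_size × iter_num` samples, and the `output_shapes` show `[slot_num, nnz_per_slot]` int64 keys plus a single label per sample. A hypothetical NumPy sketch of data with those shapes (the real `gen_data.py` may differ):

```python
import numpy as np

# Hypothetical sketch of the data layout the arguments above imply;
# the actual gen_data.py in the DenseDemo may generate data differently.
def gen_random_samples(global_batch_size, slot_num, nnz_per_slot,
                       iter_num, vocab_size=8192):
    num_samples = global_batch_size * iter_num
    # int64 lookup keys: nnz_per_slot ids per slot, per sample
    keys = np.random.randint(0, vocab_size,
                             size=(num_samples, slot_num, nnz_per_slot),
                             dtype=np.int64)
    # one binary label per sample
    labels = np.random.randint(0, 2, size=(num_samples, 1)).astype(np.float32)
    return keys, labels

keys, labels = gen_random_samples(global_batch_size=8, slot_num=100,
                                  nnz_per_slot=10, iter_num=2)
print(keys.shape, labels.shape)  # (16, 100, 10) (16, 1)
```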
kanghui0204 commented 1 year ago

Hi @ZhuYuJin, I think the problem is that your hostname, `merlin-tensorflow_sok`, contains an underscore ("_"). According to the wiki, "a hostname may not contain other characters, such as the underscore character (_)". Could you change your hostname and try again? FYI @bashimao
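As a quick sanity check, RFC 952/1123 hostname labels may contain only letters, digits, and hyphens, and may not start or end with a hyphen. A small sketch (hypothetical helper, not part of SOK or Horovod):

```python
import re

# RFC 1123: each dot-separated label is 1-63 chars of letters, digits,
# and hyphens, not starting or ending with a hyphen; underscores are
# not allowed, which is likely what trips up the Horovod run above.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")

def is_valid_hostname(name: str) -> bool:
    if not name or len(name) > 253:
        return False
    return all(_LABEL.match(label) for label in name.split("."))

print(is_valid_hostname("merlin-tensorflow-sok"))  # True
print(is_valid_hostname("merlin-tensorflow_sok"))  # False: underscore
```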

ZhuYuJin commented 1 year ago

> Hi @ZhuYuJin, I think the problem is that your hostname, `merlin-tensorflow_sok`, contains an underscore ("_"). According to the wiki, "a hostname may not contain other characters, such as the underscore character (_)". Could you change your hostname and try again? FYI @bashimao

The error is gone after I changed my container name.

But I encountered another error when running run_sok_MirroredStrategy.py. My test node has two NVIDIA A10s.

...
2023-04-06 08:16:30.301888: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:776] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_2"
op: "TensorSliceDataset"
input: "Placeholder/_0"
input: "Placeholder/_1"
attr {
  key: "Toutput_types"
  value {
    list {
      type: DT_INT64
      type: DT_INT64
    }
  }
}
attr {
  key: "_cardinality"
  value {
    i: 1966080
  }
}
attr {
  key: "is_files"
  value {
    b: false
  }
}
attr {
  key: "metadata"
  value {
    s: "\n\024TensorSliceDataset:0"
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
        dim {
          size: 100
        }
        dim {
          size: 10
        }
      }
      shape {
        dim {
          size: 1
        }
      }
    }
  }
}
experimental_type {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_DATASET
    args {
      type_id: TFT_PRODUCT
      args {
        type_id: TFT_TENSOR
        args {
          type_id: TFT_INT64
        }
      }
      args {
        type_id: TFT_TENSOR
        args {
          type_id: TFT_INT64
        }
      }
    }
  }
}

You are using the plugin with MirroredStrategy.
2023-04-06 08:16:31.768991: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:107] Mapping from local_replica_id to device_id:
2023-04-06 08:16:31.768991: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:109] 0 -> 0
2023-04-06 08:16:31.768991: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:109] 1 -> 1
2023-04-06 08:16:31.768991: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:84] Global seed is 782420126
2023-04-06 08:16:31.768991: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:85] Local GPU Count: 2
2023-04-06 08:16:31.768991: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:86] Global GPU Count: 2
2023-04-06 08:16:31.768991: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:127] Global Replica Id: 0; Local Replica Id: 0
2023-04-06 08:16:31.768991: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:127] Global Replica Id: 1; Local Replica Id: 1
2023-04-06 08:16:33.768993: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:200] Not all peer to peer access enabled.
2023-04-06 08:16:34.768994: I sparse_operation_kit/kit_cc/kit_cc_infra/src/parameters/raw_manager.cc:132] Created embedding variable whose name is EmbeddingVariable
2023-04-06 08:16:34.768994: I sparse_operation_kit/kit_cc/kit_cc_infra/src/parameters/raw_manager.cc:132] Created embedding variable whose name is EmbeddingVariable/replica_1/
2023-04-06 08:16:41.769001: I sparse_operation_kit/kit_cc/kit_cc_infra/src/parameters/raw_param.cc:121] Variable: EmbeddingVariable on global_replica_id: 0 start initialization
2023-04-06 08:16:41.769001: I sparse_operation_kit/kit_cc/kit_cc_infra/src/parameters/raw_param.cc:138] Variable: EmbeddingVariable on global_replica_id: 0 initialization done.
2023-04-06 08:16:41.769001: I sparse_operation_kit/kit_cc/kit_cc_infra/src/parameters/raw_param.cc:121] Variable: EmbeddingVariable on global_replica_id: 1 start initialization
2023-04-06 08:16:41.769001: I sparse_operation_kit/kit_cc/kit_cc_infra/src/parameters/raw_param.cc:138] Variable: EmbeddingVariable on global_replica_id: 1 initialization done.
2023-04-06 08:16:41.769001: I sparse_operation_kit/kit_cc/kit_cc_infra/src/facade.cc:253] SparseOperationKit allocated internal memory.
[merlin-tensorflow-sok-org:4600 :0:4964] Caught signal 7 (Bus error: nonexistent physical address)
kanghui0204 commented 1 year ago

Hi @ZhuYuJin, sorry for the late response; a machine with two A10 cards was difficult to find. I found an 8-card A10 machine and used 2 of its cards. Unfortunately, that machine may still differ from yours: on mine, every A10 can access the others peer-to-peer, while on yours the two A10s cannot access each other, as the topology screenshot shows. (screenshot attached) I assume you followed this guide: https://github.com/NVIDIA-Merlin/HugeCTR/tree/main/sparse_operation_kit/documents/tutorials/DenseDemo, and I have tested it. The first time, I followed the guide's steps:

$ python3 gen_data.py \
    --global_batch_size=65536 \
    --slot_num=100 \
    --nnz_per_slot=10 \
    --iter_num=30 

$ python3 split_data.py \
    --filename="./data.file" \
    --split_num=8 \
    --save_prefix="./data_"

$ python3 run_sok_MirroredStrategy.py \
    --data_filename="./data.file" \
    --global_batch_size=65536 \
    --max_vocabulary_size_per_gpu=8192 \
    --slot_num=100 \
    --nnz_per_slot=10 \
    --num_dense_layers=6 \
    --embedding_vec_size=4 \
    --optimizer="adam" 

It couldn't run; it raised an out-of-memory (OOM) error.

The second time, I changed some argument values:

$ python3 gen_data.py \
    --global_batch_size=16384 \
    --slot_num=100 \
    --nnz_per_slot=10 \
    --iter_num=30 

$ python3 split_data.py \
    --filename="./data.file" \
    --split_num=8 \
    --save_prefix="./data_"

$ python3 run_sok_MirroredStrategy.py \
    --data_filename="./data.file" \
    --global_batch_size=16384 \
    --max_vocabulary_size_per_gpu=8192 \
    --slot_num=100 \
    --nnz_per_slot=10 \
    --num_dense_layers=6 \
    --embedding_vec_size=4 \
    --optimizer="adam" 

and it ran well.
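A back-of-envelope estimate (assuming int64 keys and float32 embedding vectors; this ignores dense-layer activations, optimizer state, and SOK's internal buffers) shows how much the per-step tensors shrink when the batch size drops from 65536 to 16384:

```python
# Rough per-step tensor sizes for the DenseDemo arguments above; a
# back-of-envelope sketch, not an exact accounting of SOK allocations.
def step_bytes(batch, slot_num=100, nnz=10, vec=4):
    key_bytes = batch * slot_num * nnz * 8          # int64 lookup keys
    emb_bytes = batch * slot_num * nnz * vec * 4    # gathered float32 vectors
    return key_bytes, emb_bytes

for batch in (65536, 16384):
    k, e = step_bytes(batch)
    print(f"batch={batch}: keys ~{k / 2**20:.0f} MiB, "
          f"gathered embeddings ~{e / 2**20:.0f} MiB")
```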

Because my machine is not the same as yours, I'm not sure whether your error is an OOM. Since I also hit an OOM by following the guide's default steps, I recommend you retry with my second set of arguments to check whether the error still exists.
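The peer-to-peer access mentioned above can be inspected with `nvidia-smi topo -m`, which prints a GPU connectivity matrix: SYS links (traffic crossing the SMP interconnect) typically cannot do CUDA peer-to-peer, while NV#/PIX links usually can. A small parser sketch over a hypothetical 2-GPU sample (the hard-coded `SAMPLE` is assumed output, not captured from a real machine):

```python
# Parse a matrix like the one printed by `nvidia-smi topo -m` and
# report the link type between each pair of GPUs. The sample below is
# hypothetical tab-separated output for a 2-GPU box with SYS links.
SAMPLE = """\
\tGPU0\tGPU1
GPU0\t X \tSYS
GPU1\tSYS\t X
"""

def gpu_links(topo_text):
    rows = [line.split("\t") for line in topo_text.strip().splitlines()]
    header = [h.strip() for h in rows[0] if h.strip().startswith("GPU")]
    links = {}
    for row in rows[1:]:
        name = row[0].strip()
        if not name.startswith("GPU"):
            continue  # skip legend/non-GPU rows
        for col, cell in zip(header, row[1:]):
            if col != name:  # the diagonal is the device itself
                links[(name, col)] = cell.strip()
    return links

print(gpu_links(SAMPLE))  # {('GPU0', 'GPU1'): 'SYS', ('GPU1', 'GPU0'): 'SYS'}
```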

EmmaQiaoCh commented 1 year ago

Hi @ZhuYuJin , was the problem resolved?

kanghui0204 commented 1 year ago

Because this issue has had no response for a long time, I will close it. @ZhuYuJin, if you still have a problem, please reopen it. Thank you.