Closed ZhuYuJin closed 1 year ago
Hi @ZhuYuJin, I think the problem is probably that the hostname merlin-tensorflow_sok contains "_". The wiki says `a hostname may not contain other characters, such as the underscore character (_)`, so could you change your hostname and try again? FYI @bashimao
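To check a candidate container or host name before restarting, something like the following may help. It is a minimal sketch of the usual hostname label rule (letters, digits and hyphens only, 1 to 63 characters per label, no leading or trailing hyphen); `is_valid_hostname` is a hypothetical helper written for this thread, not part of SOK or HugeCTR.

```python
import re

# One hostname label: letters, digits, hyphens; 1-63 chars;
# must not start or end with a hyphen. Underscores are rejected.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_valid_hostname(name: str) -> bool:
    """Return True if every dot-separated label of `name` is well formed.

    Hypothetical helper for illustration only.
    """
    if not name or len(name) > 253:
        return False
    return all(_LABEL.fullmatch(label) for label in name.split("."))


if __name__ == "__main__":
    for candidate in ("merlin-tensorflow_sok", "merlin-tensorflow-sok-org"):
        print(candidate, is_valid_hostname(candidate))
```

The first name is the one from this issue and is rejected because of the underscore; the second uses only hyphens and passes.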
The error is gone after I changed my container name.
But I encountered another error when I ran run_sok_MirroredStrategy.py. My test node has two NVIDIA A10s.
...
2023-04-06 08:16:30.301888: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:776] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_2"
op: "TensorSliceDataset"
input: "Placeholder/_0"
input: "Placeholder/_1"
attr {
key: "Toutput_types"
value {
list {
type: DT_INT64
type: DT_INT64
}
}
}
attr {
key: "_cardinality"
value {
i: 1966080
}
}
attr {
key: "is_files"
value {
b: false
}
}
attr {
key: "metadata"
value {
s: "\n\024TensorSliceDataset:0"
}
}
attr {
key: "output_shapes"
value {
list {
shape {
dim {
size: 100
}
dim {
size: 10
}
}
shape {
dim {
size: 1
}
}
}
}
}
experimental_type {
type_id: TFT_PRODUCT
args {
type_id: TFT_DATASET
args {
type_id: TFT_PRODUCT
args {
type_id: TFT_TENSOR
args {
type_id: TFT_INT64
}
}
args {
type_id: TFT_TENSOR
args {
type_id: TFT_INT64
}
}
}
}
}
You are using the plugin with MirroredStrategy.
2023-04-06 08:16:31.768991: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:107] Mapping from local_replica_id to device_id:
2023-04-06 08:16:31.768991: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:109] 0 -> 0
2023-04-06 08:16:31.768991: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:109] 1 -> 1
2023-04-06 08:16:31.768991: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:84] Global seed is 782420126
2023-04-06 08:16:31.768991: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:85] Local GPU Count: 2
2023-04-06 08:16:31.768991: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:86] Global GPU Count: 2
2023-04-06 08:16:31.768991: I 2023-04-06 08:16:31.768991: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:127] Global Replica Id: 1; Local Replica Id: 1
sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:127] Global Replica Id: 0; Local Replica Id: 0
2023-04-06 08:16:33.768993: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:200] Not all peer to peer access enabled.
2023-04-06 08:16:34.768994: I sparse_operation_kit/kit_cc/kit_cc_infra/src/parameters/raw_manager.cc:132] Created embedding variable whose name is EmbeddingVariable
2023-04-06 08:16:34.768994: I sparse_operation_kit/kit_cc/kit_cc_infra/src/parameters/raw_manager.cc:132] Created embedding variable whose name is EmbeddingVariable/replica_1/
2023-04-06 08:16:41.769001: I sparse_operation_kit/kit_cc/kit_cc_infra/src/parameters/raw_param.cc:121] Variable: EmbeddingVariable on global_replica_id: 0 start initialization
2023-04-06 08:16:41.769001: I sparse_operation_kit/kit_cc/kit_cc_infra/src/parameters/raw_param.cc:138] Variable: EmbeddingVariable on global_replica_id: 0 initialization done.
2023-04-06 08:16:41.769001: I sparse_operation_kit/kit_cc/kit_cc_infra/src/parameters/raw_param.cc:121] Variable: EmbeddingVariable on global_replica_id: 1 start initialization
2023-04-06 08:16:41.769001: I sparse_operation_kit/kit_cc/kit_cc_infra/src/parameters/raw_param.cc:138] Variable: EmbeddingVariable on global_replica_id: 1 initialization done.
2023-04-06 08:16:41.769001: I sparse_operation_kit/kit_cc/kit_cc_infra/src/facade.cc:253] SparseOperationKit allocated internal memory.
[merlin-tensorflow-sok-org:4600 :0:4964] Caught signal 7 (Bus error: nonexistent physical address)
Hi @ZhuYuJin, sorry for the late reply. A machine with exactly two A10 cards is hard to find, so I used two cards of an 8-card A10 machine instead. Unfortunately, this machine may still differ from yours: here every A10 card can access the others, but on your machine the two A10s cannot access each other, as you can see in the image below:
I think you followed this guide for your test: https://github.com/NVIDIA-Merlin/HugeCTR/tree/main/sparse_operation_kit/documents/tutorials/DenseDemo, and I have tested this guide as well. The first time, I followed the steps in the guide:
$ python3 gen_data.py \
--global_batch_size=65536 \
--slot_num=100 \
--nnz_per_slot=10 \
--iter_num=30
$ python3 split_data.py \
--filename="./data.file" \
--split_num=8 \
--save_prefix="./data_"
$ python3 run_sok_MirroredStrategy.py \
--data_filename="./data.file" \
--global_batch_size=65536 \
--max_vocabulary_size_per_gpu=8192 \
--slot_num=100 \
--nnz_per_slot=10 \
--num_dense_layers=6 \
--embedding_vec_size=4 \
--optimizer="adam"
This did not run; it raised an OOM error.
The second time, I changed some argument values:
$ python3 gen_data.py \
--global_batch_size=16384 \
--slot_num=100 \
--nnz_per_slot=10 \
--iter_num=30
$ python3 split_data.py \
--filename="./data.file" \
--split_num=8 \
--save_prefix="./data_"
$ python3 run_sok_MirroredStrategy.py \
--data_filename="./data.file" \
--global_batch_size=16384 \
--max_vocabulary_size_per_gpu=8192 \
--slot_num=100 \
--nnz_per_slot=10 \
--num_dense_layers=6 \
--embedding_vec_size=4 \
--optimizer="adam"
and it ran well.
Because my machine is not the same as yours, I'm not sure whether your error is an OOM. I also hit an OOM error when following the guide's default steps, so I recommend that you rerun with the settings from my second attempt to check whether the error still exists.
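For a rough sense of how much the batch size changes the data volume, here is a back-of-the-envelope sketch. It assumes that gen_data.py produces global_batch_size × iter_num samples (which matches `_cardinality` 1966080 = 65536 × 30 in the log above) and that keys of shape [slot_num, nnz_per_slot] and labels of shape [1] are both int64, as the log's `Toutput_types` and `output_shapes` suggest. `estimate_dataset_bytes` is a hypothetical helper; it only estimates the raw input arrays, not what TensorFlow or SOK actually allocate, and it does not prove that memory is the cause of the bus error.

```python
def estimate_dataset_bytes(global_batch_size, iter_num, slot_num,
                           nnz_per_slot, itemsize=8):
    """Estimate the raw size of the generated keys and labels.

    Assumption: one sample holds slot_num * nnz_per_slot int64 keys
    plus a single int64 label (itemsize=8 bytes each).
    """
    samples = global_batch_size * iter_num
    key_bytes = samples * slot_num * nnz_per_slot * itemsize
    label_bytes = samples * itemsize
    return samples, key_bytes + label_bytes


if __name__ == "__main__":
    for batch in (65536, 16384):
        samples, total = estimate_dataset_bytes(
            global_batch_size=batch, iter_num=30,
            slot_num=100, nnz_per_slot=10)
        print(f"batch={batch}: {samples} samples, "
              f"{total / 2**30:.1f} GiB of raw int64 data")
```

Under these assumptions the default settings come to roughly 14.7 GiB of raw data, and the reduced batch size to about a quarter of that.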
Hi @ZhuYuJin, was the problem resolved?
Since this issue has had no response for a long time, I will close it. @ZhuYuJin, if you still have the problem, please reopen it. Thank you.
Describe the bug
I tried to run the SOK demo https://github.com/NVIDIA-Merlin/HugeCTR/tree/main/sparse_operation_kit/documents/tutorials/DenseDemo in the Docker image provided in the documentation https://nvidia-merlin.github.io/HugeCTR/sparse_operation_kit/master/intro_link.html#installation, and encountered the following error.
To Reproduce