NVIDIA-Merlin / HugeCTR

HugeCTR is a high efficiency GPU framework designed for Click-Through-Rate (CTR) estimating training
Apache License 2.0

[Question] Does HugeCtr support H800 GPU? #414

Closed sparkling9809 closed 1 year ago

sparkling9809 commented 1 year ago

I ran embedding_test from HugeCTR on an H800 GPU, but it failed with the following exception:

root@jupyuterlab-nb-1691543551529-ddf9dcb96-jr455:/usr/local/hugectr/bin# ./embedding_test
Running main() from /hugectr/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 278 tests from 28 test suites.
[----------] Global test environment set-up.
[----------] 28 tests from distributed_sparse_embedding_hash_test
[ RUN ] distributed_sparse_embedding_hash_test.fp32_sgd_1gpu
MpiInitService: MPI was already initialized by another (non-HugeCTR) mechanism.
[HCTR][09:27:01.919][INFO][RK0][main]: Global seed is 1544237699
[HCTR][09:27:01.994][INFO][RK0][main]: Device to NUMA mapping: GPU 0 -> node 1
[HCTR][09:27:02.470][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][09:27:02.470][DEBUG][RK0][main]: [device 0] allocating 0.0000 GB, available 76.5250
[HCTR][09:27:02.470][INFO][RK0][main]: Start all2all warmup
[HCTR][09:27:02.471][INFO][RK0][main]: End all2all warmup

[HCTR][09:27:02.757][INFO][RK0][main]: train_file_list.txt done!
[HCTR][09:27:02.757][INFO][RK0][main]: ./data_reader_test_data exist

[HCTR][09:27:02.828][INFO][RK0][main]: test_file_list.txt done!
[HCTR][09:27:02.828][DEBUG][RK0][main]: [device 0] allocating 0.0012 GB, available 76.2593
[HCTR][09:27:02.828][DEBUG][RK0][main]: [device 0] allocating 0.0030 GB, available 76.2554

[HCTR][09:27:03.184][ERROR][RK0][main]: CUDA RT call "cudaGetLastError()" in line 341 of file /hugectr/HugeCTR/include/hashtable/cudf/concurrent_unordered_map.cuh failed with no kernel image is available for execution on the device (209).
root@jupyuterlab-nb-1691543551529-ddf9dcb96-jr455:/usr/local/hugectr/bin#

CUDA version: 12.2
HugeCTR Docker image: Merlin-hugectr:23.02
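
For context, CUDA error 209 is `cudaErrorNoKernelImageForDevice`: the binary contains no kernel code that matches the GPU's architecture. The H800 is a Hopper GPU with compute capability 9.0. The compatibility rule can be sketched in plain Python as below. The architecture lists in the example are hypothetical and are not the actual build flags of any HugeCTR image; `can_run` is an illustrative helper, not part of CUDA or HugeCTR.

```python
from typing import Iterable, Tuple

Capability = Tuple[int, int]  # (major, minor), e.g. (9, 0) for H800


def can_run(device: Capability,
            sass_archs: Iterable[Capability],
            ptx_archs: Iterable[Capability] = ()) -> bool:
    """Return True if a binary built for the given targets can run on `device`.

    - SASS (cubin) built for sm_XY runs on devices with the same major
      version and a minor version >= Y.
    - PTX built for compute_XY can be JIT-compiled for any device whose
      capability is >= XY.
    """
    for major, minor in sass_archs:
        if major == device[0] and minor <= device[1]:
            return True
    for arch in ptx_archs:
        if tuple(arch) <= tuple(device):
            return True
    return False


H800 = (9, 0)
# Hypothetical build that targets only Volta/Ampere SASS and ships no PTX.
old_build = [(7, 0), (8, 0)]
# Hypothetical build that also targets Hopper.
new_build = [(7, 0), (8, 0), (9, 0)]

print(can_run(H800, old_build))                      # False
print(can_run(H800, new_build))                      # True
print(can_run(H800, old_build, ptx_archs=[(8, 0)]))  # True
```

Under these assumptions, a binary without Hopper SASS or any embedded PTX fails on an H800 with exactly this kind of "no kernel image" error, while a build that includes sm_90 runs.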

EmmaQiaoCh commented 1 year ago

Hi, thanks for trying HugeCTR. Could you use our latest image 23.06? Thanks.

sparkling9809 commented 1 year ago

Yes. The problem above was solved when I changed the image version to 23.06. But when I run training on multiple H800 GPUs, there is a new problem:

[HCTR][06:01:46.761][INFO][RK0][main]: --------------------Epoch 0, source file: /root/gq/2.txt--------------------
[HCTR][06:01:46.826][INFO][RK0][main]: Preparing embedding table for next pass
[HCTR][06:01:47.696][ERROR][RK0][main]: Runtime error: an illegal memory access was encountered
cudaStreamSynchronize(local_gpu->get_stream()) at sync_all_gpus (/hugectr/HugeCTR/src/embeddings/sync_all_gpus_functor.cu:28)
[HCTR][06:01:47.696][ERROR][RK0][main]: Runtime error: an illegal memory access was encountered
cudaStreamSynchronize(local_gpu->get_stream()) at sync_all_gpus (/hugectr/HugeCTR/src/embeddings/sync_all_gpus_functor.cu:28)
terminate called after throwing an instance of 'cudf::fatal_cuda_error'
what(): Fatal CUDA error encountered at: /opt/rapids/src/cudf/cpp/include/cudf/detail/utilities/pinned_allocator.hpp:170: 700 cudaErrorIllegalAddress an illegal memory access was encountered
[jupyuterlab-nb-1691568254333-866cf6f4f-855sl:02075] Process received signal
[jupyuterlab-nb-1691568254333-866cf6f4f-855sl:02075] Signal: Aborted (6)
[jupyuterlab-nb-1691568254333-866cf6f4f-855sl:02075] Signal code: (-6)
Traceback (most recent call last):
File "dcn_init_train.py", line 175, in

I found this suggestion in the HugeCTR README:

NOTE: HugeCTR uses NCCL to share data between ranks, and NCCL may require shared memory for IPC and pinned (page-locked) system memory resources. It is recommended that you increase these resources by issuing the following options in the docker run command.

--shm-size=1g --ulimit memlock=-1

I tried this, but the problem persists.
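
One way to confirm that the shared-memory and memlock options actually took effect is to query them from inside the running container. The following is a minimal sketch using only the Python standard library on Linux; `container_ipc_limits` is a hypothetical helper name, and the check only reports the limits, it does not diagnose the illegal memory access itself.

```python
import os
import resource
import shutil


def container_ipc_limits(shm_path: str = "/dev/shm") -> dict:
    """Report the shared-memory size and the locked-memory limit of this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    # /dev/shm is the tmpfs mount whose size Docker's shared-memory option controls.
    shm_total = shutil.disk_usage(shm_path).total if os.path.isdir(shm_path) else None
    return {
        "shm_total_bytes": shm_total,
        "memlock_soft_unlimited": soft == resource.RLIM_INFINITY,
        "memlock_hard_unlimited": hard == resource.RLIM_INFINITY,
    }


if __name__ == "__main__":
    for key, value in container_ipc_limits().items():
        print(f"{key}: {value}")
```

If the container was started with a 1g shared-memory size and an unlimited memlock ulimit, `shm_total_bytes` should be about 1 GiB and both memlock entries should be `True`. Anything else suggests the options were not passed through to the container.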

sparkling9809 commented 1 year ago

Is there any progress on this question?

shijieliu commented 1 year ago

Hi @sparkling9809, which training script are you using? From the log I can tell you are using the Embedding Training Cache. Is this expected?

If you want to try a sample that does not require the Embedding Training Cache, we advise you to try EmbeddingCollection, which is our latest embedding implementation; the old ones will be deprecated in the future. Here is the doc and sample for reference.

sparkling9809 commented 1 year ago

Thanks for your reply!

The training script is as follows:


import argparse
import hugectr
from mpi4py import MPI
import time
from tools.utils import Log
logger = Log(__name__).getlog()

arg_parser = argparse.ArgumentParser(description="model train")
arg_parser.add_argument("--features_num", type=int, required=True)
arg_parser.add_argument("--check", type=str, required=True)
args = arg_parser.parse_args()
total_num = args.features_num*16

all_solt = [10000,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,
            50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,
            50,50,110000000,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,
            50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,
            10000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
# for i in range(len(all_solt)):
#     if all_solt[i] < 1:
#         all_solt[i] = 10000
# all_solt[-2] = 0
All_Solt = all_solt[:args.features_num]
# All_Solt = all_solt
if args.check == "solo":
    # Source = ["./testdata0717/train_solo/0.txt", "./testdata0717/train_solo/1.txt", "./testdata0717/train_solo/2.txt"]
    # Keyset = ["./testdata0717/train_solo/0.keyset", "./testdata0717/train_solo/1.keyset", "./testdata0717/train_solo/2.keyset"]
    # Source = ["./testdata0717/train_all_mini/0.txt"]
    # Keyset = ["./testdata0717/train_all_mini/all.keyset"]
    # Source = ["/home/workspace/hy/tmp/"+str(i)+".txt" for i in range(3)]
    # Keyset = ["/home/workspace/hy/tmp/" + str(i) + ".keyset" for i in range(3)]
    Source=["/root/91feature_data/91_keyset/"+str(i)+".txt" for i in range(3)]
    Keyset = ["/root/91feature_data/91_keyset/" + str(i) + ".keyset" for i in range(3) ]
elif args.check == "all":
    # Source = ["./testdata0717/train_all/all.txt"]
    # Keyset = ["./testdata0717/train_all/all.keyset"]
    Source = ["/root/91_keyset/0.txt"]
    Keyset = ["/root/91_keyset/0.keyset"]
else:
    raise ValueError("invalid check type, please enter solo or all")
# logger.info(f"number of features: {args.features_num}")

solver = hugectr.CreateSolver(model_name = "wd2kw_seq",
                              max_eval_batches = 5000,
                              batchsize_eval = 36000,
                              # batchsize = 10240,
                              batchsize = 36000,
                              #batchsize = 1000,
                              lr = 0.001, 
                              vvgpu = [[0,1,2]],
                              i64_input_key = True,
                              use_mixed_precision = False,
                              repeat_dataset = False,
                              use_cuda_graph = True
                              # kafka_brockers = "10.68.225.168:9092,10.68.226.229:9092,10.68.227.181:9092"
                             )

reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Parquet,
                                  source = Source, keyset = Keyset,
                                  eval_source="/root/91feature_data/91_keyset/eval.txt",
                                  num_workers=30,
                                  slot_size_array=All_Solt,
                                  check_type = hugectr.Check_t.Sum)
# reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Parquet,
#                                   source = Source, keyset = Keyset,
#                                   eval_source="/home/workspace/hy/tmp/0.txt",
#                                   num_workers=30,
#                                   slot_size_array=All_Solt,
#                                   check_type = hugectr.Check_t.Sum)

optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.Adam)

etc = hugectr.CreateETC(ps_types = [hugectr.TrainPSType_t.Staged],
                        sparse_models = ["/root/wd2kw_seq_0_sparse_model"],
                        local_paths = ["/root/"])

model = hugectr.Model(solver, reader, optimizer, etc)
model.add(hugectr.Input(label_dim = 1, label_name = "if_click",
                        dense_dim = 0, dense_name = "dense",
                        data_reader_sparse_param_array = 
                        [hugectr.DataReaderSparseParam("data1", 1, False, args.features_num)]))

model.add(
    hugectr.SparseEmbedding(
        embedding_type=hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
        workspace_size_per_gpu_in_mb=26700,
        embedding_vec_size=16,
        combiner="sum",
        sparse_embedding_name="sparse_embedding1",
        bottom_name="data1",
        optimizer=optimizer,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Reshape,
        bottom_names=["sparse_embedding1"],
        top_names=["reshape1"],
        leading_dim=total_num,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.MultiCross,
        bottom_names=["reshape1"],
        top_names=["multicross1"],
        num_layers=6,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["reshape1"],
        top_names=["fc1"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc1"], top_names=["relu1"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Dropout,
        bottom_names=["relu1"],
        top_names=["dropout1"],
        dropout_rate=0.5,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["dropout1"],
        top_names=["fc2"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc2"], top_names=["relu2"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Dropout,
        bottom_names=["relu2"],
        top_names=["dropout2"],
        dropout_rate=0.5,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Concat,
        bottom_names=["dropout2", "multicross1"],
        top_names=["concat2"],
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["concat2"],
        top_names=["fc3"],
        num_output=1,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.BinaryCrossEntropyLoss,
        bottom_names=["fc3", "if_click"],
        top_names=["loss"],
    )
)

model.compile()
model.summary()
model.graph_to_json(graph_config_file = "wd2kw_seq.json")
#model.save_params_to_files("wdl")
model.fit(num_epochs = 1, display = 500, eval_interval = 100)

model.save_params_to_files("/root/wd2kw_seq")

The script runs fine on 8 H800 GPUs in a single machine. But it fails when source includes more than 3 files.

reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Parquet,
                                  source = Source,  # the failure occurs when source contains more than 3 files
                                  keyset = Keyset,
                                  eval_source="/root/91feature_data/91_keyset/eval.txt",
                                  num_workers=30,
                                  slot_size_array=All_Solt,
                                  check_type = hugectr.Check_t.Sum)

The exception is as follows:

[HCTR][06:01:46.761][INFO][RK0][main]: --------------------Epoch 0, source file: /root/gq/2.txt--------------------
[HCTR][06:01:46.826][INFO][RK0][main]: Preparing embedding table for next pass
[HCTR][06:01:47.696][ERROR][RK0][main]: Runtime error: an illegal memory access was encountered
cudaStreamSynchronize(local_gpu->get_stream()) at sync_all_gpus (/hugectr/HugeCTR/src/embeddings/sync_all_gpus_functor.cu:28)
[HCTR][06:01:47.696][ERROR][RK0][main]: Runtime error: an illegal memory access was encountered
cudaStreamSynchronize(local_gpu->get_stream()) at sync_all_gpus (/hugectr/HugeCTR/src/embeddings/sync_all_gpus_functor.cu:28)
terminate called after throwing an instance of 'cudf::fatal_cuda_error'
what(): Fatal CUDA error encountered at: /opt/rapids/src/cudf/cpp/include/cudf/detail/utilities/pinned_allocator.hpp:170: 700 cudaErrorIllegalAddress an illegal memory access was encountered
[jupyuterlab-nb-1691568254333-866cf6f4f-855sl:02075] Process received signal
[jupyuterlab-nb-1691568254333-866cf6f4f-855sl:02075] Signal: Aborted (6)
[jupyuterlab-nb-1691568254333-866cf6f4f-855sl:02075] Signal code: (-6)
Traceback (most recent call last):
File "dcn_init_train.py", line 175,
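
Since the failure depends on how many files are listed in source, one simple thing to rule out is a mismatch between the source and keyset lists. The sketch below assumes, as the script above does, that each `i.txt` has a companion `i.keyset` in the same directory. `build_file_lists` is a hypothetical helper, not a HugeCTR function, and passing this check does not explain the illegal memory access; it only excludes missing or empty inputs.

```python
import os
from typing import List, Tuple


def build_file_lists(data_dir: str, num_files: int) -> Tuple[List[str], List[str]]:
    """Build matching source/keyset lists and fail early on missing or empty files."""
    source = [os.path.join(data_dir, f"{i}.txt") for i in range(num_files)]
    keyset = [os.path.join(data_dir, f"{i}.keyset") for i in range(num_files)]
    # Collect every path that does not exist or has zero size.
    problems = [p for p in source + keyset
                if not os.path.isfile(p) or os.path.getsize(p) == 0]
    if problems:
        raise FileNotFoundError(f"missing or empty files: {problems}")
    return source, keyset
```

For example, `Source, Keyset = build_file_lists("/root/91feature_data/91_keyset", 4)` would raise before training starts if `3.txt` or `3.keyset` were absent, and otherwise returns two lists of equal length to pass to `DataReaderParams`.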

JacoCheung commented 1 year ago

Closed as a duplicate of #417.