NVIDIA-Merlin / HugeCTR

HugeCTR is a high efficiency GPU framework designed for Click-Through-Rate (CTR) estimating training
Apache License 2.0

[BUG] Encountered ETC error of din model when training with multiple keyset. #429

Closed dusir closed 11 months ago

dusir commented 1 year ago

Describe the bug

We were trying to enable the Embedding Training Cache (ETC) based on the DIN sample in the HugeCTR repo and train with our in-house data.

However, we found that when the dataset was preprocessed into multiple sources, for example

source = ['/root/keyset_dir/din_1k_seq3_v1/0/0.txt', '/root/keyset_dir/din_1k_seq3_v1/1/0.txt']
keyset = ['/root/keyet/0.keyset', '/root/keyet/0.keyset']

the following error occurred:

[HCTR][06:21:09.120][INFO][RK0][main]: synchronize  done.
[HCTR][06:21:10.185][ERROR][RK0][main]: Runtime error: invalid argument
    cudaMemPrefetchAsync(uvm_key_per_gpu[id], key_size_in_B, ((int)-1), embedding_data_.get_local_gpu(id).get_stream()) at load_parameters (/home/HugeCTR/HugeCTR/src/embeddings/distributed_slot_sparse_embedding_hash.cu:487)
[HCTR][06:21:10.186][ERROR][RK0][main]: Runtime error: invalid argument
    cudaMemPrefetchAsync(uvm_key_per_gpu[id], key_size_in_B, ((int)-1), embedding_data_.get_local_gpu(id).get_stream()) at load_parameters (/home/HugeCTR/HugeCTR/src/embeddings/distributed_slot_sparse_embedding_hash.cu:487)

It worked well with a single source.

We previously hit a different error with multiple sources (see the linked issue).
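
As a quick sanity check before training, the `source` and `keyset` lists can be validated against each other. The helper below is a hypothetical sketch and not part of HugeCTR: `check_sources_and_keysets` is an illustrative name, and it only assumes that each source is expected to have one matching keyset file. It reports the repeated keyset path in the example above; whether that repetition is related to the `cudaMemPrefetchAsync` error is not established in this thread.

```python
import os
from collections import Counter


def check_sources_and_keysets(source, keyset, check_exists=True):
    """Return a list of human-readable problems found in the two lists.

    Hypothetical helper for illustration; HugeCTR does not provide it.
    """
    problems = []

    # Assumption: one keyset file per source file.
    if len(source) != len(keyset):
        problems.append(
            f"length mismatch: {len(source)} sources vs {len(keyset)} keysets"
        )

    # Flag any path that appears more than once in either list.
    for name, paths in (("source", source), ("keyset", keyset)):
        for path, count in Counter(paths).items():
            if count > 1:
                problems.append(f"{name} path listed {count} times: {path}")

    # Optionally confirm that every file is actually present on disk.
    if check_exists:
        for path in list(source) + list(keyset):
            if not os.path.isfile(path):
                problems.append(f"missing file: {path}")

    return problems


# The lists exactly as given in the report above.
source = ['/root/keyset_dir/din_1k_seq3_v1/0/0.txt', '/root/keyset_dir/din_1k_seq3_v1/1/0.txt']
keyset = ['/root/keyet/0.keyset', '/root/keyet/0.keyset']

# check_exists=False so the sketch runs on a machine without these files.
for problem in check_sources_and_keysets(source, keyset, check_exists=False):
    print(problem)
```

Running the check with `check_exists=True` on the training machine would additionally confirm that every listed file is present.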

To Reproduce

Run the script.

JacoCheung commented 1 year ago

The error messages:

[HCTR][06:21:09.120][INFO][RK0][main]: synchronize  done.
[HCTR][06:21:10.185][ERROR][RK0][main]: Runtime error: invalid argument
    cudaMemPrefetchAsync(uvm_key_per_gpu[id], key_size_in_B, ((int)-1), embedding_data_.get_local_gpu(id).get_stream()) at load_parameters (/home/HugeCTR/HugeCTR/src/embeddings/distributed_slot_sparse_embedding_hash.cu:487)
[HCTR][06:21:10.186][ERROR][RK0][main]: Runtime error: invalid argument
    cudaMemPrefetchAsync(uvm_key_per_gpu[id], key_size_in_B, ((int)-1), embedding_data_.get_local_gpu(id).get_stream()) at load_parameters (/home/HugeCTR/HugeCTR/src/embeddings/distributed_slot_sparse_embedding_hash.cu:487)

dusir commented 1 year ago

The test command is as follows:

din_seq3.py --model_name din_1k_seq3_v3_modify_v2 --keyset_dir '/root/keyset_dir' --batch_size 36000 --batchsize_eval 36000 --gpus '0,1,2,3,4,5,6,7' --train_dir '/data' --start_date '20231012' --end_date '20231013' --datePath '20231107' --workspace_size_per_gpu_in_mb 1200 --num_workers 30

The file din_seq3.py is the same as samples/din/din_parquet.py; we only added some arguments for testing.
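
The modified script was not attached, so the following is only a minimal sketch of how the extra arguments might be parsed, assuming `argparse`. The flag names are taken from the command above; the types, defaults, and the `build_parser` name are guesses made for illustration.

```python
import argparse
import shlex


def build_parser():
    """Build a parser for the flags seen in the test command (types are assumed)."""
    parser = argparse.ArgumentParser(
        description="DIN training with extra test arguments (hypothetical sketch)"
    )
    parser.add_argument("--model_name", type=str, required=True)
    parser.add_argument("--keyset_dir", type=str, required=True)
    parser.add_argument("--batch_size", type=int, default=36000)
    parser.add_argument("--batchsize_eval", type=int, default=36000)
    # Comma-separated GPU ids, e.g. '0,1,2,3' -> [0, 1, 2, 3].
    parser.add_argument(
        "--gpus",
        type=lambda s: [int(g) for g in s.split(",")],
        default=[0],
    )
    parser.add_argument("--train_dir", type=str, required=True)
    # Dates are kept as strings in YYYYMMDD form, as they appear in the command.
    parser.add_argument("--start_date", type=str, required=True)
    parser.add_argument("--end_date", type=str, required=True)
    parser.add_argument("--datePath", type=str, required=True)
    parser.add_argument("--workspace_size_per_gpu_in_mb", type=int, default=1200)
    parser.add_argument("--num_workers", type=int, default=30)
    return parser


# Parse the exact argument string from the command above.
cmd = (
    "--model_name din_1k_seq3_v3_modify_v2 --keyset_dir '/root/keyset_dir' "
    "--batch_size 36000 --batchsize_eval 36000 --gpus '0,1,2,3,4,5,6,7' "
    "--train_dir '/data' --start_date '20231012' --end_date '20231013' "
    "--datePath '20231107' --workspace_size_per_gpu_in_mb 1200 --num_workers 30"
)
args = build_parser().parse_args(shlex.split(cmd))
print(args.gpus, args.batch_size, args.workspace_size_per_gpu_in_mb)
```

How these values are passed on to the HugeCTR solver and data reader is not shown in the thread, so that part is left out of the sketch.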

JacoCheung commented 11 months ago

Closing, as ETC is already deprecated.