Closed dusir closed 11 months ago
The error messages:
[HCTR][06:21:09.120][INFO][RK0][main]: synchronize done.
[HCTR][06:21:10.185][ERROR][RK0][main]: Runtime error: invalid argument
cudaMemPrefetchAsync(uvm_key_per_gpu[id], key_size_in_B, ((int)-1), embedding_data_.get_local_gpu(id).get_stream()) at load_parameters (/home/HugeCTR/HugeCTR/src/embeddings/distributed_slot_sparse_embedding_hash.cu:487)
[HCTR][06:21:10.186][ERROR][RK0][main]: Runtime error: invalid argument
cudaMemPrefetchAsync(uvm_key_per_gpu[id], key_size_in_B, ((int)-1), embedding_data_.get_local_gpu(id).get_stream()) at load_parameters (/home/HugeCTR/HugeCTR/src/embeddings/distributed_slot_sparse_embedding_hash.cu:487)
The test command is as follows:
din_seq3.py --model_name din_1k_seq3_v3_modify_v2 --keyset_dir '/root/keyset_dir' --batch_size 36000 --batchsize_eval 36000 --gpus '0,1,2,3,4,5,6,7' --train_dir '/data' --start_date '20231012' --end_date '20231013' --datePath '20231107' --workspace_size_per_gpu_in_mb 1200 --num_workers 30
The file din_seq3.py is the same as samples/din/din_parquet.py; we just added some arguments for testing.
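For context, `cudaMemPrefetchAsync` returns "invalid argument" (`cudaErrorInvalidValue`) when the pointer being prefetched was not allocated as managed (UVM) memory via `cudaMallocManaged`; the `((int)-1)` in the trace is presumably `cudaCpuDeviceId`. A minimal standalone sketch of the failing pattern (hypothetical illustration, not HugeCTR code — `managed`, `plain`, and the sizes are made up for the example):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t bytes = 1 << 20;

    // Managed (UVM) allocation: prefetching this pointer is valid on
    // devices that support unified memory (may still fail on systems
    // without concurrentManagedAccess support).
    float* managed = nullptr;
    cudaMallocManaged(&managed, bytes);

    // Plain device allocation: prefetching this pointer is invalid and
    // cudaMemPrefetchAsync reports "invalid argument".
    float* plain = nullptr;
    cudaMalloc(&plain, bytes);

    cudaError_t ok  = cudaMemPrefetchAsync(managed, bytes, cudaCpuDeviceId, 0);
    cudaError_t bad = cudaMemPrefetchAsync(plain,   bytes, cudaCpuDeviceId, 0);

    printf("managed: %s\n", cudaGetErrorString(ok));
    printf("plain:   %s\n", cudaGetErrorString(bad));

    cudaFree(plain);
    cudaFree(managed);
    return 0;
}
```

So one plausible reading of the log is that with multiple data sources, `uvm_key_per_gpu[id]` ends up pointing at memory that is not a managed allocation (or has a stale size), making the prefetch call reject its arguments.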
Closing, as ETC is already deprecated.
Describe the bug
We were trying to enable ETC based on the DIN sample in the HugeCTR repo and train with our in-house data.
However, we found that when the dataset was preprocessed into multiple sources, the error above would occur.
It worked well with a single source.
We previously faced a different error with multiple sources; see the linked issue.
To Reproduce
Run the script above.
Hardware: H800
Container: hugectr 23.06 & 23.09