NVIDIA-Merlin / HugeCTR

HugeCTR is a high-efficiency GPU framework designed for Click-Through-Rate (CTR) estimation training
Apache License 2.0

[Question] When setting use_mixed_precision=True, WDL training does not converge. #393

Closed: zpcalan closed this issue 1 year ago

zpcalan commented 1 year ago

Hi, developers of HugeCTR. I create the solver with:

solver = hugectr.CreateSolver(
    max_eval_batches=300,
    batchsize_eval=16384,
    batchsize=16384,
    lr=0.001,
    vvgpu=[[0]],
    repeat_dataset=False,
    i64_input_key=True,
    use_mixed_precision=True
)

I train WDL on the day0 Criteo data with ETC (the Embedding Training Cache). After several iterations, the exception below is thrown:

[HCTR][11:19:55.275][INFO][RK0][main]: Eval Time for 300 iters: 3.4955s
[HCTR][11:20:08.988][INFO][RK0][main]: Iter: 7200 Time(200 iters): 17.1874s Loss: 0.136492 lr:0.001
[HCTR][11:20:22.701][INFO][RK0][main]: Iter: 7400 Time(200 iters): 13.6779s Loss: 0.129374 lr:0.001
[HCTR][11:20:36.423][INFO][RK0][main]: Iter: 7600 Time(200 iters): 13.6861s Loss: 0.131023 lr:0.001
[HCTR][11:20:50.775][INFO][RK0][main]: Iter: 7800 Time(200 iters): 14.3157s Loss: 0.129374 lr:0.001
[HCTR][11:21:05.579][INFO][RK0][main]: Iter: 8000 Time(200 iters): 14.7671s Loss: 0.142044 lr:0.001
[HCTR][11:21:09.291][INFO][RK0][main]: Evaluation, AUC: 0.717902
[HCTR][11:21:09.291][INFO][RK0][main]: Eval Time for 300 iters: 3.70926s
[HCTR][11:21:23.835][INFO][RK0][main]: Iter: 8200 Time(200 iters): 18.216s Loss: 0.131776 lr:0.001
[HCTR][11:21:38.619][INFO][RK0][main]: Iter: 8400 Time(200 iters): 14.7486s Loss: 0.138201 lr:0.001
[HCTR][11:21:55.297][INFO][RK0][main]: Iter: 8600 Time(200 iters): 16.6417s Loss: 0.1373 lr:0.001
Traceback (most recent call last):
  File "train.py", line 265, in <module>
    model.fit(max_iter=2300, display=200, eval_interval=1000, snapshot=1000000, snapshot_prefix="wdl", num_epochs=1)
RuntimeError: Train Runtime error: Loss cannot converge /root/HugeCTR/HugeCTR/src/pybind/model.cpp:2162

When setting use_mixed_precision to False, everything seems fine.

I checked the code and found that HugeCTR stores the Adam optimizer's m and v in the half data type when use_mixed_precision=True:

struct SparseEmbeddingHashParams;
template <typename TypeEmbeddingComp>
struct OptimizerTensor {
  Tensor2<TypeEmbeddingComp> opt_z_tensors_;  // FTRL z variable.
  Tensor2<TypeEmbeddingComp> opt_n_tensors_;  // FTRL n variable.
  Tensor2<TypeEmbeddingComp>
      opt_m_tensors_; /**< The mi variable storage for adam optimizer in the update_params(). */
  Tensor2<TypeEmbeddingComp>
      opt_v_tensors_; /**< The vi variable storage for adam optimizer in the update_params(). */
  Tensor2<uint64_t> opt_prev_time_tensors_; /**< The previous update time storage for lazy adam
                                                  in update_params(). */
  Tensor2<TypeEmbeddingComp> opt_momentum_tensors_; /**< The momentum variable storage
                                           for the momentum optimizer in the update_params(). */
  Tensor2<TypeEmbeddingComp> opt_accm_tensors_;     /**< The accm variable storage for the
                                                         nesterov optimizer in the update_params(). */
};

The size of the type TypeEmbeddingComp is 2 bytes here. So I changed it to float32 with some other code changes, but the result still does not converge. Can anybody tell me which parameter I should adjust so that I can get correct results when setting use_mixed_precision=True?
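For illustration (plain NumPy, not HugeCTR code), my guess at why half-precision optimizer state could hurt: fp16 has only about 10 mantissa bits, so a small Adam update added to a larger m/v accumulator can be rounded away entirely:

import numpy as np

# fp16 spacing around 1.0 is 2**-10 ~= 0.00098, so an update much smaller than
# that is rounded away when accumulated into an fp16 optimizer state.
v16 = np.float16(1.0)
update16 = np.float16(4e-4)
print(v16 + update16 == v16)                    # True: the update is lost in fp16

v32 = np.float32(1.0)
print(v32 + np.float32(4e-4) == v32)            # False: fp32 keeps the update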

I can also offer my training script if that helps. Thank you!

kanghui0204 commented 1 year ago

Hi @zpcalan, I think we need more information:

  • Have you tried running without ETC? If you have, does it run well?
  • How did you preprocess the dataset for ETC? How did you generate the keyset for day()?

zpcalan commented 1 year ago

Hi @zpcalan, I think we need more information:

  • Have you tried running without ETC? If you have, does it run well?
  • How did you preprocess the dataset for ETC? How did you generate the keyset for day()?

I appreciate your quick reply!

  1. Yes, I ran without ETC but it still does not converge. The exception is the same.
  2. I preprocessed the day0~day23 data using bash preprocess.sh $i /root/HugeCTR/etc_data/day$i nvt 1 0 1, as in this link. As for the keyset, I ran this command:
    cmd_str="python generate_keyset.py  --src_dir_path ./etc_data/day"+str(i)+"/train --keyset_path ./etc_data/day"+str(i)+"/train/_hugectr.keyset"
    os.system(cmd_str)

     As you can see, I generated the keyset without a slot size array. That should not affect convergence.

zpcalan commented 1 year ago

Another piece of information you might be interested in: I adjusted the preprocess.sh script so that it processes the whole day's dataset instead of only the first 5,000,000 samples. Could this be a small issue with the script?

--- a/tools/preprocess.sh
+++ b/tools/preprocess.sh
@@ -50,10 +50,11 @@ fi

 SCRIPT_TYPE=$3

-echo "Getting the first few examples from the uncompressed dataset..."
+sample_num=`wc -l day_$1|awk '{print $1}'`
+echo "Getting the first few examples from the uncompressed dataset... $sample_num"
 mkdir -p $DST_DATA_DIR/train                         && \
 mkdir -p $DST_DATA_DIR/val                           && \
-head -n 5000000 day_$1 > $DST_DATA_DIR/day_$1_small
+head -n $sample_num day_$1 > $DST_DATA_DIR/day_$1_small
 if [ $? -ne 0 ]; then
        echo "Warning: fallback to find original compressed data day_$1.gz..."
        echo "Decompressing day_$1.gz..."
@@ -62,7 +63,7 @@ if [ $? -ne 0 ]; then
                echo "Error: failed to decompress the file."
                exit 2
        fi
-       head -n 5000000 day_$1 > $DST_DATA_DIR/day_$1_small
+       head -n $sample_num day_$1 > $DST_DATA_DIR/day_$1_small
        if [ $? -ne 0 ]; then
                echo "Error: day_$1 file"
                exit 2
@@ -111,7 +112,7 @@ if [[ $SCRIPT_TYPE == "nvt" ]]; then
                --freq_limit 6                        \
                --device_limit_frac 0.5               \
                --device_pool_frac 0.5                \
-               --out_files_per_proc 8                \
+               --out_files_per_proc 20                \
                --devices "0"                         \
                --num_io_threads 2                    \
         --parquet_format=$IS_PARQUET_FORMAT   \
kanghui0204 commented 1 year ago

Hi @zpcalan, I guess you included more samples. Did you modify the workspace_size_per_gpu_in_mb or slot_size_array values? Those parameters should be increased accordingly.

zpcalan commented 1 year ago

Yes, I changed workspace_size_per_gpu_in_mb to 4048 because I set embedding_vec_size to 240. But I didn't change slot_size_array when I generated the keyset files, as this comment says. In my script, I also didn't change slot_size_array, so it's all 0s.

JacoCheung commented 1 year ago

Hi @zpcalan, I assume you are using a Parquet dataset, right? What does your slot_size_array in DataReaderParams look like? Is the script in #395 throwing this error? If so, I think the problem might be that you have left slot_size_array alone. Please refer to the doc:

slot_size_array: List[int], specify the maximum key value for each slot. Refer to the following equation. The array should be consistent with that of the sparse input. HugeCTR requires this argument for Parquet format data and RawAsync format when you want to add an offset to the input key. The default value is an empty list.

PS. Let's focus on the model without the ETC feature first.

JacoCheung commented 1 year ago

BTW, @zpcalan, have you ever tried training without mixed precision?

zpcalan commented 1 year ago

@JacoCheung Yes, the script in issue #395 throws this error as well (without ETC). And yes, I have tried training without mixed precision. The result is correct and no exception is thrown.

slot_size_array: List[int], specify the maximum key value for each slot. Refer to the following equation. The array should be consistent with that of the sparse input. HugeCTR requires this argument for Parquet format data and RawAsync format when you want to add an offset to the input key. The default value is an empty list.

I didn't set slot_size_array because I don't think I need to add an offset to the keys. I think each categorical feature of each sample is unique globally, so I don't quite understand why an offset should be added when I use one GPU card to train this model.

All of the description above is without ETC. Do you have any suggestions about this?

JacoCheung commented 1 year ago

No, I think if you're using our preprocessing script, there is no guarantee that the key ranges of two slots are disjoint. For instance, C0 and C1 have a chance to contain identical keys, e.g. [12, 12, ...].

zpcalan commented 1 year ago

No, I think if you're using our preprocessing script, there is no guarantee that the key ranges of two slots are disjoint. For instance, C0 and C1 have a chance to contain identical keys, e.g. [12, 12, ...].

Do you mean C0 and C1 of one sample could both be 12? If so, I understand that an offset must be added so that the keys of each slot in a sample are unique. But why is this happening?

I will set slot_size_array and run FP16 training. It is printed when preprocessing the dataset, right? Just like this:

Preprocessing
Train Datasets Preprocessing.....
[932326, 1066648, 831979, 24672, 14847, 7123, 19357, 4, 6469, 1268, 55, 696499, 171522, 121926, 11, 2200, 8868, 64, 4, 951, 15, 838640, 446266, 793237, 130145, 10112, 83, 34]
Valid Datasets Preprocessing.....
[932326, 1066648, 831979, 24672, 14847, 7123, 19357, 4, 6469, 1268, 55, 696499, 171522, 121926, 11, 2200, 8868, 64, 4, 951, 15, 838640, 446266, 793237, 130145, 10112, 83, 34]

So I will set slot_size_array to [932326, 1066648, 831979, 24672, 14847, 7123, 19357, 4, 6469, 1268, 55, 696499, 171522, 121926, 11, 2200, 8868, 64, 4, 951, 15, 838640, 446266, 793237, 130145, 10112, 83, 34].

JacoCheung commented 1 year ago

But why is this happening.

The uniqueness requirement derives from the embedding. You can assume that for Norm or Raw, the keys are already guaranteed to be unique by the preprocessing, but for Parquet it is the data reader's duty to add an offset to make them unique. The data preprocessing for Parquet is done per feature, so different slots do not interfere with each other. For example, C0 may indicate the user_id while C1 may indicate the item_id; nvt processes them individually, so both C0 and C1 start from 0. Sorry for the inconsistency, we're trying to improve the user experience. You can try the embedding collection, which is a unified, newer embedding type.
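To make the offset concrete, here is a small sketch (illustration only, not the actual HugeCTR implementation): the per-slot offset is effectively an exclusive prefix sum of slot_size_array, and adding it makes keys globally unique across slots:

from itertools import accumulate

# Illustration: three example slot sizes; the real array has 28 entries.
slot_size_array = [13332, 31854, 10950]
offsets = [0] + list(accumulate(slot_size_array))[:-1]   # [0, 13332, 45186]

sample_keys = [12, 12, 7]   # C0 and C1 happen to share the raw key 12
global_keys = [k + off for k, off in zip(sample_keys, offsets)]
print(global_keys)          # [12, 13344, 45193] -> no collision across slots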

It is printed when preprocessing the dataset, right?

Yes.

zpcalan commented 1 year ago

Thanks for explaining!

After the offset is added to each key, the total number of unique keys will be the sum of [932326, 1066648, 831979, 24672, 14847, 7123, 19357, 4, 6469, 1268, 55, 696499, 171522, 121926, 11, 2200, 8868, 64, 4, 951, 15, 838640, 446266, 793237, 130145, 10112, 83, 34], right?

If so, a single card's memory will not be enough for the embedding, and ETC must be introduced. My original goal is to run day0-23 ETC training with FP16. I can run FP16 training with a small dataset first, and if it converges, can I then use ETC training to run day0's dataset?
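As a rough back-of-the-envelope check on my side (assuming fp32 embedding values at 4 bytes each; Adam's m/v states and hash-table overhead would add more on top):

slot_size_array = [932326, 1066648, 831979, 24672, 14847, 7123, 19357, 4, 6469,
                   1268, 55, 696499, 171522, 121926, 11, 2200, 8868, 64, 4, 951,
                   15, 838640, 446266, 793237, 130145, 10112, 83, 34]
total_keys = sum(slot_size_array)        # about 6.1 million keys
table_bytes = total_keys * 240 * 4       # embedding_vec_size=240, fp32 values
print(total_keys, table_bytes / 2**30)   # roughly 5.5 GiB for the table alone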

zpcalan commented 1 year ago

Hi, developers of HugeCTR. I tried to run FP16 training with a small dataset in Parquet format and with slot_size_array correctly assigned, but after several steps the AUC continued to decline and finally it did not converge. The script is:

import hugectr
from mpi4py import MPI

solver = hugectr.CreateSolver(
    max_eval_batches=300,
    batchsize_eval=16384,
    batchsize=16384,
    lr=0.001,
    vvgpu=[[0]],
    repeat_dataset=False,
    i64_input_key=True,
    use_mixed_precision=True
)

reader = hugectr.DataReaderParams(
    data_reader_type=hugectr.DataReaderType_t.Parquet,
    source=["./etc_data/small_day0/train/_file_list.txt"],
    keyset = ["./etc_data/small_day0/train/_hugectr.keyset"],
    eval_source="./etc_data/small_day0/val/_file_list.txt",
    slot_size_array=[13332, 31854, 10950, 6598, 8743, 4261, 10704, 4, 5009, 991, 28, 12198, 8710, 13823, 10, 1426, 3437, 48, 4, 643, 15, 10765, 11862, 11316, 8432, 5954, 42, 33],
    check_type=hugectr.Check_t.Non,
)

optimizer = hugectr.CreateOptimizer(
    optimizer_type=hugectr.Optimizer_t.Adam,
    update_type=hugectr.Update_t.Global,
    beta1=0.9,
    beta2=0.999,
    epsilon=0.0000001,
)

model = hugectr.Model(solver, reader, optimizer)
model.add(
    hugectr.Input(
        label_dim=1,
        label_name="label",
        dense_dim=13,
        dense_name="dense",
        data_reader_sparse_param_array=[
            hugectr.DataReaderSparseParam("wide_data", 30, True, 1),
            hugectr.DataReaderSparseParam("deep_data", 2, False, 26),
        ],
    )
)
model.add(
    hugectr.SparseEmbedding(
        embedding_type=hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
        workspace_size_per_gpu_in_mb=69,
        embedding_vec_size=1,
        combiner="sum",
        sparse_embedding_name="sparse_embedding2",
        bottom_name="wide_data",
        optimizer=optimizer,
    )
)
model.add(
    hugectr.SparseEmbedding(
        embedding_type=hugectr.Embedding_t.LocalizedSlotSparseEmbeddingHash,
        workspace_size_per_gpu_in_mb=1024,
        embedding_vec_size=240,
        combiner="sum",
        sparse_embedding_name="sparse_embedding1",
        bottom_name="deep_data",
        optimizer=optimizer,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Reshape,
        bottom_names=["sparse_embedding1"],
        top_names=["reshape1"],
        leading_dim=6240,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Reshape,
        bottom_names=["sparse_embedding2"],
        top_names=["reshape2"],
        leading_dim=1,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Concat, bottom_names=["reshape1", "dense"], top_names=["concat1"]
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["concat1"],
        top_names=["fc1"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc1"], top_names=["relu1"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Dropout,
        bottom_names=["relu1"],
        top_names=["dropout1"],
        dropout_rate=0.5,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["dropout1"],
        top_names=["fc2"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc2"], top_names=["relu2"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Dropout,
        bottom_names=["relu2"],
        top_names=["dropout2"],
        dropout_rate=0.5,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["dropout2"],
        top_names=["fc3"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc3"], top_names=["relu3"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Dropout,
        bottom_names=["relu3"],
        top_names=["dropout3"],
        dropout_rate=0.5,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["dropout3"],
        top_names=["fc4"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc4"], top_names=["relu4"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Dropout,
        bottom_names=["relu4"],
        top_names=["dropout4"],
        dropout_rate=0.5,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["dropout4"],
        top_names=["fc5"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc5"], top_names=["relu5"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Dropout,
        bottom_names=["relu5"],
        top_names=["dropout5"],
        dropout_rate=0.5,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["dropout5"],
        top_names=["fc6"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc6"], top_names=["relu6"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Dropout,
        bottom_names=["relu6"],
        top_names=["dropout6"],
        dropout_rate=0.5,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["dropout6"],
        top_names=["fc7"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc7"], top_names=["relu7"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Dropout,
        bottom_names=["relu7"],
        top_names=["dropout7"],
        dropout_rate=0.5,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["dropout7"],
        top_names=["fc8"],
        num_output=1,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Add, bottom_names=["fc8", "reshape2"], top_names=["add1"]
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.BinaryCrossEntropyLoss,
        bottom_names=["add1", "label"],
        top_names=["loss"],
    )
)
model.compile()
model.summary()
model.fit(max_iter=1, display=10, eval_interval=40, snapshot=1000000, snapshot_prefix="wdl", num_epochs=100)

The result:

[HCTR][12:21:15.423][INFO][RK0][main]: Eval Time for 300 iters: 0.66509s
[HCTR][12:21:15.900][INFO][RK0][main]: Iter: 4170 Time(10 iters): 1.13876s Loss: 0.0917276 lr:0.001
[HCTR][12:21:16.438][INFO][RK0][main]: Iter: 4180 Time(10 iters): 0.533409s Loss: 0.0895059 lr:0.001
[HCTR][12:21:16.939][INFO][RK0][main]: Iter: 4190 Time(10 iters): 0.495925s Loss: 0.086796 lr:0.001
[HCTR][12:21:17.395][INFO][RK0][main]: Iter: 4200 Time(10 iters): 0.451196s Loss: 0.0831752 lr:0.001
[HCTR][12:21:18.081][INFO][RK0][main]: Evaluation, AUC: 0.602723
[HCTR][12:21:18.081][INFO][RK0][main]: Eval Time for 300 iters: 0.684147s
[HCTR][12:21:18.552][INFO][RK0][main]: Iter: 4210 Time(10 iters): 1.15236s Loss: 0.0809434 lr:0.001
[HCTR][12:21:18.684][INFO][RK0][main]: train drop incomplete batch. batchsize:10752
[HCTR][12:21:18.684][INFO][RK0][main]: -----------------------------------Epoch 43-----------------------------------
[HCTR][12:21:19.083][INFO][RK0][main]: Iter: 4220 Time(10 iters): 0.526233s Loss: 0.077393 lr:0.001
[HCTR][12:21:19.552][INFO][RK0][main]: Iter: 4230 Time(10 iters): 0.464746s Loss: 0.078123 lr:0.001
[HCTR][12:21:20.005][INFO][RK0][main]: Iter: 4240 Time(10 iters): 0.447292s Loss: 0.0808783 lr:0.001
[HCTR][12:21:20.694][INFO][RK0][main]: Evaluation, AUC: 0.611602
[HCTR][12:21:20.694][INFO][RK0][main]: Eval Time for 300 iters: 0.688137s
Traceback (most recent call last):
  File "small_wdl.py", line 267, in <module>
    model.fit(max_iter=1, display=10, eval_interval=40, snapshot=1000000, snapshot_prefix="wdl", num_epochs=1000)
RuntimeError: Train Runtime error: Loss cannot converge /hugectr/HugeCTR/src/pybind/model.cpp:2019

Is there something wrong with the script?

zpcalan commented 1 year ago

Hi, any progress on this issue? There seems to be a convergence issue when using use_mixed_precision=True.

JacoCheung commented 1 year ago

It looks like you were using epoch mode and the dataset contained about 1,642,300 (= 4210 / 42 * 16384) samples, right? Could you please post the AUC for each epoch? I'd like to know when the AUC started to drop.

In addition, we have not tried enabling fp16 training for this model; the hyperparameters may need to be subtly different from those of fp32 training, for example the scaler, learning_rate, etc. Please refer to the solver documentation for more details.

JacoCheung commented 1 year ago

Because HugeCTR does not support a dynamic scaler, the divergence issue sometimes occurs if there is an fp16 overflow (for example, the weights are too large and the GEMM produces intermediate values larger than 65,504, which become Inf, and the Inf then propagates).
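A quick illustration of that overflow behavior (plain NumPy, not HugeCTR code): the largest finite fp16 value is 65504, anything past it becomes Inf, and the Inf then propagates through later operations:

import numpy as np

print(np.finfo(np.float16).max)      # 65504.0, the largest finite fp16 value
x = np.float16(60000.0) * np.float16(2.0)
print(x)                             # inf: the product overflows fp16
print(x + np.float16(1.0))           # inf: once produced, Inf propagates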

zpcalan commented 1 year ago

It looks like you were using epoch mode and the dataset contained about 1,642,300 (= 4210 / 42 * 16384) samples, right? Could you please post the AUC for each epoch? I'd like to know when the AUC started to drop.

In addition, we have not tried enabling fp16 training for this model; the hyperparameters may need to be subtly different from those of fp32 training, for example the scaler, learning_rate, etc. Please refer to the solver documentation for more details.

I see. It seems fp16 training is not fully tested, and some hyperparameters need to be adjusted. The AUC:

[HCTR][03:41:14.346][INFO][RK0][main]: -----------------------------------Epoch 0-----------------------------------
[HCTR][03:41:17.143][INFO][RK0][main]: Evaluation, AUC: 0.614141
[HCTR][03:41:19.711][INFO][RK0][main]: Evaluation, AUC: 0.660234
[HCTR][03:41:20.493][INFO][RK0][main]: -----------------------------------Epoch 1-----------------------------------
[HCTR][03:41:22.209][INFO][RK0][main]: Evaluation, AUC: 0.539144
[HCTR][03:41:24.770][INFO][RK0][main]: Evaluation, AUC: 0.678606
[HCTR][03:41:26.449][INFO][RK0][main]: -----------------------------------Epoch 2-----------------------------------
[HCTR][03:41:27.352][INFO][RK0][main]: Evaluation, AUC: 0.682721
[HCTR][03:41:30.001][INFO][RK0][main]: Evaluation, AUC: 0.674805
[HCTR][03:41:32.636][INFO][RK0][main]: Evaluation, AUC: 0.631986
[HCTR][03:41:33.225][INFO][RK0][main]: -----------------------------------Epoch 3-----------------------------------
[HCTR][03:41:35.237][INFO][RK0][main]: Evaluation, AUC: 0.691028
[HCTR][03:41:37.849][INFO][RK0][main]: Evaluation, AUC: 0.705744
[HCTR][03:41:39.359][INFO][RK0][main]: -----------------------------------Epoch 4-----------------------------------
[HCTR][03:41:40.451][INFO][RK0][main]: Evaluation, AUC: 0.681173
[HCTR][03:41:43.091][INFO][RK0][main]: Evaluation, AUC: 0.705617
[HCTR][03:41:45.716][INFO][RK0][main]: Evaluation, AUC: 0.700359
[HCTR][03:41:46.108][INFO][RK0][main]: -----------------------------------Epoch 5-----------------------------------
[HCTR][03:41:48.327][INFO][RK0][main]: Evaluation, AUC: 0.696902
[HCTR][03:41:50.903][INFO][RK0][main]: Evaluation, AUC: 0.659046
[HCTR][03:41:52.239][INFO][RK0][main]: -----------------------------------Epoch 6-----------------------------------
[HCTR][03:41:53.527][INFO][RK0][main]: Evaluation, AUC: 0.700403
[HCTR][03:41:56.142][INFO][RK0][main]: Evaluation, AUC: 0.698039
[HCTR][03:41:58.755][INFO][RK0][main]: Evaluation, AUC: 0.688778
[HCTR][03:41:58.935][INFO][RK0][main]: -----------------------------------Epoch 7-----------------------------------
[HCTR][03:42:01.342][INFO][RK0][main]: Evaluation, AUC: 0.684459
[HCTR][03:42:03.961][INFO][RK0][main]: Evaluation, AUC: 0.689478
[HCTR][03:42:05.085][INFO][RK0][main]: -----------------------------------Epoch 8-----------------------------------
[HCTR][03:42:06.556][INFO][RK0][main]: Evaluation, AUC: 0.693916
[HCTR][03:42:09.151][INFO][RK0][main]: Evaluation, AUC: 0.691887
[HCTR][03:42:11.748][INFO][RK0][main]: Evaluation, AUC: 0.685392
[HCTR][03:42:11.748][INFO][RK0][main]: -----------------------------------Epoch 9-----------------------------------
[HCTR][03:42:14.344][INFO][RK0][main]: Evaluation, AUC: 0.682643
[HCTR][03:42:16.962][INFO][RK0][main]: Evaluation, AUC: 0.683189
[HCTR][03:42:17.884][INFO][RK0][main]: -----------------------------------Epoch 10-----------------------------------
[HCTR][03:42:19.543][INFO][RK0][main]: Evaluation, AUC: 0.6777
[HCTR][03:42:22.155][INFO][RK0][main]: Evaluation, AUC: 0.67126
[HCTR][03:42:23.952][INFO][RK0][main]: -----------------------------------Epoch 11-----------------------------------
[HCTR][03:42:24.773][INFO][RK0][main]: Evaluation, AUC: 0.676683
[HCTR][03:42:27.377][INFO][RK0][main]: Evaluation, AUC: 0.678563
[HCTR][03:42:30.015][INFO][RK0][main]: Evaluation, AUC: 0.677846
[HCTR][03:42:30.718][INFO][RK0][main]: -----------------------------------Epoch 12-----------------------------------
[HCTR][03:42:32.602][INFO][RK0][main]: Evaluation, AUC: 0.675074
[HCTR][03:42:35.200][INFO][RK0][main]: Evaluation, AUC: 0.669867
[HCTR][03:42:36.814][INFO][RK0][main]: -----------------------------------Epoch 13-----------------------------------
[HCTR][03:42:37.815][INFO][RK0][main]: Evaluation, AUC: 0.67586
[HCTR][03:42:40.420][INFO][RK0][main]: Evaluation, AUC: 0.667391
[HCTR][03:42:43.063][INFO][RK0][main]: Evaluation, AUC: 0.668702
[HCTR][03:42:43.559][INFO][RK0][main]: -----------------------------------Epoch 14-----------------------------------
[HCTR][03:42:45.641][INFO][RK0][main]: Evaluation, AUC: 0.675038
[HCTR][03:42:48.260][INFO][RK0][main]: Evaluation, AUC: 0.670299
[HCTR][03:42:49.677][INFO][RK0][main]: -----------------------------------Epoch 15-----------------------------------
[HCTR][03:42:50.854][INFO][RK0][main]: Evaluation, AUC: 0.671158
[HCTR][03:42:53.461][INFO][RK0][main]: Evaluation, AUC: 0.663839
[HCTR][03:42:56.069][INFO][RK0][main]: Evaluation, AUC: 0.659254
[HCTR][03:42:56.352][INFO][RK0][main]: -----------------------------------Epoch 16-----------------------------------
[HCTR][03:42:58.663][INFO][RK0][main]: Evaluation, AUC: 0.656877
[HCTR][03:43:01.294][INFO][RK0][main]: Evaluation, AUC: 0.663891
[HCTR][03:43:02.550][INFO][RK0][main]: -----------------------------------Epoch 17-----------------------------------
[HCTR][03:43:03.950][INFO][RK0][main]: Evaluation, AUC: 0.672403
[HCTR][03:43:06.594][INFO][RK0][main]: Evaluation, AUC: 0.665769
[HCTR][03:43:09.210][INFO][RK0][main]: Evaluation, AUC: 0.653346
[HCTR][03:43:09.284][INFO][RK0][main]: -----------------------------------Epoch 18-----------------------------------
[HCTR][03:43:11.797][INFO][RK0][main]: Evaluation, AUC: 0.648996
[HCTR][03:43:14.461][INFO][RK0][main]: Evaluation, AUC: 0.663004
[HCTR][03:43:15.481][INFO][RK0][main]: -----------------------------------Epoch 19-----------------------------------
[HCTR][03:43:17.026][INFO][RK0][main]: Evaluation, AUC: 0.667177
[HCTR][03:43:19.647][INFO][RK0][main]: Evaluation, AUC: 0.6584
[HCTR][03:43:21.530][INFO][RK0][main]: -----------------------------------Epoch 20-----------------------------------
[HCTR][03:43:22.291][INFO][RK0][main]: Evaluation, AUC: 0.653862
[HCTR][03:43:24.931][INFO][RK0][main]: Evaluation, AUC: 0.648851
[HCTR][03:43:27.568][INFO][RK0][main]: Evaluation, AUC: 0.657728
[HCTR][03:43:28.377][INFO][RK0][main]: -----------------------------------Epoch 21-----------------------------------
[HCTR][03:43:30.161][INFO][RK0][main]: Evaluation, AUC: 0.663705
[HCTR][03:43:32.785][INFO][RK0][main]: Evaluation, AUC: 0.648212
[HCTR][03:43:34.498][INFO][RK0][main]: -----------------------------------Epoch 22-----------------------------------
[HCTR][03:43:35.435][INFO][RK0][main]: Evaluation, AUC: 0.623344
[HCTR][03:43:38.061][INFO][RK0][main]: Evaluation, AUC: 0.642473
[HCTR][03:43:40.682][INFO][RK0][main]: Evaluation, AUC: 0.641618
[HCTR][03:43:41.272][INFO][RK0][main]: -----------------------------------Epoch 23-----------------------------------
[HCTR][03:43:43.283][INFO][RK0][main]: Evaluation, AUC: 0.653915
[HCTR][03:43:45.875][INFO][RK0][main]: Evaluation, AUC: 0.635579
[HCTR][03:43:47.386][INFO][RK0][main]: -----------------------------------Epoch 24-----------------------------------
[HCTR][03:43:48.470][INFO][RK0][main]: Evaluation, AUC: 0.641135
[HCTR][03:43:51.080][INFO][RK0][main]: Evaluation, AUC: 0.629272
[HCTR][03:43:53.721][INFO][RK0][main]: Evaluation, AUC: 0.637833
[HCTR][03:43:54.113][INFO][RK0][main]: -----------------------------------Epoch 25-----------------------------------
[HCTR][03:43:56.325][INFO][RK0][main]: Evaluation, AUC: 0.634068
[HCTR][03:43:58.966][INFO][RK0][main]: Evaluation, AUC: 0.632516
[HCTR][03:44:00.293][INFO][RK0][main]: -----------------------------------Epoch 26-----------------------------------
[HCTR][03:44:01.604][INFO][RK0][main]: Evaluation, AUC: 0.63881
[HCTR][03:44:04.189][INFO][RK0][main]: Evaluation, AUC: 0.629136
[HCTR][03:44:06.816][INFO][RK0][main]: Evaluation, AUC: 0.636865
[HCTR][03:44:06.995][INFO][RK0][main]: -----------------------------------Epoch 27-----------------------------------
[HCTR][03:44:09.392][INFO][RK0][main]: Evaluation, AUC: 0.630674
[HCTR][03:44:12.019][INFO][RK0][main]: Evaluation, AUC: 0.646079
[HCTR][03:44:13.148][INFO][RK0][main]: -----------------------------------Epoch 28-----------------------------------
[HCTR][03:44:14.626][INFO][RK0][main]: Evaluation, AUC: 0.652931
[HCTR][03:44:17.289][INFO][RK0][main]: Evaluation, AUC: 0.64743
[HCTR][03:44:19.890][INFO][RK0][main]: Evaluation, AUC: 0.636667
[HCTR][03:44:19.891][INFO][RK0][main]: -----------------------------------Epoch 29-----------------------------------
[HCTR][03:44:22.500][INFO][RK0][main]: Evaluation, AUC: 0.628411
[HCTR][03:44:25.026][INFO][RK0][main]: Evaluation, AUC: 0.643507
[HCTR][03:44:25.901][INFO][RK0][main]: -----------------------------------Epoch 30-----------------------------------
[HCTR][03:44:27.528][INFO][RK0][main]: Evaluation, AUC: 0.624483
[HCTR][03:44:30.098][INFO][RK0][main]: Evaluation, AUC: 0.627787
[HCTR][03:44:31.862][INFO][RK0][main]: -----------------------------------Epoch 31-----------------------------------
[HCTR][03:44:32.693][INFO][RK0][main]: Evaluation, AUC: 0.610337
[HCTR][03:44:35.310][INFO][RK0][main]: Evaluation, AUC: 0.629604
[HCTR][03:44:37.944][INFO][RK0][main]: Evaluation, AUC: 0.643551
[HCTR][03:44:38.666][INFO][RK0][main]: -----------------------------------Epoch 32-----------------------------------
[HCTR][03:44:40.527][INFO][RK0][main]: Evaluation, AUC: 0.625997
[HCTR][03:44:43.141][INFO][RK0][main]: Evaluation, AUC: 0.639994
[HCTR][03:44:44.730][INFO][RK0][main]: -----------------------------------Epoch 33-----------------------------------
[HCTR][03:44:45.742][INFO][RK0][main]: Evaluation, AUC: 0.626142
[HCTR][03:44:48.390][INFO][RK0][main]: Evaluation, AUC: 0.624291
[HCTR][03:44:51.018][INFO][RK0][main]: Evaluation, AUC: 0.62945
[HCTR][03:44:51.515][INFO][RK0][main]: -----------------------------------Epoch 34-----------------------------------
[HCTR][03:44:53.610][INFO][RK0][main]: Evaluation, AUC: 0.640553
[HCTR][03:44:56.249][INFO][RK0][main]: Evaluation, AUC: 0.634215
[HCTR][03:44:57.689][INFO][RK0][main]: -----------------------------------Epoch 35-----------------------------------
[HCTR][03:44:58.893][INFO][RK0][main]: Evaluation, AUC: 0.634598
[HCTR][03:45:01.556][INFO][RK0][main]: Evaluation, AUC: 0.618847
[HCTR][03:45:04.184][INFO][RK0][main]: Evaluation, AUC: 0.633745
[HCTR][03:45:04.461][INFO][RK0][main]: -----------------------------------Epoch 36-----------------------------------
[HCTR][03:45:06.769][INFO][RK0][main]: Evaluation, AUC: 0.634274
[HCTR][03:45:09.360][INFO][RK0][main]: Evaluation, AUC: 0.631094
[HCTR][03:45:10.607][INFO][RK0][main]: -----------------------------------Epoch 37-----------------------------------
[HCTR][03:45:11.965][INFO][RK0][main]: Evaluation, AUC: 0.635363
[HCTR][03:45:14.611][INFO][RK0][main]: Evaluation, AUC: 0.63397
[HCTR][03:45:17.225][INFO][RK0][main]: Evaluation, AUC: 0.62498
[HCTR][03:45:17.298][INFO][RK0][main]: -----------------------------------Epoch 38-----------------------------------
[HCTR][03:45:19.805][INFO][RK0][main]: Evaluation, AUC: 0.631844
[HCTR][03:45:22.421][INFO][RK0][main]: Evaluation, AUC: 0.616634
[HCTR][03:45:23.465][INFO][RK0][main]: -----------------------------------Epoch 39-----------------------------------
[HCTR][03:45:25.027][INFO][RK0][main]: Evaluation, AUC: 0.637862
[HCTR][03:45:27.672][INFO][RK0][main]: Evaluation, AUC: 0.62419
[HCTR][03:45:29.549][INFO][RK0][main]: -----------------------------------Epoch 40-----------------------------------
[HCTR][03:45:30.251][INFO][RK0][main]: Evaluation, AUC: 0.605228
[HCTR][03:45:32.900][INFO][RK0][main]: Evaluation, AUC: 0.622567
[HCTR][03:45:35.538][INFO][RK0][main]: Evaluation, AUC: 0.634089
[HCTR][03:45:36.353][INFO][RK0][main]: -----------------------------------Epoch 41-----------------------------------
[HCTR][03:45:38.122][INFO][RK0][main]: Evaluation, AUC: 0.633255
[HCTR][03:45:40.747][INFO][RK0][main]: Evaluation, AUC: 0.637025
[HCTR][03:45:42.439][INFO][RK0][main]: -----------------------------------Epoch 42-----------------------------------
[HCTR][03:45:43.339][INFO][RK0][main]: Evaluation, AUC: 0.628744
[HCTR][03:45:45.979][INFO][RK0][main]: Evaluation, AUC: 0.627138
[HCTR][03:45:48.624][INFO][RK0][main]: Evaluation, AUC: 0.61648
[HCTR][03:45:49.228][INFO][RK0][main]: -----------------------------------Epoch 43-----------------------------------
[HCTR][03:45:51.225][INFO][RK0][main]: Evaluation, AUC: 0.622604
[HCTR][03:45:53.792][INFO][RK0][main]: Evaluation, AUC: 0.634602
[HCTR][03:45:55.257][INFO][RK0][main]: -----------------------------------Epoch 44-----------------------------------
[HCTR][03:45:56.344][INFO][RK0][main]: Evaluation, AUC: 0.631658
[HCTR][03:45:58.894][INFO][RK0][main]: Evaluation, AUC: 0.621491
[HCTR][03:46:01.440][INFO][RK0][main]: Evaluation, AUC: 0.625775
[HCTR][03:46:01.818][INFO][RK0][main]: -----------------------------------Epoch 45-----------------------------------
[HCTR][03:46:03.927][INFO][RK0][main]: Evaluation, AUC: 0.614879
[HCTR][03:46:06.446][INFO][RK0][main]: Evaluation, AUC: 0.621894
[HCTR][03:46:07.737][INFO][RK0][main]: -----------------------------------Epoch 46-----------------------------------
[HCTR][03:46:08.961][INFO][RK0][main]: Evaluation, AUC: 0.623026
[HCTR][03:46:11.477][INFO][RK0][main]: Evaluation, AUC: 0.620515
[HCTR][03:46:14.014][INFO][RK0][main]: Evaluation, AUC: 0.62285
[HCTR][03:46:14.190][INFO][RK0][main]: -----------------------------------Epoch 47-----------------------------------
[HCTR][03:46:16.517][INFO][RK0][main]: Evaluation, AUC: 0.609457
[HCTR][03:46:19.059][INFO][RK0][main]: Evaluation, AUC: 0.630078
[HCTR][03:46:20.150][INFO][RK0][main]: -----------------------------------Epoch 48-----------------------------------
[HCTR][03:46:21.592][INFO][RK0][main]: Evaluation, AUC: 0.640645
[HCTR][03:46:24.096][INFO][RK0][main]: Evaluation, AUC: 0.630915
[HCTR][03:46:26.641][INFO][RK0][main]: Evaluation, AUC: 0.626881
[HCTR][03:46:26.642][INFO][RK0][main]: -----------------------------------Epoch 49-----------------------------------
[HCTR][03:46:29.174][INFO][RK0][main]: Evaluation, AUC: 0.624196
[HCTR][03:46:31.738][INFO][RK0][main]: Evaluation, AUC: 0.635639
[HCTR][03:46:32.630][INFO][RK0][main]: -----------------------------------Epoch 50-----------------------------------
[HCTR][03:46:34.234][INFO][RK0][main]: Evaluation, AUC: 0.625333
[HCTR][03:46:36.771][INFO][RK0][main]: Evaluation, AUC: 0.606624
[HCTR][03:46:38.501][INFO][RK0][main]: -----------------------------------Epoch 51-----------------------------------
[HCTR][03:46:39.301][INFO][RK0][main]: Evaluation, AUC: 0.615676
[HCTR][03:46:41.823][INFO][RK0][main]: Evaluation, AUC: 0.62022
[HCTR][03:46:44.350][INFO][RK0][main]: Evaluation, AUC: 0.624788
[HCTR][03:46:45.024][INFO][RK0][main]: -----------------------------------Epoch 52-----------------------------------
[HCTR][03:46:46.837][INFO][RK0][main]: Evaluation, AUC: 0.617381
[HCTR][03:46:49.374][INFO][RK0][main]: Evaluation, AUC: 0.635112
[HCTR][03:46:50.935][INFO][RK0][main]: -----------------------------------Epoch 53-----------------------------------
[HCTR][03:46:51.903][INFO][RK0][main]: Evaluation, AUC: 0.631386
[HCTR][03:46:54.435][INFO][RK0][main]: Evaluation, AUC: 0.616767
[HCTR][03:46:56.954][INFO][RK0][main]: Evaluation, AUC: 0.61728
[HCTR][03:46:57.431][INFO][RK0][main]: -----------------------------------Epoch 54-----------------------------------
[HCTR][03:46:59.458][INFO][RK0][main]: Evaluation, AUC: 0.628254
[HCTR][03:47:01.989][INFO][RK0][main]: Evaluation, AUC: 0.61775
[HCTR][03:47:03.370][INFO][RK0][main]: -----------------------------------Epoch 55-----------------------------------
[HCTR][03:47:04.515][INFO][RK0][main]: Evaluation, AUC: 0.629893
[HCTR][03:47:07.081][INFO][RK0][main]: Evaluation, AUC: 0.609286
[HCTR][03:47:09.617][INFO][RK0][main]: Evaluation, AUC: 0.627195
[HCTR][03:47:09.895][INFO][RK0][main]: -----------------------------------Epoch 56-----------------------------------
[HCTR][03:47:12.126][INFO][RK0][main]: Evaluation, AUC: 0.61822
[HCTR][03:47:15.025][INFO][RK0][main]: Evaluation, AUC: 0.619697
[HCTR][03:47:16.418][INFO][RK0][main]: -----------------------------------Epoch 57-----------------------------------
[HCTR][03:47:18.096][INFO][RK0][main]: Evaluation, AUC: 0.628677
[HCTR][03:47:21.540][INFO][RK0][main]: Evaluation, AUC: 0.632847
[HCTR][03:47:24.633][INFO][RK0][main]: Evaluation, AUC: 0.626757
[HCTR][03:47:24.758][INFO][RK0][main]: -----------------------------------Epoch 58-----------------------------------
[HCTR][03:47:27.734][INFO][RK0][main]: Evaluation, AUC: 0.630998
[HCTR][03:47:31.249][INFO][RK0][main]: Evaluation, AUC: 0.624877
[HCTR][03:47:32.528][INFO][RK0][main]: -----------------------------------Epoch 59-----------------------------------
[HCTR][03:47:34.863][INFO][RK0][main]: Evaluation, AUC: 0.626591
[HCTR][03:47:38.615][INFO][RK0][main]: Evaluation, AUC: 0.613877
[HCTR][03:47:40.785][INFO][RK0][main]: -----------------------------------Epoch 60-----------------------------------
[HCTR][03:47:41.855][INFO][RK0][main]: Evaluation, AUC: 0.617227
[HCTR][03:47:45.216][INFO][RK0][main]: Evaluation, AUC: 0.604119
[HCTR][03:47:48.435][INFO][RK0][main]: Evaluation, AUC: 0.610586
[HCTR][03:47:49.617][INFO][RK0][main]: -----------------------------------Epoch 61-----------------------------------
[HCTR][03:47:51.822][INFO][RK0][main]: Evaluation, AUC: 0.619046
[HCTR][03:47:55.262][INFO][RK0][main]: Evaluation, AUC: 0.61385
[HCTR][03:47:57.630][INFO][RK0][main]: -----------------------------------Epoch 62-----------------------------------
[HCTR][03:47:58.712][INFO][RK0][main]: Evaluation, AUC: 0.622791
[HCTR][03:48:02.060][INFO][RK0][main]: Evaluation, AUC: 0.617612
[HCTR][03:48:05.560][INFO][RK0][main]: Evaluation, AUC: 0.607431
[HCTR][03:48:06.490][INFO][RK0][main]: -----------------------------------Epoch 63-----------------------------------
[HCTR][03:48:09.092][INFO][RK0][main]: Evaluation, AUC: 0.622129
[HCTR][03:48:12.705][INFO][RK0][main]: Evaluation, AUC: 0.616801
[HCTR][03:48:14.504][INFO][RK0][main]: -----------------------------------Epoch 64-----------------------------------
[HCTR][03:48:15.979][INFO][RK0][main]: Evaluation, AUC: 0.611538
[HCTR][03:48:19.146][INFO][RK0][main]: Evaluation, AUC: 0.61975
[HCTR][03:48:22.412][INFO][RK0][main]: Evaluation, AUC: 0.601556
[HCTR][03:48:22.810][INFO][RK0][main]: -----------------------------------Epoch 65-----------------------------------
[HCTR][03:48:26.073][INFO][RK0][main]: Evaluation, AUC: 0.605704
[HCTR][03:48:29.257][INFO][RK0][main]: Evaluation, AUC: 0.617283
[HCTR][03:48:30.897][INFO][RK0][main]: -----------------------------------Epoch 66-----------------------------------
[HCTR][03:48:32.606][INFO][RK0][main]: Evaluation, AUC: 0.592738
[HCTR][03:48:35.761][INFO][RK0][main]: Evaluation, AUC: 0.60789
[HCTR][03:48:38.972][INFO][RK0][main]: Evaluation, AUC: 0.607424
[HCTR][03:48:39.223][INFO][RK0][main]: -----------------------------------Epoch 67-----------------------------------
[HCTR][03:48:42.298][INFO][RK0][main]: Evaluation, AUC: 0.611191
[HCTR][03:48:45.413][INFO][RK0][main]: Evaluation, AUC: 0.621895
[HCTR][03:48:46.675][INFO][RK0][main]: -----------------------------------Epoch 68-----------------------------------
[HCTR][03:48:48.483][INFO][RK0][main]: Evaluation, AUC: 0.625431
[HCTR][03:48:51.841][INFO][RK0][main]: Evaluation, AUC: 0.581377
[HCTR][03:48:54.986][INFO][RK0][main]: Evaluation, AUC: 0.580077
[HCTR][03:48:54.986][INFO][RK0][main]: -----------------------------------Epoch 69-----------------------------------
[HCTR][03:48:58.409][INFO][RK0][main]: Evaluation, AUC: 0.589024
JacoCheung commented 1 year ago

Yes, the hyperparameters should be adjusted (in the worst case, you may have to opt for another optimizer). We have not fully tested fp16 training for all models.

There are two remarks from the AUC log you posted:

  1. The AUC in the first epoch is much lower than with fp32. It may imply that this model plus the current configuration is not suitable for fp16 training.

  2. The AUC tends to drop after many epochs. It may imply overfitting.

Anyway, I recommend adjusting some hyperparameters.
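For example (a sketch only, not a tested recipe for WDL with fp16): since the loss scale is static, the scaler argument of CreateSolver and a smaller learning rate are the first knobs I would experiment with; the particular values below are guesses:

solver = hugectr.CreateSolver(
    max_eval_batches=300,
    batchsize_eval=16384,
    batchsize=16384,
    lr=0.0005,                  # guess: smaller learning rate than the fp32 run
    vvgpu=[[0]],
    repeat_dataset=False,
    i64_input_key=True,
    use_mixed_precision=True,
    scaler=1024,                # static loss scale; HugeCTR has no dynamic scaling
)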

zpcalan commented 1 year ago

Thanks. I will adjust as you suggest. By the way, could you please check this issue about ETC? I am trying to run that test regardless of the divergence. I appreciate it a lot!

zpcalan commented 1 year ago

I am trying not to use the use_mixed_precision parameter and instead manually add a hugectr.Layer_t.Cast layer, to fix this convergence issue while keeping performance consistent. But the API documentation seems to lack a description of the hugectr.Layer_t.Cast layer. How can I use this type of DenseLayer to cast an output to float16, for example?
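What I had in mind was something like this (just a guess at the usage, since the docs don't describe the Cast layer; I am assuming it follows the same bottom_names/top_names pattern as the other layers):

model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Cast,   # assumed usage; not verified against the API
        bottom_names=["relu1"],
        top_names=["cast1"],
    )
)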

JacoCheung commented 1 year ago

Hi @zpcalan, I'm afraid you cannot manually cast the data type and feed the tensor to a layer of a different data type. The use_mixed_precision flag has a global impact: if it's off, HugeCTR assumes all input tensors of all layers have the fp32 data type, while if it's on, all inputs to layers must have the fp16 data type. HugeCTR does not support fp16 for a specific layer. In fp32 mode, the Cast layer casts fp32 into fp16; in fp16 mode, the Cast layer casts from fp16 to fp32.

zpcalan commented 1 year ago

I used DLRM instead of W&D because the parameters in that script have already been configured correctly for mixed precision. Issue closed. :)