alibaba / ai-matrix

To make it easy to benchmark AI accelerators

Bugs in DIEN and DIEN_TF2: both produce NaN when training on data from prepare_data.sh #79

Open xuechendi opened 3 years ago

xuechendi commented 3 years ago

Hi, Ali ai-matrix team

I recently tried this repo and tested DIEN, preparing the training data with both prepare_dataset.sh and prepare_data.sh. The current DIEN code seems to work only with prepare_dataset.sh: if I use prepare_data.sh for feature enabling, training always produces NaN. See the picture below: [image]

Is this a known issue? I also tried another Alibaba repo, https://github.com/alibaba/x-deeplearning/tree/master/xdl-algorithm-solution/DIEN, which seems to handle data from prepare_data.sh well.

Looking forward to your reply. I'll also see whether I can make a quick fix, but I think this issue should be reported here in any case.

Best regards, Chendi

xuechendi commented 3 years ago

Update:

After debugging, I found that the NaN issue is caused by records with numHistory equal to 1. After filtering out these lines in local_aggregate.py, the training code now works with prepare_data.sh.

FYI
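
For anyone hitting the same problem, here is a minimal sketch of the kind of filter described above. It is not the actual change made to local_aggregate.py. It assumes each record is a tab-separated line whose history column holds item IDs joined by a separator. The column index, the separator, and the helper names `history_length` and `filter_short_histories` are all hypothetical and need adapting to the real file layout.

```python
# Hypothetical layout: tab-separated fields, with the behavior history
# stored in one column as IDs joined by HISTORY_SEP. Both are assumptions.
HISTORY_SEP = "\x02"
HISTORY_COL = 4


def history_length(line, history_col=HISTORY_COL, sep=HISTORY_SEP):
    """Return the number of history items in one record line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) <= history_col or not fields[history_col]:
        return 0
    return len(fields[history_col].split(sep))


def filter_short_histories(lines, min_history=2, history_col=HISTORY_COL):
    """Keep only records with at least `min_history` history items.

    With min_history=2 this drops the records whose history length is 1
    (and any with an empty history).
    """
    return [
        line for line in lines
        if history_length(line, history_col) >= min_history
    ]


if __name__ == "__main__":
    sample = [
        "1\tu1\tm9\tc9\tm1\tc1\n",                          # history of 1
        "0\tu2\tm8\tc8\tm1\x02m2\x02m3\tc1\x02c2\x02c3\n",  # history of 3
    ]
    kept = filter_short_histories(sample)
    print(len(kept), "of", len(sample), "records kept")
```

Filtering at data-preparation time keeps the training code unchanged. Whether a minimum of 2 is the right threshold for this model is an assumption based only on the observation above.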