chuyuanli / MTL4Depr

Source code for the paper "Multi-Task Learning for Depression Detection in Dialogs" (SIGDial 2022)

some questions about training #3

Open jessapinkman opened 6 months ago

jessapinkman commented 6 months ago

Hi,

When I try to run the main script to train the model, I get the following problem:

```
Traceback (most recent call last):
  File "c:\Users\jessa\Desktop\MTL4Depr-master\src\main.py", line 272, in <module>
    model, train_metrics, dev_loader = run_training_loop(params, outf=f, serialdir=serialdir, config=CONFIG)
  File "c:\Users\jessa\Desktop\MTL4Depr-master\src\main.py", line 131, in run_training_loop
    metrics = trainer.train()
  File "D:\conda\envs\mtl\lib\site-packages\allennlp\training\gradient_descent_trainer.py", line 771, in train
    metrics, epoch = self._try_train()
  File "D:\conda\envs\mtl\lib\site-packages\allennlp\training\gradient_descent_trainer.py", line 793, in _try_train
    train_metrics = self._train_epoch(epoch)
  File "D:\conda\envs\mtl\lib\site-packages\allennlp\training\gradient_descent_trainer.py", line 515, in _train_epoch
    raise ValueError("nan loss encountered")
ValueError: nan loss encountered
```

It looks like the data contains some invalid values. How do I handle this?

Also, here is my directory structure for the dataset. Is it correct?

```
├─ data
│  ├─ daic
│  │  ├─ 300_TRANSCRIPT.csv
│  │  ├─ 301_TRANSCRIPT.csv
│  │  └─ 304_TRANSCRIPT.csv
│  ├─ dailydialog
│  │  ├─ .DS_Store
│  │  ├─ dialogues_act.txt
│  │  ├─ dialogues_emotion.txt
│  │  ├─ dialogues_text.txt
│  │  ├─ dialogues_topic.txt
│  │  ├─ ijcnlp_dailydialog
│  │  │  ├─ .DS_Store
│  │  │  ├─ dialogues_act.txt
│  │  │  ├─ dialogues_emotion.txt
│  │  │  ├─ dialogues_text.txt
│  │  │  ├─ dialogues_topic.txt
│  │  │  ├─ readme.txt
│  │  │  ├─ test.zip
│  │  │  ├─ train.zip
│  │  │  └─ validation.zip
│  │  ├─ readme.txt
│  │  ├─ test
│  │  │  ├─ dialogues_act_test.txt
│  │  │  ├─ dialogues_emotion_test.txt
│  │  │  └─ dialogues_test.txt
│  │  ├─ train
│  │  │  ├─ dialogues_act_train.txt
│  │  │  ├─ dialogues_emotion_train.txt
│  │  │  └─ dialogues_train.txt
│  │  └─ validation
│  │     ├─ dialogues_act_validation.txt
│  │     ├─ dialogues_emotion_validation.txt
│  │     └─ dialogues_validation.txt
│  └─ ijcnlp_dailydialog.zip
```

jessapinkman commented 6 months ago

By the way, could you send me the whole DAIC dataset (the xxx_TRANSCRIPT files)? Downloading so many .zip files one by one is time-consuming.

My email is: pinkman@stu.xjtu.edu.cn

Thanks!

jessapinkman commented 6 months ago

@chuyuanli I would appreciate it if you could help me.

chuyuanli commented 6 months ago

Hello, thanks for your interest. It is not clear how you ran into that issue; my guess is that the gold labels are missing or not in the expected format. Check the input and output formats carefully: labels should be converted into integers, for instance.
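
For example, a quick sanity check along these lines (just a sketch, with file paths taken from the directory listing you posted above) will flag anything that does not parse as an integer:

```python
# Sketch: verify that every label token in the DailyDialog label files
# parses as an integer (paths follow the directory listing in this thread).
for path in ["data/dailydialog/train/dialogues_act_train.txt",
             "data/dailydialog/train/dialogues_emotion_train.txt"]:
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            for tok in line.split():
                try:
                    int(tok)
                except ValueError:
                    print(f"{path}:{lineno}: non-integer label {tok!r}")
```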

About the DAIC data, you need to submit a request, after which you can download the files. An example of the expected data layout is given in this repo, and your directory looks fine. You can also check the code in dataset_reader.py for details.

Hope this helps.

jessapinkman commented 6 months ago


Thank you for your reply.

In order to start training the model quickly, I only downloaded part of the DAIC dataset. After I applied for the dataset, they gave me a URL, but I need to download each sample one by one, which is time-consuming. Do you still keep the entire dataset? I only need the text file of each sample.

[screenshot]

The "nan loss" problem I encountered occurred when loading the data set. I just ran the main.py file, which seemed to not fully start training the model. I tried to print out the shapes and data of label_act, label_emo, label_phq, and label_topic. Every lable has the same shape as the predictions tensor without the num_classes dimension. And i did not find the missing value "nan" in the data, but there were some negative integers "-1". Is this possibly the reason for the error? Here is the print from my console:

```
(mtl) C:\Users\jessa\Desktop\MTL4Depr-master>D:/conda/envs/mtl/python.exe c:/Users/jessa/Desktop/MTL4Depr-master/src/main.py
11118 1000 1000 3 2 5
building vocab: 100%|##########| 11121/11121 [00:00<00:00, 29514.59it/s]
Building the model...
D:\conda\envs\mtl\lib\site-packages\torch\cuda\memory.py:278: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
  0%|          | 0/696 [00:00<?, ?it/s]
tensor([[ 0, 1, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 1, 0, 1, 0, 1, 0, 1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, -1, -1, -1, -1],
        [ 2, 3, 1, 0, 1, 0, 1, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 0, 0, 1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 1, 0, 2, 3, 1, 0, 1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 2, 3, 2, 3, 0, 1, 2, 3, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 2, 3, 0, 0],
        [ 1, 0, 1, 0, 2, 2, 3, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 0, 0, 2, 2, 3, 1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 0, 0, 1, 0, 1, 0, 1, 0, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 1, 1, 0, 2, 3, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 2, 0, 1, 0, 1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]], device='cuda:0')
tensor([[ 6, 6, 4, 6, 4, 6, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1],
        [ 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 4, 4, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 0, 6, 0, 0, 0, 0, 0, 4, 0, 0, 4, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 0, 0, 5, 6, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
        [ 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 0, 0, 0, 0, 0, 0, 0, 0, 6, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 0, 0, 0, 0, 6, 4, 4, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 0, 4, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        [ 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]], device='cuda:0')
tensor([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1], device='cuda:0')
tensor([4, 7, 4, 0, 4, 4, 4, 4, 0, 4, 5, 3, 7, 0, 4, 7], device='cuda:0')
  0%|          | 0/696 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "c:\Users\jessa\Desktop\MTL4Depr-master\src\main.py", line 273, in <module>
    model, train_metrics, dev_loader = run_training_loop(params, outf=f, serialdir=serialdir, config=CONFIG)
  File "c:\Users\jessa\Desktop\MTL4Depr-master\src\main.py", line 131, in run_training_loop
    metrics = trainer.train()
  File "D:\conda\envs\mtl\lib\site-packages\allennlp\training\gradient_descent_trainer.py", line 771, in train
    metrics, epoch = self._try_train()
  File "D:\conda\envs\mtl\lib\site-packages\allennlp\training\gradient_descent_trainer.py", line 793, in _try_train
    train_metrics = self._train_epoch(epoch)
  File "D:\conda\envs\mtl\lib\site-packages\allennlp\training\gradient_descent_trainer.py", line 515, in _train_epoch
    raise ValueError("nan loss encountered")
ValueError: nan loss encountered
```
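
For what it's worth, I can reproduce the nan with a tiny PyTorch sketch (my own code, not from the repo; it assumes the loss masks the -1 padding, e.g. via ignore_index):

```python
import torch
import torch.nn.functional as F

# All targets are the ignore index, like the all -1 label_phq tensor above.
logits = torch.randn(16, 8)                        # (batch, num_classes)
targets = torch.full((16,), -1, dtype=torch.long)  # every label is padding
# With the default reduction='mean', the loss is a sum over zero valid
# elements divided by zero total weight, which yields nan.
print(F.cross_entropy(logits, targets, ignore_index=-1))  # tensor(nan)
```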

jessapinkman commented 6 months ago


Actually, when I set has_emo / has_topic / has_act = False, the model trains fine, but as soon as one of these three parameters is set to True, the error is raised (ValueError("nan loss encountered")). I checked the data: the problem is caused by the missing phq label in the dailydialog dataset. In dataset_reader.py you fill in missing values with -1, but when all labels in a batch are -1, the loss value is "nan". How can I fix this? @chuyuanli
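
One workaround I am considering (just a sketch of mine, not the repo's code): use sum reduction and divide by the number of valid labels, clamped to at least one, so an all-padding batch contributes zero loss instead of nan:

```python
import torch
import torch.nn.functional as F

def safe_cross_entropy(logits: torch.Tensor,
                       targets: torch.Tensor,
                       ignore_index: int = -1) -> torch.Tensor:
    """Cross-entropy that returns 0 instead of nan when every target is padding."""
    # Padded targets contribute 0 to the summed loss.
    total = F.cross_entropy(logits, targets,
                            ignore_index=ignore_index, reduction="sum")
    # Clamp so an all-padding batch divides by 1 instead of 0.
    num_valid = (targets != ignore_index).sum().clamp(min=1)
    return total / num_valid
```

(For the token-level tasks the logits and targets would need to be flattened first.) Alternatively, a task's loss term could simply be skipped whenever (targets != -1).any() is False. Would either of these be compatible with your setup?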