Open · Wh9511 opened 6 months ago

Wh9511:
When I run the step 2 command bash scripts/infer_distant_bert.sh ${name} ${load_dir} # for BERT in an environment with an RTX 4090 and 120 GB of RAM, the process is always killed inexplicably, which is strange. But I found that if I read only the first 1,000 samples from the train_distant.json file, it succeeds.
Hi @Wh9511, thank you for your interest in this project, and I am very sorry for the late reply.
It seems to me that the process is killed because the RAM has run out.
The experimental environment we are using has 360 GB of CPU RAM available. By monitoring memory usage with the htop command, we found that step 2 consumes a peak of ~220 GB of memory. Because train_distant.json contains a large number of samples, if your computing resources are limited, we have two alternative solutions:
1. Split train_distant.json into two or three parts and calculate the attention scores separately for each part, then combine the resulting files for subsequent training. Each file contains a list of np.array, so merging the contents of the several lists sequentially should suffice (see the first sketch below).
2. Convert the attention scores written in run.py at line 158 to float16 format: attns.extend([a.to(torch.float16).cpu().numpy() for a in attn]). Please note that this may affect the learning accuracy of the next step (see the second sketch below).

I hope this will help you solve your problem.
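For option 1, here is a minimal sketch of the split-and-merge workflow. It assumes the per-part attention scores are saved with pickle as plain Python lists of np.array; the file names, and the pickle format itself, are assumptions, so adjust them to however step 2 actually saves its output:

```python
import json
import pickle

# Hypothetical file names -- adjust to your setup.
PART_JSONS = ["train_distant_part1.json", "train_distant_part2.json"]
PART_ATTNS = ["attns_part1.pkl", "attns_part2.pkl"]

# 1) Split train_distant.json (a JSON list of samples) into two parts.
with open("train_distant.json") as f:
    samples = json.load(f)

half = len(samples) // 2
for path, chunk in zip(PART_JSONS, (samples[:half], samples[half:])):
    with open(path, "w") as f:
        json.dump(chunk, f)

# 2) After step 2 has been run on each part, merge the per-part
#    attention files back into one list, preserving sample order.
merged = []
for path in PART_ATTNS:
    with open(path, "rb") as f:
        merged.extend(pickle.load(f))

with open("attns_merged.pkl", "wb") as f:
    pickle.dump(merged, f)
```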
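For option 2, a quick self-contained illustration of why the cast helps: float16 storage takes half the space of float32 (the tensor shape here is made up):

```python
import torch

# A hypothetical attention tensor (float32 by default).
attn = torch.rand(12, 512, 512)

a32 = attn.cpu().numpy()                    # original float32 copy
a16 = attn.to(torch.float16).cpu().numpy()  # the change suggested above

print(a32.nbytes)  # 12 * 512 * 512 * 4 bytes = 12,582,912
print(a16.nbytes)  # 12 * 512 * 512 * 2 bytes = 6,291,456 -- half the RAM
```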