Open · Wh9511 opened 6 months ago

Wh9511:
When I run the step 2 command bash scripts/infer_distant_bert.sh ${name} ${load_dir} # for BERT in an environment with an RTX 4090 and 120 GB of RAM, the process is always killed inexplicably, which is strange. But I found that if I read only the first 1,000 samples from the train_distant.json file, it succeeds.
Hi @Wh9511, thank you for your interest in this project, and I am very sorry for the late reply.
It seems to me that the process is killed because the RAM has run out.
The experimental environment we are using has 360 GB of CPU RAM available. By monitoring memory usage with the htop command, we found that step 2 consumes a peak of ~220 GB of memory. Because train_distant.json contains a large number of samples, if your computing resources are limited, we have two alternative solutions:
1. Split train_distant.json into two or three parts and calculate the attention scores separately for each part, then combine the resulting files for subsequent training. Each file contains a list of np.array, so merging the contents of the several lists sequentially should suffice (see the first sketch below).
2. Convert the attention scores written in run.py at line 158 to float16 format: attns.extend([a.to(torch.float16).cpu().numpy() for a in attn]). Please note that this may affect the learning accuracy of the next step (see the second sketch below).

I hope this will help you solve your problem.
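For option 1, here is a minimal sketch of the split-and-merge workflow. It assumes the per-part attention scores are saved with pickle as plain Python lists of np.array; the file names, and the pickle format itself, are assumptions, so adjust them to however step 2 actually saves its output:

```python
import json
import pickle

# Hypothetical file names -- adjust to your setup.
PART_JSONS = ["train_distant_part1.json", "train_distant_part2.json"]
PART_ATTNS = ["attns_part1.pkl", "attns_part2.pkl"]

# 1) Split train_distant.json (a JSON list of samples) into two parts.
with open("train_distant.json") as f:
    samples = json.load(f)

half = len(samples) // 2
for path, chunk in zip(PART_JSONS, (samples[:half], samples[half:])):
    with open(path, "w") as f:
        json.dump(chunk, f)

# 2) After step 2 has been run on each part, merge the per-part
#    attention files back into one list, preserving sample order.
merged = []
for path in PART_ATTNS:
    with open(path, "rb") as f:
        merged.extend(pickle.load(f))

with open("attns_merged.pkl", "wb") as f:
    pickle.dump(merged, f)
```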
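For option 2, a quick self-contained illustration of why the cast helps: float16 storage takes half the space of float32 (the tensor shape here is made up):

```python
import torch

# A hypothetical attention tensor (float32 by default).
attn = torch.rand(12, 512, 512)

a32 = attn.cpu().numpy()                    # original float32 copy
a16 = attn.to(torch.float16).cpu().numpy()  # the change suggested above

print(a32.nbytes)  # 12 * 512 * 512 * 4 bytes = 12,582,912
print(a16.nbytes)  # 12 * 512 * 512 * 2 bytes = 6,291,456 -- half the RAM
```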