YoumiMa / dreeam

Source code for Document-level Relation Extraction with Evidence-guided Attention Mechanism (DREEAM)

The step 2 process is always killed #23

Open Wh9511 opened 6 months ago

Wh9511 commented 6 months ago

When I run the step 2 command `bash scripts/infer_distant_bert.sh ${name} ${load_dir} # for BERT` on a machine with an RTX 4090 GPU and 120 GB of RAM, the process is always killed inexplicably, which is strange. However, when I read only the first 1,000 samples from the train_distant.json file, it ran successfully.

YoumiMa commented 6 months ago

Hi @Wh9511, thank you for your interest in this project, and I am very sorry for the late reply. It seems to me that the process is killed because the RAM has run out. Our experimental environment has 360 GB of CPU RAM available; by monitoring memory usage with the htop command, we found that step 2 consumes up to ~220 GB. Because train_distant.json contains a large number of samples, if computing resources are limited there are two workarounds:

  1. Split train_distant.json into two or three parts and compute the attention scores separately for each part. Then combine the resulting files for subsequent training. Each file contains a list of np.array, so concatenating the lists in order should suffice (see the first sketch after this list).
  2. Convert the attention saved in run.py at line 158 to float16 format: `attns.extend([a.to(torch.float16).cpu().numpy() for a in attn])`. Please note that this may affect the learning accuracy of the next step (see the second sketch below).
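A minimal sketch of option 1. The part/output file names and the use of pickle here are assumptions for illustration; the actual serialization format depends on how run.py saves its output. train_distant.json is assumed to be a JSON list of documents, as in DocRED.

```python
import json
import pickle

# Step A: split train_distant.json into n_parts smaller files.
n_parts = 3
with open("train_distant.json") as f:
    data = json.load(f)  # a list of documents

chunk = (len(data) + n_parts - 1) // n_parts
for i in range(n_parts):
    with open(f"train_distant_part{i}.json", "w") as f:
        json.dump(data[i * chunk:(i + 1) * chunk], f)

# Step B: after running step 2 on each part, merge the saved attention
# files back into one list, preserving the original document order.
merged = []
for i in range(n_parts):
    with open(f"attns_part{i}.pkl", "rb") as f:
        merged.extend(pickle.load(f))  # each file holds a list of np.array

with open("attns_merged.pkl", "wb") as f:
    pickle.dump(merged, f)
```

A minimal sketch of option 2, showing the effect of the float16 cast. The tensors below are dummy stand-ins, not the repository's actual attention shapes, and the commented-out "original" line is a presumed version of run.py line 158 without the cast.

```python
import torch

# Dummy per-batch attention tensors; shapes are illustrative only.
attn = [torch.rand(12, 512, 512) for _ in range(2)]

attns = []
# Presumed original (float32): attns.extend([a.cpu().numpy() for a in attn])
# Modified line with the float16 cast suggested above:
attns.extend([a.to(torch.float16).cpu().numpy() for a in attn])

print(attns[0].dtype, attns[0].nbytes)  # float16; half the float32 footprint
```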
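The float16 cast halves the memory needed to store the attention scores, which is why it can keep step 2 within a smaller RAM budget, at the cost of some numerical precision in the evidence-guided training that follows.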

I hope this will help you solve your problem.