YoumiMa / dreeam

Source code for Document-level Relation Extraction with Evidence-guided Attention Mechanism (DREEAM)
MIT License

Python process killed when running infer_distant script #9

Closed: jefflink closed this issue 1 year ago

jefflink commented 1 year ago

Can I check whether running the infer_distant script requires a lot of RAM? I have a 48GB GPU card and 128GB of RAM, but running either infer_distant_bert or infer_distant_roberta results in the Python process being killed while evaluating batches. For example:

scripts/infer_distant_roberta.sh: line 17:  3218 Killed                  python run.py --data_dir dataset/docred --transformer_type roberta --model_name_or_path roberta-large --display_name ${NAME} --load_path ${LOAD_DIR} --eval_mode single --test_file train_distant.json --test_batch_size 4 --num_labels 4 --evi_thresh 0.2 --num_class 97 --save_attn
YoumiMa commented 1 year ago

Hi @jefflink, thank you for your interest in this project! I suspect the program got killed due to a lack of CPU memory, as the evaluation process currently converts the predictions into a numpy array and keeps that array in CPU memory for further computations (see L.141-161 in run.py). Since the distantly supervised data is large, this step can require a large amount of CPU memory.
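
Roughly speaking, the pattern looks like the sketch below. This is a simplified illustration rather than the actual code in run.py, and the model output names (logits, evidence logits, attentions) are hypothetical; the point is that every batch's outputs are moved to host RAM and finally concatenated into large numpy arrays, so peak CPU memory grows with the size of the evaluation set.

```python
# Simplified sketch of the memory-hungry evaluation pattern (not the actual
# run.py code; the output names are hypothetical).
import numpy as np
import torch

def evaluate(model, dataloader):
    preds, evi_preds, attns = [], [], []
    for batch in dataloader:
        with torch.no_grad():
            logits, evi_logits, attn = model(**batch)  # hypothetical outputs
        # Every batch is copied off the GPU and kept in CPU memory ...
        preds.append(logits.cpu().numpy())
        evi_preds.append(evi_logits.cpu().numpy())
        attns.append(attn.cpu().numpy())
    # ... then concatenated into one large array per output, so the whole
    # evaluation set's predictions sit in host RAM at once.
    return (np.concatenate(preds, axis=0),
            np.concatenate(evi_preds, axis=0),
            np.concatenate(attns, axis=0))
```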

We adopted this part of the code directly from the original codebase without modification, but it should be possible to optimize it to use less memory. I will try to improve its memory efficiency later. Until then, one workaround is to split train_distant.json into several parts, run the inference script on each part, and then concatenate the results, as sketched below.
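
A minimal sketch of the splitting step could look like the following. It assumes train_distant.json is a single JSON array of documents in DocRED format and that each shard filename is then passed to the script separately via --test_file; how the per-shard outputs (e.g. the files written with --save_attn) are merged back depends on what you need downstream, but they should be concatenated in the same shard order.

```python
# Hypothetical helper for the workaround above: split train_distant.json
# (assumed to be one JSON array of documents) into N shards. Each shard can
# then be passed to run.py separately via --test_file.
import json

def split_dataset(path="dataset/docred/train_distant.json", n_shards=8):
    with open(path) as f:
        docs = json.load(f)
    shard_size = (len(docs) + n_shards - 1) // n_shards  # ceiling division
    shard_paths = []
    for i in range(n_shards):
        shard = docs[i * shard_size:(i + 1) * shard_size]
        shard_path = path.replace(".json", f"_part{i}.json")
        with open(shard_path, "w") as f:
            json.dump(shard, f)
        shard_paths.append(shard_path)
    return shard_paths

if __name__ == "__main__":
    for p in split_dataset():
        print(p)
```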

I hope this is clear and helps you solve your problem.

jefflink commented 1 year ago

Thanks @YoumiMa. I'll take a look at the script. I'm just surprised that even 128GB of RAM is not sufficient.

Winson-Huang commented 1 year ago

@jefflink Hi, I wonder if you managed to solve this problem? I am facing exactly the same situation. Thank you!

jefflink commented 1 year ago

@Winson-Huang Hi! Unfortunately, I did not manage to fix it.