facebookresearch / UNIREX

This is the official PyTorch repo for "UNIREX: A Unified Learning Framework for Language Model Rationale Extraction" (ICML 2022).

No such file or directory: '/home/xxx/logs/multiruns/2023-06-19/18-07-25/0/main.py' #4

Open wj210 opened 1 year ago

wj210 commented 1 year ago

Hi, I am encountering this error, similar to the other issue listed. I am using CUDA 11.7, so I couldn't use the CUDA version listed in requirements.txt, as it is not compatible (see the attached conda_packages.txt).

Also, this seems to happen when I am using multi-GPU: setting gpus = [2,3] causes this error, and if I don't use multi-GPU I run into OOM issues despite setting the batch size to 8 and using a 40GB GPU for the Movies dataset.

Besides this, I have some questions regarding build_dataset.py. For the CoS-E dataset, it seems that for each datapoint there are 5 inputs, each corresponding to one answer option. Why is that? Wouldn't the entire option list be included in each example?

aarzchan commented 1 year ago

Hi @wj210, thanks for reaching out!

I am encountering this error, similar to the other issue listed.

To help me debug this issue, could you please share the following items:
- the full error stack trace
- the exact command you ran
- any changes you made to the default configs

Also, this seems to happen when I am using multi-GPU: setting gpus = [2,3] causes this error, and if I don't use multi-GPU I run into OOM issues despite setting the batch size to 8 and using a 40GB GPU for the Movies dataset.

To check if this issue is related to multi-gpu training, could you try reducing the batch size to 1 for single-GPU training? If the Movies dataset still yields OOM errors for batch size 1, you can try one of the datasets with shorter input sequences (e.g., SST).
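If batch size 1 does fit on a single GPU, one generic way to keep a larger effective batch size without increasing per-step memory is gradient accumulation, which the run command in this thread already exposes via setup.accumulate_grad_batches. A minimal, hypothetical PyTorch Lightning sketch (the Trainer arguments below are illustrative, not this repo's actual Trainer construction):

import pytorch_lightning as pl

# Hypothetical sketch: a per-step batch size of 1 with 8 accumulation steps
# gives an effective batch size of 8, while only one example's activations
# live in GPU memory at a time. `model` and `datamodule` are placeholders.
trainer = pl.Trainer(
    gpus=1,                     # single GPU, mirroring the single-GPU debugging run
    max_epochs=1,
    accumulate_grad_batches=8,  # 8 x (batch size 1) = effective batch size 8
)
# trainer.fit(model, datamodule)  # placeholders for the repo's LightningModule / DataModule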

For the CoS-E dataset, it seems that for each datapoint there are 5 inputs, each corresponding to one answer option. Why is that? Wouldn't the entire option list be included in each example?

In CoS-E, the label space is not fixed (i.e., the five answer choices are different for every instance). Thus, if all five answer choices are provided in a single input sequence, it is unclear what the model's output logits would represent. For example, suppose all five answer choices are given as input, and we design the model to output five logits instead of one. In this case, how would the model map the given instance's five answer choices to the five logits, when every instance has a different set of answer choices?

In light of this, we instead follow prior works by using the following setup (see the sketch after this list):

- The question is paired with each of the five answer choices to form five separate input sequences.
- The model outputs a single logit for each (question, choice) input.
- The five logits are then compared across choices (e.g., normalized via softmax), and the highest-scoring choice is taken as the prediction.
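For illustration, here is a minimal sketch of this per-choice scoring setup. The encoder name, example question, and answer choices are assumptions for illustration only, not code from build_dataset.py:

# Sketch: each (question, choice) pair is encoded as its own input sequence,
# the model produces one logit per pair, and the five logits are compared
# across choices to pick the answer.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")   # assumed encoder, for illustration
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=1)

question = "What do people aim to do at work?"              # illustrative CoS-E-style example
choices = ["complete job", "learn from each other", "kill animals", "wear hats", "talk to each other"]

# Build five separate input sequences, one per (question, choice) pair.
enc = tokenizer([question] * len(choices), choices, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits.squeeze(-1)   # shape: (5,), one score per choice

probs = torch.softmax(logits, dim=-1)          # normalize across the five choices
pred = int(torch.argmax(probs))                # index of the predicted answer choice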

wj210 commented 1 year ago

This is the error stack:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/weijie/anaconda3/envs/unirex/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/home/weijie/anaconda3/envs/unirex/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/home/weijie/anaconda3/envs/unirex/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/home/weijie/anaconda3/envs/unirex/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/home/weijie/anaconda3/envs/unirex/lib/python3.8/runpy.py", line 264, in run_path
    code, fname = _get_code_from_file(run_name, path_name)
  File "/home/weijie/anaconda3/envs/unirex/lib/python3.8/runpy.py", line 234, in _get_code_from_file
    with io.open_code(decoded_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/weijie/logs/multiruns/2023-06-22/17-38-37/0/main.py'

I was using the command for SLM with all 3 coefficients:

python main.py -m \
    save_checkpoint=True \
    data=movies \
    model=expl_reg \
    model.explainer_type=lm \
    model.expl_head_type=mlp \
    model.expl_head_mlp_hidden_dim=2048 \
    model.expl_head_mlp_hidden_layers=2 \
    model.task_wt=1.0 \
    model.comp_wt=0.5 \
    model.suff_wt=0.5 \
    model.plaus_wt=0.5 \
    model.optimizer.lr=2e-5 \
    setup.train_batch_size=2 \
    setup.accumulate_grad_batches=1 \
    setup.eff_train_batch_size=2 \
    setup.eval_batch_size=2 \
    setup.num_workers=3 \
    seed=0

I ran the command in the same dir as main.py. Also, I changed some settings in the training config.yaml:

work_dir: ${hydra:runtime.cwd}
data_dir: '${work_dir}/data'
log_dir: '${work_dir}/logs'
save_dir: '${work_dir}/save'

and in the trainer defaults.yaml:

gpus: [0,1]
auto_select_gpus: False

I tried with a single GPU and batch size 1; it is weird that even that causes OOM errors for the Movies dataset.

aarzchan commented 1 year ago

FileNotFoundError: [Errno 2] No such file or directory: '/home/weijie/logs/multiruns/2023-06-22/17-38-37/0/main.py'

I ran the command in the same dir as main.py. Also, I changed some settings in the training config.yaml: work_dir: ${hydra:runtime.cwd}, data_dir: '${work_dir}/data', log_dir: '${work_dir}/logs', save_dir: '${work_dir}/save'

Hmm, I'm a bit confused about the dir structure. The error suggests that your actual work_dir is set to /home/weijie/logs/multiruns/2023-06-22/17-38-37/0/, as the code expects this dir to contain main.py. However, because you set log_dir: '${work_dir}/logs', then there also exists a /home/weijie/logs/multiruns/2023-06-22/17-38-37/0/logs dir, which is strange since there are two logs in the path. Plus, the fact that you have a logs dir in /home/weijie indicates that your intended work_dir is /home/weijie, which is strange too (i.e., I thought it would be something like /home/weijie/UNIREX).

Could you try setting a breakpoint here, then (1) checking the value of cfg.work_dir and (2) running these lines:

from hydra.utils import get_original_cwd
get_original_cwd()

Ideally, cfg.work_dir and get_original_cwd() should both return the same dir. Please let me know what dir value(s) you get.
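For reference, here is a minimal, standalone sketch of that check, assuming a Hydra entry point similar to main.py. The config_path/config_name values are placeholders, and cfg.work_dir is assumed to be resolved from a field like work_dir: ${hydra:runtime.cwd}; in Hydra 1.0/1.1, the process's working directory is changed to the run dir by default, while get_original_cwd() returns the directory the command was launched from.

# Sketch for checking how Hydra resolves work_dir vs. the launch directory.
import os
import hydra
from omegaconf import DictConfig
from hydra.utils import get_original_cwd

@hydra.main(config_path="configs", config_name="config")   # placeholder config location
def main(cfg: DictConfig):
    print("cfg.work_dir:      ", cfg.work_dir)        # value resolved from the config
    print("os.getcwd():       ", os.getcwd())         # Hydra's run dir (e.g., .../multiruns/<date>/<time>/0)
    print("get_original_cwd():", get_original_cwd())  # directory the command was launched from

if __name__ == "__main__":
    main()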

Also, could you write out your intended dir structure? For example, mine is:

UNIREX-root (top-level dir that I created externally)
- UNIREX (this repo)
   - main.py
   - ...
- data
- logs
- save

I tried with a single GPU and batch size 1; it is weird that even that causes OOM errors for the Movies dataset.

Movies has very long inputs (by default, we truncate to max_length = 1024), so it's possible that your GPU doesn't have enough memory even for batch size 1. For reference, all of our experiments were done using NVIDIA A100 40GB GPUs. Could you share which GPU(s) you're using?
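As a rough illustration of why long inputs are costly (back-of-the-envelope arithmetic with generic base-encoder layer/head counts, not a profile of this repo's model): self-attention activations grow quadratically with sequence length, so max_length = 1024 is far more demanding than a short SST-style input.

# Back-of-the-envelope estimate of attention-score memory per example (fp32),
# ignoring all other activations, weights, gradients, and optimizer state.
# Layer/head counts are illustrative (base-encoder-like), not this repo's exact model.
def attention_score_bytes(seq_len, num_layers=12, num_heads=12, bytes_per_elem=4):
    # one (seq_len x seq_len) score matrix per head per layer
    return num_layers * num_heads * seq_len * seq_len * bytes_per_elem

for seq_len in (67, 1024):
    mb = attention_score_bytes(seq_len) / 1e6
    print(f"max_length={seq_len:4d}: ~{mb:.0f} MB of attention scores per example")

# max_length=1024 needs ~(1024/67)^2, i.e. roughly 234x, more attention-score
# memory per example than max_length=67.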

For debugging purposes, as I suggested in my previous comment, could you start by using SST instead of Movies? SST's inputs are much shorter and unlikely to cause OOM with batch size 1. If you still get OOM with SST, you can try reprocessing SST (i.e., via build_dataset.py) using max_length < 67.
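As a generic illustration of what reprocessing with a shorter max_length means, here is a sketch using a HuggingFace-style tokenizer; the tokenizer name, example sentence, and max_length value are assumptions for illustration, not the exact code path in build_dataset.py:

# Generic example of truncating inputs to a shorter max_length during preprocessing.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")   # assumed tokenizer, for illustration
text = "an uplifting , funny and heartfelt movie"           # an SST-style sentence

enc = tokenizer(
    text,
    truncation=True,        # drop tokens beyond max_length
    max_length=64,          # e.g., a value below the 67 mentioned above
    padding="max_length",   # pad shorter inputs up to max_length
)
print(len(enc["input_ids"]))  # 64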