FMInference / DejaVu

No output in collecting training data #3

Closed: bilgeacun closed this issue 11 months ago

bilgeacun commented 1 year ago

Hi, I'm using the ./run_infer_opt_175b_collect_sp_data.sh script to collect training data. The script runs, but when it finishes I don't see any output in the location it prints as the output destination:

<RequestProcessor> dir: /private/home/acun/DejaVu/Decentralized_FM_alpha/c4_train
<RequestProcessor> file: c4_train.jsonl
<RequestProcessor>, output file: /private/home/acun/DejaVu/Decentralized_FM_alpha/c4_train/output_c4_train.jsonl

So output_c4_train.jsonl does not get created.

What could be wrong here?

lzcemma commented 1 year ago

Hi, the training data is supposed to be stored under the paths specified in Decentralized_FM_alpha/modules/hf_opt_module_save.py:

https://github.com/FMInference/DejaVu/blob/1ee9ff072409adba8e78acf487e2ae207018f907/Decentralized_FM_alpha/modules/hf_opt_module_save.py#L433C36-L433C36

bilgeacun commented 1 year ago

Thanks, I modified the paths there but I don't see anything there either. I'm using the opt1.3b model and modified the arguments in run_infer_opt_175b_collect_sp_data.sh to be:

file=/private/home/acun/DejaVu/Decentralized_FM_alpha/c4_train/c4_train.jsonl    

ARGS="--model-name /checkpoint/acun/dejavu/opt1.3b \
--model-type opt-save \
--seed 42 \
--fp16 \
--num-layers 24 \
--max-layers 24 \
--budget 100 \
--num-iters 1 \
--dist-url tcp://127.0.0.1:9032 \
--token-micro-batch-size 1 \
--world-size 8 --pipeline-group-size 8 --data-group-size 1 \
--pp-mode pipe_sync_sample_mask_token_pipe \
--infer-data ${file}"

Do you see anything wrong here? Or any other parameter I need to change for the 1.3b model?

hustzxd commented 1 year ago

I think num-layers = max-layers / GPU_num

For example, if I have two GPUs:

export CUDA_VISIBLE_DEVICES="0,1"

ARGS="--model-name /checkpoint/acun/dejavu/opt1.3b \
--model-type opt-save \
--seed 42 \
--fp16 \
--num-layers 12 \
--max-layers 24 \
--budget 22800 \
--num-iters 2000 \
--dist-url tcp://127.0.0.1:9032 \
--token-micro-batch-size 1 \
--world-size 2 --pipeline-group-size 2 --data-group-size 1 \
--pp-mode pipe_sync_sample_mask_token_pipe \
--infer-data ${file}"

(trap 'kill 0' SIGINT; \
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 0 --rank 0 \
    &
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 1 --rank 1 \
    & \
wait)

bilgeacun commented 1 year ago

In the example script run_infer_opt_175b_collect_sp_data.sh, this does not hold: max-layers / GPU_num = 96 / 8 = 12, but num-layers is set to 16.

lzcemma commented 1 year ago

num-layers should be max-layers / GPU_num. I just remembered that I need to change the example script to the 8-GPU setting; I usually run with 6 GPUs, which is why it has num-layers 16. --num-iters should also be set larger to make sure it goes through all the data. Everything else looks fine.

Can you show me how you modified hf_opt_module_save.py?

bilgeacun commented 1 year ago

Thanks. I only modified the paths in hf_opt_module_save.py, nothing else:

diff --git a/Decentralized_FM_alpha/modules/hf_opt_module_save.py b/Decentralized_FM_alpha/modules/hf_opt_module_save.py
index 804a342..0a9a160 100644
--- a/Decentralized_FM_alpha/modules/hf_opt_module_save.py
+++ b/Decentralized_FM_alpha/modules/hf_opt_module_save.py
@@ -430,7 +430,7 @@ class GPTBlock(OPTDecoderLayer):
         module.self_attn.layer_index = layer_index
         module.fp_i = 0
         module.fp_mlp_query = np.memmap(
-            f"/lustre/fsw/nvresearch/ldm/diffusion/data/175b_c4/mlp_sp_x_{module.layer_index}.mmap",
+            f"/checkpoint/acun/dejavu/ldm/diffusion/data/13b/mlp_sp_x_{module.layer_index}.mmap",
             dtype="float16",
             mode="w+",
             shape=(
@@ -439,7 +439,7 @@ class GPTBlock(OPTDecoderLayer):
             ),
         )
         module.fp_att_query = np.memmap(
-            f"/lustre/fsw/nvresearch/ldm/diffusion/data/175b_c4/att_sp_x_{module.layer_index}.mmap",
+            f"/checkpoint/acun/dejavu/ldm/diffusion/data/13b/att_sp_x_{module.layer_index}.mmap",
             dtype="float16",
             mode="w+",
             shape=(
@@ -448,7 +448,7 @@ class GPTBlock(OPTDecoderLayer):
             ),
         )
         module.fp_label = np.memmap(
-            f"/lustre/fsw/nvresearch/ldm/diffusion/visualization/175b/mlp_label_{module.layer_index}.mmap",
+            f"/checkpoint/acun/dejavu/ldm/diffusion/visualization/13b/mlp_label_{module.layer_index}.mmap",
             dtype="float16",
             mode="w+",
             shape=(
@@ -458,7 +458,7 @@ class GPTBlock(OPTDecoderLayer):
         )
         module.self_attn.fp_i = 0
         module.self_attn.fp_label = np.memmap(
-            f"/lustre/fsw/nvresearch/ldm/diffusion/visualization/175b/score_norm_{module.layer_index}.mmap",
+            f"/checkpoint/acun/dejavu/ldm/diffusion/visualization/13b/score_norm_{module.layer_index}.mmap",
             dtype="float16",
             mode="w+",
             shape=(
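
A quick way to sanity-check whether the collection step wrote anything is to open one of these memmaps read-only after a run. This is just a minimal sketch using the path from the diff above; opening without a shape gives a flat float16 view, since np.memmap stores no shape metadata:

import numpy as np

# Minimal sanity check: open a collected memmap read-only and see whether
# any non-zero values were written. The path matches the modified
# hf_opt_module_save.py above; layer 0 is just an example.
layer_index = 0
path = f"/checkpoint/acun/dejavu/ldm/diffusion/data/13b/mlp_sp_x_{layer_index}.mmap"
x = np.memmap(path, dtype="float16", mode="r")
print(path, "elements:", x.shape[0], "non-zero in first 1M:", np.count_nonzero(x[:1_000_000]))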
lzcemma commented 1 year ago

hf_opt_module_save.py looks fine.

After you run the script with the right arguments, do you see no output in either the save-data paths or the output path? Did you try running run_infer_opt_175b_c4.sh? Did you see an output file and a perplexity score with that script?

bilgeacun commented 1 year ago

Yes, run_infer_opt_175b_c4.sh seems to run fine and I get a perplexity score at the end (perplexity: 15.413179061205755), and output_c4_val_opt_175b.jsonl in the c4_val folder is populated as well.

It also tries to dump some profiling traces into a trace_json folder, but that fails because the folder doesn't exist.
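
A simple workaround is to create that folder before running; a minimal sketch, assuming the script writes trace_json relative to the working directory:

import os

# Create the missing trace output folder so the profiling dump can succeed.
os.makedirs("trace_json", exist_ok=True)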

bilgeacun commented 12 months ago

@lzcemma I found out the issue is caused by the code trying to load the whole 800 GB c4 dataset into memory... Reading the paper in more detail, I found that "500 random data points from the c4 training dataset" were used for collecting training data, so it's not necessary to load the whole dataset. It would be nice to add instructions on GitHub about this.

lzcemma commented 11 months ago

@bilgeacun Thanks for pointing this out. I just updated c4_train/get_data.py to subsample 500 sentences by default and added a note in the README.
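
For anyone who hits the memory issue before pulling that update, the idea is just to stream the c4 training split and keep a small random sample instead of loading everything. A minimal sketch follows (not the actual c4_train/get_data.py; the allenai/c4 dataset name and the {"text": ...} JSONL record format are assumptions):

import json
from datasets import load_dataset

# Stream c4 so the full ~800 GB dataset is never loaded into memory,
# shuffle with a small buffer, and write 500 examples as JSONL.
ds = load_dataset("allenai/c4", "en", split="train", streaming=True)
ds = ds.shuffle(seed=42, buffer_size=10_000)
with open("c4_train.jsonl", "w") as f:
    for i, example in enumerate(ds):
        if i >= 500:
            break
        f.write(json.dumps({"text": example["text"]}) + "\n")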

bilgeacun commented 11 months ago

Thanks, fix looks good.

czq693497091 commented 5 months ago

Hi, I failed to run run_infer_opt_175b_collect_sp_data.sh since I don't have mlp_sp_x_{module.layer_index}.mmap. How can we get this file?

czq693497091 commented 5 months ago

Hi, do you know how we can get the .mmap files for the opt-175b model or any of the smaller models? Thanks!
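
Note that hf_opt_module_save.py opens these files with np.memmap(..., mode="w+"), which creates them on disk if they don't exist, so the .mmap files are produced by running run_infer_opt_175b_collect_sp_data.sh itself rather than downloaded separately. A minimal illustration of that numpy behavior, with a toy path and shape:

import numpy as np

# mode="w+" creates (or overwrites) the backing file, just like the
# fp_mlp_query / fp_att_query memmaps in hf_opt_module_save.py.
m = np.memmap("toy_mlp_sp_x_0.mmap", dtype="float16", mode="w+", shape=(4, 8))
m[:] = 0.5
m.flush()
print(np.memmap("toy_mlp_sp_x_0.mmap", dtype="float16", mode="r", shape=(4, 8))[0, :3])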