Oneflow-Inc / libai

LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
https://libai.readthedocs.io
Apache License 2.0
391 stars 55 forks

[Bug]libai test error:File exists: './data_test/bert_data' #428

Closed: strint closed this issue 1 year ago

strint commented 1 year ago

Run libai test

libai version: https://github.com/Oneflow-Inc/libai/commit/94eb85ff0131e8dfce953a3a916de7a4f897c647

ONEFLOW_TEST_DEVICE_NUM=4 python3 -m oneflow.distributed.launch --nproc_per_node 4 -m unittest -f tests/models/test_bert.py

Related ci error: https://github.com/Oneflow-Inc/oneflow/actions/runs/3467834474/jobs/5794361433

FileExistsError

Traceback (most recent call last):
  File "/home/xuxiaoyu/dev/libai/tests/models/test_bert.py", line 60, in setUp
    vocab_path = get_data_from_cache(VOCAB_URL, cache_dir, md5=VOCAB_MD5)
  File "/home/xuxiaoyu/dev/libai/libai/utils/file_utils.py", line 292, in get_data_from_cache
    os.makedirs(cache_dir)
  File "/home/xuxiaoyu/miniconda3/envs/oneflow-dev-gcc7-v2/lib/python3.7/os.py", line 223, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: './data_test/bert_data'
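The immediate error is a directory-creation race: several ranks reach the bare os.makedirs(cache_dir) in get_data_from_cache at the same time, and every rank that loses the race gets errno 17. A minimal stand-alone sketch of the failure and the usual exist_ok=True defence (hypothetical temp paths, not the libai code):

```python
import os
import tempfile

# Stand-in for './data_test/bert_data' so the sketch is self-contained.
cache_dir = os.path.join(tempfile.mkdtemp(), "bert_data")

os.makedirs(cache_dir)          # the winning rank creates the directory
try:
    os.makedirs(cache_dir)      # a losing rank hits FileExistsError (errno 17)
    raced = False
except FileExistsError:
    raced = True

# Defensive variant: idempotent, so it is safe for every rank to call.
os.makedirs(cache_dir, exist_ok=True)
print("raced:", raced)
```

Note that exist_ok=True only removes this particular crash; it does not stop a non-zero rank from reading data that rank 0 has not finished downloading, which is what the synchronization below is for.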
======================================================================
ERROR: test_bert_eager_with_data_tensor_parallel (tests.models.test_bert.TestBertModel)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/xuxiaoyu/dev/libai/tests/models/test_bert.py", line 60, in setUp
    vocab_path = get_data_from_cache(VOCAB_URL, cache_dir, md5=VOCAB_MD5)
  File "/home/xuxiaoyu/dev/libai/libai/utils/file_utils.py", line 292, in get_data_from_cache
    os.makedirs(cache_dir)
  File "/home/xuxiaoyu/miniconda3/envs/oneflow-dev-gcc7-v2/lib/python3.7/os.py", line 223, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: './data_test/bert_data'

----------------------------------------------------------------------
Ran 1 test in 0.293s

FAILED (errors=1)

(The other failing ranks print the same report, interleaved, e.g. "Ran 1 test in 0.292s".)

The run also sometimes got stuck at "Start building model":

[11/15 09:41:02 lb.data.data_utils.dataset_utils]:  > saved the index mapping in ./data_test/bert_data/loss_compara_content_sentence_bert_indexmap_80mns_509msl_0.10ssp_1234s.npy
[11/15 09:41:02 lb.data.data_utils.dataset_utils]:  > elapsed time to build and save samples mapping (seconds): 0.000520
[11/15 09:41:02 lb.data.data_utils.dataset_utils]:  > loading indexed mapping from ./data_test/bert_data/loss_compara_content_sentence_bert_indexmap_80mns_509msl_0.10ssp_1234s.npy
[11/15 09:41:02 lb.data.data_utils.dataset_utils]:     loaded indexed file in 0.000 seconds
[11/15 09:41:02 lb.data.data_utils.dataset_utils]:     total number of samples: 112
[11/15 09:41:02 lb.engine.default]: Auto-scaling the config to train.train_iter=10, train.warmup_iter=0
[11/15 09:41:02 libai]: > Start building model...
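A plausible mechanism for the intermittent hang (an illustration of barrier semantics, not oneflow internals): if one rank dies in setUp with the FileExistsError above and never reaches the synchronization point, the surviving ranks block at their barrier forever. The semantics can be shown with threading.Barrier, using a timeout so the example terminates:

```python
import threading

# A barrier for N parties only releases when all N arrive. Here we create
# a 2-party barrier but only one "rank" ever calls wait(), modelling a
# peer that crashed before synchronizing. The timeout stands in for the
# indefinite hang seen in the real run.
barrier = threading.Barrier(parties=2)

def surviving_rank():
    try:
        barrier.wait(timeout=0.5)   # the crashed rank never shows up
        return "released"
    except threading.BrokenBarrierError:
        return "hung (timed out)"

result = surviving_rank()
print(result)
```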
strint commented 1 year ago

Found that the problem reproduces as long as the data_test directory is deleted first.

CPFLAME commented 1 year ago

My guess is that dist.synchronize() did not take effect, but I ran it with this version of oneflow and could not reproduce the problem.

loaded library: /lib/x86_64-linux-gnu/libibverbs.so.1
path: ['/home/chengpeng/miniconda3/envs/libai/lib/python3.8/site-packages/oneflow']
version: 0.8.1.dev20221102+cu112
git_commit: 60b7ec5
cmake_build_type: Release
rdma: True
mlir: True
strint commented 1 year ago

"My guess is that dist.synchronize() did not take effect, but I ran it with this version of oneflow and could not reproduce the problem."

Confirmed with @CPFLAME @xiezipeng-ML that this is indeed the problem.

This test reproduces it:

        import time
        # prepare dataset
        start_time = time.perf_counter()
        if dist.get_local_rank() == 0:  # rank 0 builds the data; the other ranks should wait
            # download dataset on main process of each node
            get_data_from_cache(VOCAB_URL, cache_dir, md5=VOCAB_MD5)
            get_data_from_cache(BIN_DATA_URL, cache_dir, md5=BIN_DATA_MD5)
            get_data_from_cache(IDX_DATA_URL, cache_dir, md5=IDX_DATA_MD5)
            os.makedirs(TEST_OUTPUT, exist_ok=True)
        # time.sleep(10)
        dist.synchronize()  # the synchronized waiting across ranks is not working here
        end_time = time.perf_counter()
        print(f"rank {oneflow.env.get_rank()} get data time cost {end_time - start_time} seconds")  # rank 0 takes a long time; the other ranks finish almost instantly
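For contrast, the pattern the test relies on can be sketched with multiprocessing.Barrier standing in for oneflow's dist.synchronize() (an analogy with hypothetical file names, not the oneflow implementation): with a working barrier, every non-zero rank is guaranteed to see rank 0's data before proceeding.

```python
import multiprocessing as mp
import os
import tempfile
import time

def _worker(rank, barrier, cache_dir, results):
    # Rank 0 prepares the data (stand-in for the get_data_from_cache calls).
    if rank == 0:
        time.sleep(0.2)                     # simulate a slow download/build
        os.makedirs(cache_dir, exist_ok=True)
        with open(os.path.join(cache_dir, "vocab.txt"), "w") as f:
            f.write("ok")
    barrier.wait()                          # what dist.synchronize() should provide
    # After a working barrier, every rank must see rank 0's output.
    results[rank] = os.path.exists(os.path.join(cache_dir, "vocab.txt"))

def run(nprocs=4):
    cache_dir = os.path.join(tempfile.mkdtemp(), "bert_data")
    barrier = mp.Barrier(nprocs)
    results = mp.Manager().dict()
    procs = [mp.Process(target=_worker, args=(r, barrier, cache_dir, results))
             for r in range(nprocs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [results[r] for r in range(nprocs)]

if __name__ == "__main__":
    print(run())    # every rank should report that it saw the data
```

In the buggy run, the non-zero ranks fell through the ineffective synchronize() immediately (hence the near-zero timings they print) and raced rank 0 on the cache directory.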

The problem was not exposed at the earlier libai commit (the old data was already cached, so no data build happened and the error could not be triggered).

It was exposed after updating the libai commit, because the new commit adds a new test that needs new data, which triggers a data build and hence the error: https://github.com/Oneflow-Inc/oneflow/actions/runs/3469372110/jobs/5796633149

Fix: https://github.com/Oneflow-Inc/oneflow/pull/9351/commits/5dffebcf6f4509fd8d14d42d9af80d90dc5b5055

Related PR: https://github.com/Oneflow-Inc/oneflow/pull/9282/files