Closed CoinCheung closed 4 days ago
Hi @CoinCheung,
Thank you for reaching out. I'm afraid the error message you shared doesn't provide any clue what might have gone wrong. I tried to run your code but all I get is:
torchrun --nproc_per_node=1 test153.py
Traceback (most recent call last):
  File "/home/user/Dali/dali/test153.py", line 263, in <module>
    dataloader, total_iters = create_dali_loader()
  File "/home/user/Dali/dali/test153.py", line 222, in create_dali_loader
    source = ExternalInputIterator(
  File "/home/user/Dali/dali/test153.py", line 48, in __init__
    with open(file_anno, 'r') as fr:
FileNotFoundError: [Errno 2] No such file or directory: './datasets/pil_save/lmdb/all_dedup_lmdb.txt.shape'
Exception ignored in: <function ExternalInputIterator.__del__ at 0x7a224b7c9a20>
Traceback (most recent call last):
  File "/home/user/Dali/dali/test153.py", line 78, in __del__
    if not self.env is None:
AttributeError: 'ExternalInputIterator' object has no attribute 'env'
[2024-09-18 09:20:28,890] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2171562) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
which makes me wonder whether the code you provided is complete/self-contained. Does your problem reproduce without using rand_augment?
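As a side note, the second exception in the log above (AttributeError in __del__) is only a follow-up of the FileNotFoundError: __init__ failed before self.env was assigned, yet Python still ran __del__ on the half-built object. A minimal sketch of one way to avoid that noise follows. SafeIterator is a hypothetical stand-in, not the class from test153.py, and the file handle used as env is only a placeholder for whatever resource the real code opens.

```python
class SafeIterator:
    """Hypothetical stand-in for the ExternalInputIterator in the traceback."""

    def __init__(self, file_anno):
        # Define the attribute before any statement that can raise, so that
        # __del__ always finds it even when __init__ fails part-way.
        self.env = None
        with open(file_anno, "r") as fr:
            self.lines = fr.read().splitlines()
        # Placeholder resource; the original code presumably opens its
        # database environment at this point.
        self.env = open(file_anno, "r")

    def __del__(self):
        # getattr with a default also covers objects whose __init__ never ran.
        env = getattr(self, "env", None)
        if env is not None:
            env.close()


try:
    SafeIterator("./does/not/exist.txt")
except FileNotFoundError as exc:
    # Only the real error is reported; __del__ stays silent.
    print("construction failed:", exc)
```

With this change a missing annotation file still raises FileNotFoundError, but the log no longer contains the unrelated "Exception ignored in" block.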
Yes, if I comment out the line with rand_augment, the code works well. The error message you posted above is associated with this line:
I created a txt file in which each line is a path to an image on the hard disk. Then I assigned this txt file in the above line, so the code should be able to work.
@CoinCheung,
I created a txt file in which each line is a path to an image on the hard disk. Then I assigned this txt file in the above line, so the code should be able to work.
Can you extend the example to generate all the necessary prerequisites so I'm sure that I'm running the same code as you do?
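For illustration, generating such a list file could look roughly like the sketch below. The helper name write_image_list, the extension filter, and the file name images.txt are assumptions made for this example. The empty .jpg files only stand in for real images, and the sketch does not create the .txt.shape file mentioned in the traceback, whose format is not described in this thread.

```python
import os
import tempfile


def write_image_list(image_dir, list_path, exts=(".jpg", ".jpeg", ".png")):
    """Write one absolute image path per line to list_path and return the paths."""
    paths = sorted(
        os.path.abspath(os.path.join(root, name))
        for root, _, names in os.walk(image_dir)
        for name in names
        if name.lower().endswith(exts)
    )
    with open(list_path, "w") as fw:
        for path in paths:
            fw.write(path + "\n")
    return paths


# Demo in a temporary directory; the empty files are placeholders, not
# decodable images.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        open(os.path.join(tmp, f"img_{i}.jpg"), "wb").close()
    listed = write_image_list(tmp, os.path.join(tmp, "images.txt"))
    print(len(listed), "paths written")
```

Shipping a script like this together with the reproducer removes the dependency on files that exist only on the reporter's machine.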
@JanuszL Hi, here is a piece of sample code. You can run torchrun --nproc_per_node=4 main.py to see the error. Please download this: https://github.com/CoinCheung/eewee/releases/download/0.0.0/sample.zip
Hello @CoinCheung, the bug is confirmed. Serendipitously, we fixed it recently while working on another feature. Please try the latest nightly build (from Sep 17th); it should fix the problem. The upcoming release 1.42 will include the fix.
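To check whether an installed build is new enough, something along these lines may help. The helper has_fix is hypothetical and compares only the major and minor version numbers, so it assumes nightly builds report a 1.42-style version string; it cannot tell a nightly built before the fix from one built after it.

```python
import re


def has_fix(version, fixed_in=(1, 42)):
    """Return True if a 'major.minor...' version string is at least fixed_in."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= fixed_in


try:
    import nvidia.dali as dali
    print(dali.__version__, "includes the fix:", has_fix(dali.__version__))
except ImportError:
    print("DALI is not installed in this environment")
```

For the version reported in this issue, has_fix("1.31.0") returns False.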
Thanks for telling me this, I will wait for the new release.
Version
1.31.0
Describe the bug.
When adding rand_augment, the program crashes.
Minimum reproducible example
Relevant log output
Other/Misc.
No response
Check for duplicates