facebookresearch / fairscale

PyTorch extensions for high performance and large scale training.

FSDP: illegal memory access when flatten=True #689

Open min-xu-ai opened 3 years ago

min-xu-ai commented 3 years ago

While working on a model with FSDP wrapping, I ran into an illegal memory access crash; it went away with flatten=False. I will be debugging it.

min-xu-ai commented 3 years ago

cc: @prigoyal This is the issue I was referring to, Giri.

min-xu-ai commented 3 years ago

@zhaojuanmao Hi Yanli, this might be an interesting bug that you may want to take a look at, since it relates to the stability of FSDP. I haven't found time to debug it. If you are interested, I can share the steps to reproduce.

zhaojuanmao commented 3 years ago

@min-xu-ai thanks, I can take this over. Would you please share the steps to reproduce? I can start from there to build a unit test and debug.

min-xu-ai commented 3 years ago

Thanks @zhaojuanmao!!! I just reproduced it again:

[screenshot: illegal memory access stack trace]

Download this file from here: https://dl.fbaipublicfiles.com/min/miniclip.tgz

Warning: it is pretty big, about 1.3 GB. After downloading, untar it and run this command:

PYTHONPATH=../f3 python -m torch.distributed.launch --nproc_per_node=2 --use_env main.py --sharded

This assumes that you have the fairscale tree in the ../f3 dir, and that you have a system with 2 GPUs.
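Since the command launches with --use_env, each worker process reads its rank from environment variables rather than from a --local_rank argument. A minimal sketch of that lookup (the helper name get_local_rank is my own, not from main.py):

```python
import os

# With torch.distributed.launch --use_env, the launcher exports LOCAL_RANK
# (along with RANK and WORLD_SIZE) into each worker's environment instead
# of passing a --local_rank command-line argument.
def get_local_rank(default=0):
    return int(os.environ.get("LOCAL_RANK", default))

# Simulate what the launcher would set for the second worker.
os.environ["LOCAL_RANK"] = "1"
print(get_local_rank())
```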

Inside main.py, when FSDP is used, I set flatten to True to trigger this error. When it is False, the error does not happen.
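For context, flatten=True makes FSDP concatenate all wrapped parameters into one contiguous buffer and hand views of it back to the module. A rough numpy illustration of that bookkeeping (the shapes are made up, and this is not fairscale's actual implementation):

```python
import numpy as np

# Hypothetical parameter shapes for an FSDP-wrapped module.
shapes = [(4, 3), (3,), (2, 4)]
params = [np.random.rand(*s).astype(np.float32) for s in shapes]

# flatten=True: concatenate every parameter into one contiguous 1-D buffer.
flat = np.concatenate([p.ravel() for p in params])

# Hand back per-parameter views into the flat buffer.
views, offset = [], 0
for s in shapes:
    n = int(np.prod(s))
    views.append(flat[offset:offset + n].reshape(s))
    offset += n

# Round trip: each view matches the original parameter values.
assert all(np.array_equal(v, p) for v, p in zip(views, params))
```

An out-of-bounds offset into a buffer like this is the kind of bookkeeping error that would surface as an illegal memory access on GPU.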

zhaojuanmao commented 3 years ago

@min-xu-ai the dataset file cannot be found. Is there some dataset we need to download and set up? I tried to run "python make_dataset.py" in the 'miniclip' folder, but it cannot find 'yfcc14m_ids.npy' there. I also feel I need to download '/datasets01/yfcc100m/090517/yfcc100m_dataset.txt' from somewhere as well.

min-xu-ai commented 3 years ago

> @min-xu-ai the dataset file cannot be found. Is there some dataset we need to download and set up? I tried to run "python make_dataset.py" in the 'miniclip' folder, but it cannot find 'yfcc14m_ids.npy' there. I also feel I need to download '/datasets01/yfcc100m/090517/yfcc100m_dataset.txt' from somewhere as well.

There is a big .pkl file in the tar file, which I thought was the data needed. This code opens it:

[screenshot: code that opens the .pkl file]

I don't think you need to run make_dataset.py since the pkl file is already made. Can you double-check that you are running main.py from the dir where the pkl file exists?
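For reference, loading such a pre-built .pkl file is a plain pickle.load; a tiny self-contained sketch (the file name and record schema here are invented, not the actual miniclip format):

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the pre-built metadata .pkl shipped in the
# tarball; the real file's name and contents are not shown in the thread.
records = {"ids": [1, 2, 3], "captions": ["a", "b", "c"]}

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "dataset.pkl")
    with open(path, "wb") as f:
        pickle.dump(records, f)
    # This is all main.py needs to do at startup, provided it is run from
    # the directory that contains the .pkl file.
    with open(path, "rb") as f:
        loaded = pickle.load(f)

assert loaded == records
```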

zhaojuanmao commented 3 years ago

@min-xu-ai yes, both the pkl file and main.py are in the 'miniclip' dir, and I am running main.py from there.

'PYTHONPATH=~/fairscale python -m torch.distributed.launch --nproc_per_node=2 --use_env main.py --sharded'

It failed to load the '/datasets01/yfcc100m/090517/yfcc100m_dataset.txt' file, and I do not know where it is. See the errors:

=> creating dataset
Traceback (most recent call last):
  File "main.py", line 549, in <module>
    main(args)
  File "main.py", line 182, in main
    val_dataset = datasets.ImageFolder(os.path.join(args.imagenet, 'val'), val_transform)
  File "/home/yanlizhao/venv2/lib/python3.8/site-packages/torchvision/datasets/folder.py", line 310, in __init__
    super(ImageFolder, self).__init__(root, loader, IMG_EXTENSIONS if is_valid_file is None else None,
  File "/home/yanlizhao/venv2/lib/python3.8/site-packages/torchvision/datasets/folder.py", line 145, in __init__
    classes, class_to_idx = self.find_classes(self.root)
  File "/home/yanlizhao/venv2/lib/python3.8/site-packages/torchvision/datasets/folder.py", line 221, in find_classes
    return find_classes(directory)
  File "/home/yanlizhao/venv2/lib/python3.8/site-packages/torchvision/datasets/folder.py", line 40, in find_classes
    classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
FileNotFoundError: [Errno 2] No such file or directory: '/datasets01/imagenet_full_size/061417/val'

(The second worker process printed the same traceback.)

min-xu-ai commented 3 years ago

Sorry about that. From your error msg, I think it is trying to load the ImageNet folder. I hacked the code this way:

diff --git a/main.py b/main.py
index a1aaecb..0304fbf 100755
--- a/main.py
+++ b/main.py
@@ -179,13 +179,15 @@ def main(args):
         train_dataset = datasets.ImageFolder(os.path.join(args.imagenet, 'train'), val_transform)
     else:
         train_dataset = YFCC14MDataset(args.data, args.size, train_transform, tokenizer)
-    val_dataset = datasets.ImageFolder(os.path.join(args.imagenet, 'val'), val_transform)
+    #val_dataset = datasets.ImageFolder(os.path.join(args.imagenet, 'val'), val_transform)
+    val_dataset = None

     # dist eval resamples data to pad uneven batch sizes
     # make sure num_samples = 0 mod num_gpus for exact acc
     if args.distributed:
         train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
-        val_sampler = torch.utils.data.distributed.DistributedSampler(val_dataset)
+        #val_sampler = torch.utils.data.distributed.DistributedSampler(val_dataset)
+        val_sampler = None
     else:
         train_sampler = None
         val_sampler = None
@@ -194,9 +196,9 @@ def main(args):
         train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
         num_workers=args.workers, pin_memory=True, sampler=train_sampler, drop_last=True)

-    val_loader = torch.utils.data.DataLoader(
-        val_dataset, batch_size=args.batch_size, shuffle=(val_sampler is None),
-        num_workers=args.workers, pin_memory=True, sampler=val_sampler, drop_last=False)
+    #val_loader = torch.utils.data.DataLoader(
+    #    val_dataset, batch_size=args.batch_size, shuffle=(val_sampler is None),
+    #    num_workers=args.workers, pin_memory=True, sampler=val_sampler, drop_last=False)

     if args.evaluate:
         validate_zeroshot(val_loader, model, tokenizer, args)

I couldn't find anywhere in the code where it tries to access yfcc100m_dataset.txt, though. I think that only happens when you run make_dataset.py, which is not needed. With the diff above, I was still able to get the illegal memory access. Sorry, I am off tomorrow; I may reply late tomorrow.

zhaojuanmao commented 3 years ago

now the error is:

Traceback (most recent call last):
  File "main.py", line 551, in <module>
    main(args)
  File "main.py", line 216, in main
    train_stats = train(train_loader, model, criterion, optimizer, scaler, epoch, args)
  File "main.py", line 261, in train
    for i, (images, captions) in enumerate(train_loader):
  File "/home/yanlizhao/venv2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/yanlizhao/venv2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/home/yanlizhao/venv2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/home/yanlizhao/venv2/lib/python3.8/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/yanlizhao/venv2/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/yanlizhao/venv2/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/yanlizhao/venv2/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/yanlizhao/miniclip_sdp/dataset.py", line 48, in __getitem__
    img = loader(path_zip, file_img)
  File "/home/yanlizhao/miniclip_sdp/dataset.py", line 16, in loader
    with zipfile.ZipFile(path_zip, 'r') as myzip:
  File "/home/yanlizhao/local/anaconda3/lib/python3.8/zipfile.py", line 1251, in __init__
    self.fp = io.open(file, filemode)
FileNotFoundError: [Errno 2] No such file or directory: '/datasets01/yfcc100m/090517/images/63/125.zip'

That means it is trying to load train_dataset under the folder '/datasets01/yfcc100m/090517'. How do you get around this? Did you make some changes to the file 'dataset.py'?
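As the traceback shows, the loader in dataset.py opens images straight out of zip shards, which is why a missing shard path raises from zipfile.ZipFile. A self-contained sketch of that pattern (shard and member names below are made up, and the bytes are a fake JPEG header):

```python
import os
import tempfile
import zipfile

# Minimal stand-in for the loader in dataset.py: open an image stored
# inside a zip shard and return its raw bytes.
def loader(path_zip, file_img):
    with zipfile.ZipFile(path_zip, "r") as myzip:
        with myzip.open(file_img) as f:
            return f.read()

with tempfile.TemporaryDirectory() as d:
    # Build a tiny fake shard, like the 125.zip in the traceback.
    shard = os.path.join(d, "125.zip")
    with zipfile.ZipFile(shard, "w") as z:
        z.writestr("img_0.jpg", b"\xff\xd8fake-jpeg-bytes")
    data = loader(shard, "img_0.jpg")

assert data.startswith(b"\xff\xd8")
```

If the shard path does not exist, zipfile.ZipFile raises FileNotFoundError exactly as in the traceback above, so pointing --data at a directory that actually contains the shards is the fix.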

min-xu-ai commented 3 years ago

I see. I wish I had a machine where I could easily reproduce it. Sorry about the back and forth. I think I have it right now.

Please download this file: https://dl.fbaipublicfiles.com/min/fake_data.tgz

Untar it in the miniclip dir:

cd miniclip_sgd
wget https://dl.fbaipublicfiles.com/min/fake_data.tgz
tar zxvf fake_data.tgz

After this, use --data to override the default data dir with this fake_data dir:

PYTHONPATH=../f3 python -m torch.distributed.launch --nproc_per_node=2 --use_env main.py --sharded --data ./fake_data

Please let me know if this works or not.

zhaojuanmao commented 3 years ago

@min-xu-ai Thanks! it works now!

suchenzang commented 3 years ago

@zhaojuanmao For context - does this bug only come up with ShardedGradScaler usage?

zhaojuanmao commented 3 years ago

@suchenzang it does not seem to be related to ShardedGradScaler; it happened only with flatten=True.