NVIDIA / sentiment-discovery

Unsupervised Language Modeling at scale for robust sentiment classification

MemoryError - Amazon dataset #31

Closed adryyandc closed 6 years ago

adryyandc commented 6 years ago

Hi,

I get a MemoryError while trying to train on the Amazon aggressively deduplicated data. I have 64 GB of memory and a 1080 Ti installed on my system.

I run the command inside an LXD container.

root@ub-16-sentiment:~/work/sentiment-discovery# python3 main.py --data /home/adrianc/work/Sentiment/dataset/amazon/aggressive_dedup.json --lazy --loose_json --text_key reviewText --label_key overall --num_shards 1002 --optim Adam --split 1000,1,1
configuring data
Traceback (most recent call last):
  File "main.py", line 135, in <module>
    train_data, val_data, test_data = data_config.apply(args)
  File "/root/work/sentiment-discovery/configure_data.py", line 16, in apply
    return make_loaders(opt)
  File "/root/work/sentiment-discovery/configure_data.py", line 63, in make_loaders
    train = data_utils.make_dataset(**data_set_args)
  File "/root/work/sentiment-discovery/data_utils/__init__.py", line 133, in make_dataset
    binarize_sent=binarize_sent, delim=delim, drop_unlabeled=drop_unlabeled, loose=loose)
  File "/root/work/sentiment-discovery/data_utils/__init__.py", line 103, in handle_lazy
    binarize_sent=binarize_sent, delim=delim, drop_unlabeled=drop_unlabeled, ds=data_set)
  File "/root/work/sentiment-discovery/data_utils/__init__.py", line 54, in get_lazy
    make_lazy(processed_path, ds.X, data_type=data_shard)
  File "/root/work/sentiment-discovery/data_utils/lazy_loader.py", line 33, in make_lazy
    f.write(''.join(strs))
MemoryError

The problem is that memory usage goes beyond 64 GB.

Regards, Adrian

raulpuric commented 6 years ago

Would you mind trying this for the make_lazy function?

import os
import pickle as pkl
import time

import torch

def make_lazy(path, strs, data_type='data'):
    """make lazy version of file"""
    lazypath = get_lazy_path(path)
    if not os.path.exists(lazypath):
        os.makedirs(lazypath)
    datapath = os.path.join(lazypath, data_type)
    lenpath = os.path.join(lazypath, data_type+'.len.pkl')
    if not torch.distributed._initialized or torch.distributed.get_rank() == 0:
        # Stream each string straight to disk instead of materializing
        # ''.join(strs) in memory, and record the cumulative end offset
        # of every string so individual strings can be seeked to later.
        with open(datapath, 'w') as f:
            str_ends = []
            str_cnt = 0
            for s in strs:
                f.write(s)
                str_cnt += len(s)
                str_ends.append(str_cnt)
        with open(lenpath, 'wb') as f:
            pkl.dump(str_ends, f)
    else:
        # Other ranks wait until rank 0 has finished writing.
        while not os.path.exists(lenpath):
            time.sleep(1)
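
For context, a minimal read-back sketch (a hypothetical helper, not the repo's actual lazy_array_loader) showing how the cumulative ends in the .len.pkl file let a loader seek to any single string without reading the whole data file:

import pickle as pkl

def read_string(datapath, lenpath, i):
    """Hypothetical reader: pull string i back out via its recorded offsets."""
    with open(lenpath, 'rb') as f:
        str_ends = pkl.load(f)
    start = str_ends[i - 1] if i > 0 else 0
    with open(datapath, 'rb') as f:
        f.seek(start)
        return f.read(str_ends[i] - start).decode('utf-8')

One caveat: make_lazy records character counts (len(s) on a str), which coincide with byte offsets only for pure-ASCII text. That mismatch becomes relevant to the UnicodeDecodeError that shows up later in this thread.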
adryyandc commented 6 years ago

That function fixed the error, but I get another error:

root@ub-16-sentiment:~/work/sentiment-discovery# python3 main.py --data /home/adrianc/work/Sentiment/dataset/amazon/aggressive_dedup.json --lazy --loose_json --text_key reviewText --label_key overall --num_shards 1002 --optim Adam --split 1000,1,1
configuring data
Creating mlstm
* number of parameters: 86294784
Traceback (most recent call last):
  File "main.py", line 303, in <module>
    val_loss = train(total_iters)
  File "main.py", line 233, in train
    for i, batch in enumerate(train_data):
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 451, in __iter__
    return _DataLoaderIter(self)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 239, in __init__
    w.start()
  File "/root/anaconda3/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/root/anaconda3/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/root/anaconda3/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/root/anaconda3/lib/python3.6/multiprocessing/popen_fork.py", line 26, in __init__
    self._launch(process_obj)
  File "/root/anaconda3/lib/python3.6/multiprocessing/popen_fork.py", line 73, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

Adrian
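
For anyone hitting the same OSError: each DataLoader worker is started via os.fork(), and forking a parent process that already holds a very large dataset in memory can exceed the kernel's memory accounting, especially under container limits. A minimal workaround sketch, assuming a standard torch.utils.data.DataLoader (the training script exposes a similar num_workers knob, as discussed later in this thread):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical sketch: num_workers=0 keeps batching in the main process,
# so no worker process is forked and os.fork() cannot fail with ENOMEM.
dataset = TensorDataset(torch.zeros(8, 4))  # stand-in for the real dataset
loader = DataLoader(dataset, batch_size=2, num_workers=0)
for (batch,) in loader:
    pass  # training step would go here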

raulpuric commented 6 years ago

Let me find a machine with 64GB to repro. What is LXD?

adryyandc commented 6 years ago

lxd --version reports 2.18, with Ubuntu 16.04, and I installed the tools via conda.

Do you think this is related to GPU memory? I can do some verification!

adryyandc commented 6 years ago

Never mind, it got fixed when I installed the latest GPU driver:

wget https://us.download.nvidia.com/XFree86/Linux-x86_64/390.67/NVIDIA-Linux-x86_64-390.67.run

adryyandc commented 6 years ago

Now I get this error:

root@ub-16-sentiment:~/work/sentiment-discovery# python3 main.py --data /home/adrianc/work/Sentiment/dataset/amazon/aggressive_dedup.json --lazy --loose_json --text_key reviewText --label_key overall --num_shards 1002 --optim Adam --split 1000,1,1
configuring data
Creating mlstm
* number of parameters: 86294784
main.py:266: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  cur_loss = total_loss[0] / args.log_interval
| epoch   1 |     0/1180027 batches | lr 5.00E-04 | ms/batch 5.407E+01 |                   loss 5.55E-02 | ppl     1.06 | loss scale     1.00
Traceback (most recent call last):
  File "main.py", line 303, in <module>
    val_loss = train(total_iters)
  File "main.py", line 233, in train
    for i, batch in enumerate(train_data):
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 286, in __next__
    return self._process_next_batch(batch)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 307, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
TypeError: function takes exactly 5 arguments (1 given)
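
A plausible reading of this opaque TypeError (an inference from the traceback, not something stated in the thread): _process_next_batch re-raises worker exceptions by calling batch.exc_type(batch.exc_msg), i.e. the original exception class with a single string argument. UnicodeDecodeError's constructor takes exactly five arguments, so if a worker raised one, the re-raise itself fails and masks the real error:

# Reconstructing an exception from just its message only works for
# exception types with a one-argument constructor:
UnicodeDecodeError('just the pickled message')
# TypeError: function takes exactly 5 arguments (1 given)

Running with num_workers: 0, as done later in the thread, surfaces the underlying exception directly.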
raulpuric commented 6 years ago

Hmmm. Someone else seems to be getting the same issue with the latest PyTorch version. Let me look into this a little bit more.

adryyandc commented 6 years ago

OK. To tell the truth, I installed it with conda:

root@ub-16-sentiment:~/work/sentiment-discovery# conda list | grep pytorch
cuda90     1.0      h6433d27_0                      pytorch
pytorch    0.4.0    py36_cuda9.0.176_cudnn7.1.2_1   [cuda90]  pytorch

raulpuric commented 6 years ago

So I am not able to get the same error while on master in my container.

There are several things to try here: running python main.py --lazy on the default data (to see whether the issue is specific to the Amazon data), running with num_workers: 0, and running with pin_memory: False.

adryyandc commented 6 years ago

I will try these tomorrow when I get to work.

raulpuric commented 6 years ago

Let me also know if this PR works for you. https://github.com/NVIDIA/sentiment-discovery/pull/32

This works for me on the public pt0.4 container.

adryyandc commented 6 years ago

With the lazy patch I get the following error:

root@ub-16-sentiment:~/work/sentiment-discovery# python3 main.py --data /home/adrianc/work/Sentiment/dataset/amazon/aggressive_dedup.json --lazy --loose_json --text_key reviewText --label_key overall --num_shards 1002 --optim Adam --split 1000,1,1
configuring data
Traceback (most recent call last):
  File "main.py", line 135, in <module>
    train_data, val_data, test_data = data_config.apply(args)
  File "/root/work/sentiment-discovery/configure_data.py", line 16, in apply
    return make_loaders(opt)
  File "/root/work/sentiment-discovery/configure_data.py", line 63, in make_loaders
    train = data_utils.make_dataset(**data_set_args)
  File "/root/work/sentiment-discovery/data_utils/__init__.py", line 133, in make_dataset
    binarize_sent=binarize_sent, delim=delim, drop_unlabeled=drop_unlabeled, loose=loose)
  File "/root/work/sentiment-discovery/data_utils/__init__.py", line 112, in handle_lazy
    binarize_sent=binarize_sent, delim=delim, drop_unlabeled=drop_unlabeled)
  File "/root/work/sentiment-discovery/data_utils/__init__.py", line 56, in get_lazy
    return lazy_array_loader(processed_path, data_type=data_shard)
  File "/root/work/sentiment-discovery/data_utils/lazy_loader.py", line 68, in __init__
    self.read_lock = Lock()
NameError: name 'Lock' is not defined

P.S.: I fixed it with import threading [...] and threading.Lock().
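
For reference, the fix described in the P.S. amounts to guarding the shared file handle with a lock, roughly like this (a hypothetical condensed sketch, not the repo's exact lazy_array_loader):

import threading

class LazyReader:
    """Sketch: serialize seek+read pairs so concurrent __getitem__ calls
    cannot interleave reads on the shared file handle."""
    def __init__(self, path):
        self.f = open(path, 'rb')
        self.read_lock = threading.Lock()

    def file_read(self, start, end):
        with self.read_lock:
            self.f.seek(start)
            return self.f.read(end - start)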

But i still get the error:

  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 307, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
TypeError: function takes exactly 5 arguments (1 given)

I will try compiling PyTorch from source. What PyTorch commit hash do you use?

1) Running python main.py --lazy works just fine:

root@ub-16-sentiment:~/work/sentiment-discovery# python main.py --lazy
configuring data
Creating mlstm
* number of parameters: 86294784
main.py:266: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  cur_loss = total_loss[0] / args.log_interval
| epoch   1 |     0/ 1930 batches | lr 5.00E-04 | ms/batch 3.202E+01 |                   loss 5.54E-02 | ppl     1.06 | loss scale     1.00
| epoch   1 |   100/ 1930 batches | lr 4.74E-04 | ms/batch 2.947E+03 |                   loss 2.46E+00 | ppl    11.68 | loss scale     1.00

2) Running with num_workers: 0, I get this error (see the note after item 3 for a likely cause):

Traceback (most recent call last):
  File "main.py", line 303, in <module>
    val_loss = train(total_iters)
  File "main.py", line 233, in train
    for i, batch in enumerate(train_data):
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 264, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 264, in <listcomp>
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/root/work/sentiment-discovery/data_utils/datasets.py", line 375, in __getitem__
    s = self.all_strs[other_str_idx]
  File "/root/work/sentiment-discovery/data_utils/lazy_loader.py", line 79, in __getitem__
    return self.file_read(start, end)
  File "/root/work/sentiment-discovery/data_utils/lazy_loader.py", line 109, in file_read
    rtn = rtn.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 2826-2827: unexpected end of data

3) Running with pin_memory: False, I get the same error:

  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 307, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
TypeError: function takes exactly 5 arguments (1 given)
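
A likely cause of the UnicodeDecodeError in item 2 (inferred from the make_lazy code above, not confirmed in the thread): str_ends records character counts (str_cnt += len(s)), but file_read seeks and slices raw bytes before decoding. For any non-ASCII review text, byte and character offsets drift apart, so a read can end in the middle of a multi-byte UTF-8 character:

s = 'café'                                 # 4 characters, but 5 bytes in UTF-8
s.encode('utf-8')[:len(s)].decode('utf-8')
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 3:
# unexpected end of data

With multiple workers, the same UnicodeDecodeError would then be masked as the TypeError above, since re-raising it with a single argument fails.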
raulpuric commented 6 years ago

My apologies. I've just added the import statement back.

The first command indicates to me that this is specific to the Amazon dataset or the --loose_json flag. Let me re-download a fresh copy of the Amazon dataset and test it myself.

Thanks for your patience.

adryyandc commented 6 years ago

No problem!

Maybe my conda installation is the problem. I tried with a smaller dataset (Amazon 5-core) and I get another error:

root@ub-16-sentiment:~/work/sentiment-discovery# python3 main.py --data /home/adrianc/work/Sentiment/dataset/amazon/kcore5.json --lazy --loose_json --text_key reviewText --label_key overall --num_shards 1002 --optim Adam --split 1000,1,1
configuring data
Creating mlstm
* number of parameters: 86294784
main.py:266: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  cur_loss = total_loss[0] / args.log_interval
| epoch   1 |     0/677428 batches | lr 5.00E-04 | ms/batch 4.036E+01 |                   loss 5.55E-02 | ppl     1.06 | loss scale     1.00
Traceback (most recent call last):
  File "main.py", line 303, in <module>
    val_loss = train(total_iters)
  File "main.py", line 233, in train
    for i, batch in enumerate(train_data):
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 286, in __next__
    return self._process_next_batch(batch)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 307, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/root/work/sentiment-discovery/data_utils/loaders.py", line 65, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/root/work/sentiment-discovery/data_utils/loaders.py", line 65, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/root/work/sentiment-discovery/data_utils/loaders.py", line 51, in default_collate
    return torch.stack([torch.from_numpy(b) for b in batch], 0)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 257 and 261 in dimension 1 at /opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/TH/generic/THTensorMath.c:3586

It is not related to the --loose_json flag, since I get the same error with or without it.
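
The RuntimeError here means the collate function is stacking sequences of different lengths (257 vs. 261 in dimension 1) into one tensor. The usual remedy is to pad each batch to its longest sequence before stacking; a minimal sketch with a hypothetical pad_collate (not the repo's default_collate), assuming each sample is a 1-D numpy array:

import numpy as np
import torch

def pad_collate(batch):
    """Hypothetical sketch: zero-pad variable-length 1-D numpy arrays on the
    right so torch.stack sees equal sizes in every dimension."""
    max_len = max(len(b) for b in batch)
    padded = [np.pad(b, (0, max_len - len(b)), mode='constant') for b in batch]
    return torch.stack([torch.from_numpy(p) for p in padded], 0)

# pad_collate([np.ones(257, dtype=np.int64), np.ones(261, dtype=np.int64)])
# -> tensor of shape (2, 261)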

raulpuric commented 6 years ago

Try #32 again. I believe it fixes the UnicodeDecodeError.

adryyandc commented 6 years ago

Thank you! It seems to be working now, at least for a few hundred batches.

raulpuric commented 6 years ago

Awesome. Going to merge the change in then.