facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Huge amount of CPU RAM needed during training #574

Closed davidecaroselli closed 5 years ago

davidecaroselli commented 5 years ago

Hello team!

With the current version of fairseq we noticed that a huge amount of RAM (CPU RAM, not GPU RAM) is required in order to run training. Moreover, this is correlated with the number of GPUs used on the same machine.

So my guess is that the binarized training data is loaded entirely into RAM by every GPU process, so the total CPU RAM needed is roughly: RAM ~= (number of GPUs) * sizeof(binarized data). If this is true, the amount of RAM needed for medium/large training sets is huge (hundreds of GB) compared to the size of the training set itself (less than 100 GB).

If this is the case, why can't we use a memory-mapped training set, so that the amount of RAM depends only on sizeof(binarized data)?
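To illustrate the difference, here is a minimal numpy sketch (the file name and int32 layout are assumptions for illustration, not fairseq's actual loading code):

import numpy as np

# Hypothetical binarized token file; name and dtype are illustrative only.
BIN_PATH = "train.src-tgt.src.bin"

# Eager loading: every GPU process that does this keeps a full private copy in
# CPU RAM, so total usage grows as (number of GPUs) * sizeof(binarized data).
tokens_in_ram = np.fromfile(BIN_PATH, dtype=np.int32)

# Memory mapping: the array is backed by the file; pages are loaded on demand
# into the OS page cache and shared across processes, so resident memory no
# longer scales with the number of GPU processes.
tokens_mmapped = np.memmap(BIN_PATH, dtype=np.int32, mode="r")

# Both support the same indexing, e.g. reading one sentence's span of tokens:
chunk = tokens_mmapped[100:228]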

I'm available to work on this if needed, can you please give me some code context, or a good starting point to begin with?

myleott commented 5 years ago

One option is to use --fix-batches-to-gpus, which will load only 1/N of the data in each process, assuming N workers.

mmapping the whole training set per node is reasonable as well and may have some interesting tradeoffs compared to the --fix-batches-to-gpus option.
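As a rough illustration of the 1/N idea (a sketch of the concept only, not fairseq's actual batching code):

def shard_indices(num_items, rank, world_size):
    # Round-robin assignment of item indices to one worker process, so each
    # process only ever touches 1/N of the data. Illustrative only.
    return list(range(rank, num_items, world_size))

# Example: 10 items split across 2 GPU processes.
assert shard_indices(10, rank=0, world_size=2) == [0, 2, 4, 6, 8]
assert shard_indices(10, rank=1, world_size=2) == [1, 3, 5, 7, 9]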

davidecaroselli commented 5 years ago

From the documentation of --fix-batches-to-gpus I understand that there could be some problems with translation quality (am I correct?), but I don't really understand why. If the whole cluster is supposed to work together to train on the training set without repetitions, why would splitting batches per GPU result in problematic training?

In other words: what is the advantage of having duplicated batches per GPU? If a batch "k" is run on "GPU X", it surely won't be seen by "GPU Y" until the next epoch, is that correct? If that's the case, why should pinning batch "k" to "GPU X" (i.e., never replicating it on the other GPUs) be a problem?

Thanks for the info @myleott! I just want to understand better how fairseq works in order to be more confident in changing the training Dataset implementation.

myleott commented 5 years ago

Using --fix-batches-to-gpus slightly reduces randomness, which can (very slightly) affect model quality when training on many GPUs.

This is because we will never shuffle batches across GPUs, only within a GPU.

Consider the normal case, training with 2 GPUs on a dataset with 4 batches. You might have something like:

             GPU 1     GPU 2
Epoch 0.0   batch 1   batch 2
Epoch 0.5   batch 3   batch 4
Epoch 1.0   batch 1   batch 3
Epoch 1.5   batch 4   batch 2

But with --fix-batches-to-gpus there will be less randomness, since batches 1/3 will always be on GPU 1 and batches 2/4 will always be on GPU 2:

             GPU 1     GPU 2
Epoch 0.0   batch 1   batch 2
Epoch 0.5   batch 3   batch 4
Epoch 1.0   batch 3   batch 4
Epoch 1.5   batch 1   batch 2
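In code terms, the difference is roughly the following (a toy sketch of the two assignment strategies, not fairseq's actual iterator logic):

import random

batches = ["batch 1", "batch 2", "batch 3", "batch 4"]
world_size = 2

def assign_default(epoch):
    # Default (sketch): reshuffle all batches every epoch, then deal them out,
    # so a given batch can land on a different GPU in the next epoch.
    order = batches[:]
    random.Random(epoch).shuffle(order)
    return {rank: order[rank::world_size] for rank in range(world_size)}

def assign_fixed(epoch):
    # --fix-batches-to-gpus (sketch): shard once deterministically, then only
    # shuffle *within* each GPU's shard, so batches never move between GPUs.
    shards = {rank: batches[rank::world_size] for rank in range(world_size)}
    for rank in shards:
        random.Random(epoch * 1000 + rank).shuffle(shards[rank])
    return shards

# Across epochs, assign_default can move a batch between GPUs; assign_fixed never does.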

I'd suggest trying it out, you may find very similar or even identical results.

davidecaroselli commented 5 years ago

Hi @myleott

thanks for the explanation, this makes total sense! Unfortunately this option does not seem to reduce CPU RAM consumption. I just tried to run my training on an 8-GPU/128GB RAM machine and the training failed again (side note: on an 8-GPU/256GB RAM machine the training runs smoothly).

So I would like to work on the memory-mapped Dataset, because I think it can definitely solve this issue. Can you point me to the parts of the code that create the current Dataset and where it is used during training, so that I can start digging into the code?

Thanks for your help!

PS: Can you re-open the issue? I think it would be useful for tracking the status of the development and as the main place to share feedback or request further info during the work!

myleott commented 5 years ago

Ah, you're right. If you're okay with slightly higher I/O from seeking, you can use the --lazy-load option [1].

This will switch to using an IndexedDataset instead of IndexedCachedDataset [2], which will seek instead of loading everything into memory.

If you want to make a memory mapped version, then you can extend IndexedDataset [3].

[1] https://github.com/pytorch/fairseq/blob/master/fairseq/tasks/translation.py#L56-L57
[2] https://github.com/pytorch/fairseq/blob/master/fairseq/tasks/translation.py#L141-L144
[3] https://github.com/pytorch/fairseq/blob/master/fairseq/data/indexed_dataset.py#L50
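A minimal sketch of what such a memory-mapped PyTorch Dataset could look like (the .bin/.idx layout assumed here, a flat int32 token stream plus an int64 offsets array, is hypothetical and differs from fairseq's real on-disk format):

import numpy as np
import torch
from torch.utils.data import Dataset

class MemoryMappedTokenDataset(Dataset):
    def __init__(self, bin_path, idx_path):
        # Offsets delimit sentences: sentence i spans [offsets[i], offsets[i+1]).
        self.offsets = np.memmap(idx_path, dtype=np.int64, mode="r")
        self.tokens = np.memmap(bin_path, dtype=np.int32, mode="r")
        # Sentence lengths, useful for length-based batching.
        self.sizes = np.diff(self.offsets)

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        start, end = self.offsets[i], self.offsets[i + 1]
        # Only the touched pages are read from disk; nothing is preloaded.
        return torch.from_numpy(np.array(self.tokens[start:end], dtype=np.int64))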

davidecaroselli commented 5 years ago

Hi @myleott !

I have an initial implementation of the Dataset, and it seems to work correctly. The problem is that I am having a hard time trying to "plug" the implementation into the flow of fairseq. What I see is that fairseq preprocess/train is entangled with the internal implementation of the Dataset (i.e., the .bin and .idx files are accessed directly from outside the class itself!).

More problematic, here I see a bunch of public variables (https://github.com/pytorch/fairseq/blob/master/fairseq/data/indexed_dataset.py#L68) that are not compatible with the implementation I need. Even more relevant, the IndexedRawTextDataset implementation does not expose them.

So basically I'm having trouble understanding which is the "public" API every Dataset has to implement and which is instead private implementation detail. More importantly, the logic to load/delete/create a Dataset with the right implementation is spread and duplicated across multiple files (i.e., every Task seems to re-implement the loading depending on arguments).

So my question is: can I try to create a solid, consistent FairseqDataset interface that declares all and only the public methods of every dataset? This first step would help me a lot in creating a new Dataset implementation.

davidecaroselli commented 5 years ago

Wait, I am confused.

There is already a FairseqDataset in fairseq_dataset.py, but why does no class in indexed_dataset.py extend it?

myleott commented 5 years ago

The "dataset" name is a bit overloaded. IndexedDataset is a standalone module that implements PyTorch's Dataset API.

There is also the FairseqDataset API, which is more closely connected to FairseqTasks. Currently the main two FairseqDatasets are LanguagePairDataset (for translation) and MonolingualDataset (for language modeling). You can extend this, but generally you'd be better off extending TranslationTask and returning a LanguagePairDataset that consists of PyTorch Datasets of your own (e.g., IndexedDataset, MemoryMappedIndexedDataset, etc.).
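For example, a rough sketch of that approach, reusing the MemoryMappedTokenDataset sketched above (the task name and path handling are made up, and the LanguagePairDataset arguments should be checked against the constructor in your fairseq version):

import os

from fairseq.data import LanguagePairDataset
from fairseq.tasks import register_task
from fairseq.tasks.translation import TranslationTask

@register_task("mmap_translation")
class MMapTranslationTask(TranslationTask):
    def load_dataset(self, split, combine=False, **kwargs):
        # Path handling is simplified for the sketch.
        src, tgt = self.args.source_lang, self.args.target_lang
        prefix = os.path.join(self.args.data, f"{split}.{src}-{tgt}")
        src_ds = MemoryMappedTokenDataset(f"{prefix}.{src}.bin", f"{prefix}.{src}.idx")
        tgt_ds = MemoryMappedTokenDataset(f"{prefix}.{tgt}.bin", f"{prefix}.{tgt}.idx")
        # Wrap the plain PyTorch Datasets in a LanguagePairDataset for the task.
        self.datasets[split] = LanguagePairDataset(
            src_ds, src_ds.sizes, self.src_dict,
            tgt_ds, tgt_ds.sizes, self.tgt_dict,
        )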

davidecaroselli commented 5 years ago

Thanks @myleott !

I have just completed the mmap implementation on my fork: https://github.com/davidecaroselli/fairseq/blob/features/mmap_dataset/fairseq/data/indexed_dataset.py#L264
I am now testing whether the two implementations (IndexedDataset and MMapDataset) return the same lines after fairseq-preprocess.

In the meantime, I would like to replace the dataset flags --output-format and --lazy-load with a single --dataset-type flag with possible values raw, lazy, cached, mmap, so that it is much easier to select the correct Dataset implementation. Do you agree with such a change? If not, what do you suggest?

myleott commented 5 years ago

Cool, that sounds reasonable. It’d be nice to reuse this logic in both preprocess and the Tasks (Translation and LanguageModel).

I’m also unsure if the name might be confusing, maybe dataloader-type instead? Although of course that’s overloaded with the PyTorch concept of a DataLoader :/

davidecaroselli commented 5 years ago

I’m also unsure if the name might be confusing, maybe dataloader-type instead? Although of course that’s overloaded with the PyTorch concept of a DataLoader :/

I see. For my understanding, this option should be available in both preprocess.py and train.py, right? Something like --dataset-impl? I would like to keep a --dataset-xxx name because you're actually selecting the right implementation of a class called Dataset; IMHO using --dataloader could lead to more confusion.

myleott commented 5 years ago

--dataset-impl sounds good. I wonder if we can also define this directly in the Task, and add necessary APIs to Task that can be called in preprocess. In general we’re trying to make Task become the primary interface for this kind of stuff, since we’d like to support a wider variety of tasks using the same CLI tools.
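A sketch of how the flag and a single dispatch point could look, so that preprocess and the Tasks share the same logic (the helper names are made up, the existing constructor signatures are abbreviated and should be checked against indexed_dataset.py, and MMapIndexedDataset refers to the new class from the fork):

DATASET_IMPL_CHOICES = ["raw", "lazy", "cached", "mmap"]

def add_dataset_impl_arg(parser):
    # Shared CLI flag, registered by both preprocess.py and the Tasks.
    parser.add_argument("--dataset-impl", choices=DATASET_IMPL_CHOICES,
                        default="cached",
                        help="which Dataset implementation backs the binarized data")

def make_dataset(impl, path, dictionary):
    # Single place that maps the flag value to a Dataset class.
    from fairseq.data.indexed_dataset import (
        IndexedCachedDataset, IndexedDataset, IndexedRawTextDataset)
    if impl == "raw":
        return IndexedRawTextDataset(path, dictionary)
    elif impl == "lazy":
        return IndexedDataset(path)
    elif impl == "cached":
        return IndexedCachedDataset(path)
    elif impl == "mmap":
        return MMapIndexedDataset(path)  # the new memory-mapped implementation
    raise ValueError(f"unknown --dataset-impl: {impl}")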

davidecaroselli commented 5 years ago

I have just submitted the pull request. I have verified with read_binarized.py that the MMapIndexedDataset contains and returns exactly the same data as IndexedDataset. I'm now running a larger-scale experiment; I will report results here!
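The check itself can be as simple as the following sketch (read_binarized.py does the real comparison through its text output; the helper below is just an illustration):

import torch

def datasets_match(ds_a, ds_b):
    # Compare two dataset implementations item by item.
    if len(ds_a) != len(ds_b):
        return False
    return all(torch.equal(ds_a[i].long(), ds_b[i].long())
               for i in range(len(ds_a)))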

I wonder if we can also define this directly in the Task, and add necessary APIs to Task that can be called in preprocess

While I agree that the Task should hold and expose all the details of the problem you're modeling, the dataset implementation does not seem to be one of them to me. For example, a Translation Task supports all the Dataset implementations; this is not something specific to a Task, but rather a "choice" of the user.

davidecaroselli commented 5 years ago

Hi @myleott, any updates on this? I've seen you're working on some pull requests about a "sharded dataset"; does that solve this issue?

I finally have a bit of spare time to test my memory-mapped solution, but at first glance it does not seem to help, and I have no idea why! Does the training run some kind of pre-loading?

If you managed to solve this issue in some other way, I would be more than happy to discard my implementation!

Please let me know if and how I can help. Davide

myleott commented 5 years ago

Hey! Yes, we've been running into these same issues and have some solutions on the way. --lazy-load works somewhat, but still requires loading the whole index into memory. I also tried the mmap solution, but it doesn't seem to limit overall memory usage.

The sharding approach seems to work, though; you can follow along here: #696. The idea is that you input a list of datasets (colon-separated), and each one becomes its own "epoch".

myleott commented 5 years ago

I saw you have a new commit where you're seeing big memory savings? I'll try it out :)

davidecaroselli commented 5 years ago

Hi @myleott, yes! I spotted the problem with my previous implementation: I was creating all the tensors at startup, so the problem was the overhead; even with mmapped data, the tensors themselves required too much memory.

This new version creates tensors lazily over a single mmapped memoryview. I was initially worried about the time overhead, but surprisingly I measure exactly the same wps (words per second) as the regular cached version, which is great!
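Conceptually the change is the following (a toy sketch of the two strategies, not the actual fork code):

import numpy as np
import torch

class EagerItems:
    # Previous attempt (sketch): build every tensor up front, so the per-item
    # object overhead alone blows up RAM even though the data is mmapped.
    def __init__(self, tokens, offsets):
        self.items = [torch.from_numpy(np.array(tokens[s:e]))
                      for s, e in zip(offsets[:-1], offsets[1:])]

    def __getitem__(self, i):
        return self.items[i]

class LazyItems:
    # Revised approach (sketch): keep only the mmapped buffer plus offsets and
    # wrap a slice in a tensor on demand inside __getitem__.
    def __init__(self, tokens, offsets):
        self.tokens, self.offsets = tokens, offsets

    def __getitem__(self, i):
        s, e = self.offsets[i], self.offsets[i + 1]
        return torch.from_numpy(np.array(self.tokens[s:e]))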

I have also run some measures of RAM usage, here's my results:

MODEL SIZE TEST
Here I have used a tiny training set, so the RAM usage is due only to the network model itself; this is basically the base RAM consumption, independent of training set size.

1 GPU      2 GPUs     4 GPUs     8 GPUs
1931 MB    3884 MB    7646 MB    15469 MB

So we can say that, for a transformer base model, fairseq requires ~1920 MB of CPU RAM per GPU.

CACHED DATASET
This is the same base transformer model training, but with a 12.6M-line cached dataset.

1 GPU      2 GPUs     4 GPUs     8 GPUs
6093 MB    9243 MB    15321 MB   27472 MB

By removing the model overhead:

1 GPU      2 GPUs     4 GPUs     8 GPUs
4162 MB    5359 MB    7675 MB    12003 MB

This is the size of the dataset in RAM. Here I see something I did not expect: the memory consumption is less than linear in the number of GPUs. Since every GPU process re-loads the dataset entirely, I expected a linear dependency. Maybe there is some problem with the measurements?

MMAP DATASET
This is the same base transformer model training, but with a 12.6M-line memory-mapped dataset.

1 GPU      2 GPUs     4 GPUs     8 GPUs
2424 MB    4867 MB    9591 MB    19052 MB

By removing the model overhead (we have 2.9 GB of buff/cache for the memory-mapped dataset):

1 GPU      2 GPUs     4 GPUs     8 GPUs
493 MB     983 MB     1945 MB    3583 MB

So here we see some savings. I'm not sure why I still have ~493 MB per GPU (the whole dataset is memory mapped, so it should not appear in resident memory). I think this is still some model-dependent data structure.

davidecaroselli commented 5 years ago

Good news!

During the night I ran the training on my largest dataset, which I was previously not able to run on my 8-GPU server with 256 GB of RAM.

With the new MMapDataset I was able to run it with no problem, with this total RAM consumption:

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           251G         80G         78G        101M         92G        169G
Swap:            0B          0B          0B

myleott commented 5 years ago

That's awesome! I was just using it a bit too, and it seems to be working for me as well. I did make a change so that it works with num_workers > 0, which requires the object to be pickleable. What do you think? https://github.com/davidecaroselli/fairseq/pull/1

davidecaroselli commented 5 years ago

Thanks @myleott, but this line is worrying me: https://github.com/davidecaroselli/fairseq/pull/1/files#diff-3458fbbf9c9c5919b005fe5b64b862ecR358

If you make a clone of the sizes tensor, it will be cloned for every GPU (so it is no longer memory mapped). However, I found a solution by excluding those fields from the pickled state. I will update the code and get back to you asap!
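A minimal sketch of that solution, assuming an mmap-backed dataset class (names are illustrative): the mmap handle is dropped from the pickled state and re-created in the worker process, so nothing gets cloned per GPU or per DataLoader worker.

import numpy as np

class MMapBackedDataset:
    def __init__(self, path):
        self._path = path
        self._do_init(path)

    def _do_init(self, path):
        # Re-creatable, non-picklable state lives here.
        self._tokens = np.memmap(path, dtype=np.int32, mode="r")

    def __getstate__(self):
        # Only the path travels through pickle; the mmap buffer is excluded.
        return {"_path": self._path}

    def __setstate__(self, state):
        self._path = state["_path"]
        self._do_init(self._path)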

davidecaroselli commented 5 years ago

Fix pushed, @myleott: the class should now be pickleable! I have also updated the memory-map file warmup to a much faster implementation.
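For reference, a faster warmup can be as simple as streaming the file sequentially in large chunks to prime the OS page cache (a sketch under that assumption, not necessarily the exact change that was pushed):

def warmup_mmap_file(path, chunk_size=1024 * 1024):
    # Sequential reads in big chunks populate the page cache far faster than
    # touching the mapped data element by element.
    with open(path, "rb") as f:
        while f.read(chunk_size):
            pass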

PS: just curious, what is fairseq doing after the log line | distributed init (rank 0): tcp://localhost:10614 (one for each GPU)? I see RAM ramping up from ~40 GB to ~90 GB, and I would like to understand what is requesting so much RAM now that the dataset no longer does.

frankang commented 3 years ago

Just to provide another workaround for the RAM issue: I wrapped the mmap dataset with a customized dataset and got the following error:

"RuntimeError: DataLoader worker (pid xxx) is killed by signal: Segmentation fault.
...
_pickle.UnpicklingError: pickle data was truncated

It is probably caused by some unpickleable data used in my customized dataset class, but I don't have enough time to look into it, so I tried the --dataset-impl=lazy option (with data placed on a SATA SSD); it worked, but the speed decreased by 30%, and I can clearly see from nvidia-smi that the GPUs are not fully saturated. Then I created a ramfs mount following this link (https://unix.stackexchange.com/questions/66329/creating-a-ram-disk-on-linux), and voilà, it runs smoothly. The speed only decreased by about 7%, which is fairly acceptable to me.