Improve our use of tf.Dataset

jasonrute commented 1 year ago

In the TFGNN model we are using tf.Dataset to make a data processing pipeline. It has certain advantages and disadvantages that we are bumping up into. In this issue, I'm gathering details so that we can have a complete picture of the situation.

The good

Datasets are basically required (as I understand it) to split your data on multiple GPUs. So we would at least need a Dataset at some stage of the pipeline.
Datasets make it very easy to work with individual graphs and then batch them into batched graphs when the time comes for batched data. This greatly improves the maintainability of some parts of the code.
It is a standard tensorflow API for data so it is fully featured.
It makes it easy for us to mix the definitions dataset with the proof states dataset.
It gives us a convenient way to cache the data in memory (or in files).
To the extent that we can cache our processed dataset in memory, it makes it really fast to work with the dataset over hundreds of epochs.

The bad

Our current use of tf.Dataset doesn't scale well and is slow. Ideally we wouldn't need to worry about shuffling or caching the data in the dataset and instead let our loader handle this, but the problem is that processing the data is too slow. (See the observations section below for some thoughts on what is slow.)
The dataset needs to be compiled. This is fine if you only have one super dataset pipeline, but was bad when we tried in the predict server to make a new dataset for every new definition.

Observations

Ideally we would like this dataset processing to be fast enough that we don't need to cache or shuffle the data outside of the loader. But this may also not be practical and we need to look into better ways to cache and shuffle the preprocessed data.
I haven't done exact timings, but processing the 70,000 definitions seems to take a lot longer than the 500,000 proof states. This could be either because the definitions are much larger graphs or because there is some additional step to the definition processing which is making it slow. Figuring out what steps are slow could make a big difference in speeding up the code without "throwing the baby out with the bathwater".
We also have some abilities to process the graphs outside the dataset. This was required for the predict server. We should compare timing of the two approaches, and maybe considering doing some of the processing outside the dataset.
I'm experimenting right now with recomputing definitions between embeddings. I'm not using the dataset pipeline, but I wonder if I should. I guess it is a good way to compare the speeds of both approaches.
There are whole articles on optimizing the dataset pipeline that we could look into.

mirefek commented 1 year ago

How would we like to in general want the caching to behave? Either, we can load all the data into the memory in the format ready for the network (but then, there can be the scaling issue if the data don't fit into memory), or we want to reload all the data each epoch (which can be slow). Or are there any other options? (like saving a cache to hard drive in TF format, I am not sure how feasible it would be). I would like to first properly understand our aim, and then we can try to figure out the technical details of whether we want to use tf.Dataset, or not. Note that the loader basically offers random access to the datapoints (but it is computing them every time they are accessed).

LasseBlaauwbroek commented 1 year ago

To me, the basic requirements of training are like this:

During one training epoch, we train on randomly generated batches (subsets) of the dataset. A batch is basically the size of GPU memory.
At some point, we will no longer be able to hold the entire training data into RAM. Hence, we will have to load it in batches. These batches can either be the same size as the GPU-batch, or a more granular 'RAM-batch'.
The loader will have to be fast enough such that one batch can be fetched while the previous batch is being processed by the GPU. (It's hard to imagine this being a problem, the loader seems relatively fast now.)

To me, it seems that the proper way of dealing with this is as follows:

When training starts, we load the dataset into mmap memory (note that this is an operation that takes minimal memory regardless of the size of the dataset, this is the beauty of Captn Proto). Then, we calculate an index that contains the root nodes of all proof states and definitions we want to train on. This index is kept in memory permanently, but should be fairly small.
While epoch n is being trained, the next epoch n+1 is being prepared: The index from (1) is randomly permuted and split into batches. This should be pretty quick, but if it is not, we have the entire epoch to calculate it.
While batch i of epoch n is being trained, batch i+1 is fetched from the Captn Proto dataset. That is, we calculate the forward closures of all the root nodes in the batch and load them into a Numpy array or whatever other datastructure.
We pray that fetching a batch is fast enough to keep the GPU busy. But if it is not, I guess we can parallelize this.

mirefek commented 1 year ago

I see, so we would like the Dataset to look ahead, and prepare a batch it was not asked about yet... I would have to look more into Dataset to see if it is happening by default, or how to do it. Also, at some point, we were considering moving the graph loader into Cython in case that would be a bottleneck (but I think we concluded there were more serious speed issues).

LasseBlaauwbroek commented 1 year ago

Note that if it is indeed the case that we need tf.Dataset in order to parallelize over multiple GPU's, then I propose this scheme: A tf.Dataset basically corresponds to a 'RAM-batch'. This is the largest size that we are willing to load into RAM. The tf.Dataset can then split this batch into smaller 'GPU-batches'.

LasseBlaauwbroek commented 1 year ago

The only other alternative I see is to load the entire Capt'n Proto dataset into a tf.Dataset. But if we run out of memory, then the tensorflow code will have to be responsible for swapping part of the tf.Dataset to disk. Does this functionality exist? (I would think so, because surely we are not the only ones with datasets that exceed RAM?)

LasseBlaauwbroek commented 1 year ago

Looking though the tensorflow API, it looks to me like a lot of what I describe can be easily done using a combination of the prefetch functionality and the from_generator functionality.

LasseBlaauwbroek commented 1 year ago

I spend some time digging through the codebase and through the tensorflow documentation. My impression is that using tf.data.Dataset is a good idea in general, and there is no reason why we can't have our cake and eat it too. Here is what I would suggest as a 'plan of attack':

Get rid of the old C++ loader code (it clutters up the repo, and the next steps will break it)
Is the tfgnn model now superior in every way to the tf2 model? If so, let's get rid of the tf2 model (if we ever need it again, it is in git's history).
Let's merge tfgnn.dataset.Dataset and loader.py_data_server.DataServer into one class. This will save a lot of transformations in the pipeline and simplify the code. I don't really see any reason why we need two classes here.
Move the shuffling of the data as early into the pipeline as possible. Ideally at the first step, where a proof state and definition is still a single root node. The shuffle function requires a buffer, which scales linearly with the size of the data in the buffer. Hence, it is much cheaper to do this early in the pipeline. When the shuffling is the first step in the pipeline, it does not have to take any memory at all. Currently, the shuffling is happening way too late.
Remove any calls to cache. There are currently multiple in the pipeline. At most, there should be one call, but ideally none at all.
Experiment with different prefetch buffer sizes (instead of AUTOTUNE). Use the TF Profiler to see if we have any bottlenecks.
If there are still bottlenecks, experiment with adding num_parallel_calls to batch and any remaining map calls in the pipeline.
If there are still bottlenecks, implement the forward closure computation in Cython.
If there are still bottlenecks, use the TF interleave function or plain old multiprocessing to feed data into the pipeline in parallel.
If there are still bottlenecks, go sit in the corner and cry.

fidel-schaposnik commented 1 year ago

While I'm not very up-to-date on the code-level details of how the tf.data.Dataset is being used right now, let me just point out that tf.data.Dataset.cache supports on-disk caching too, and if I recall correctly this included "fancy" features such as sharding (for when e.g. you need to split the data into multiple files because otherwise it's too big or too slow to read a batch from a single disk). So in principle one should be able to mostly keep the same pipeline once the data becomes larger than the available RAM memory, except for some finetuning of the on-disk-caching. My impression is that trying to achieve this from scratch (handling serialization and deserialization, prefetching, sharding, etc) would be a lot of work and also mostly reinventing the wheel...

LasseBlaauwbroek commented 1 year ago

Attached is a diagram of my proposal. I suggest to entirely ditch the idea of a dataserver. This brings way to much overhead and complexity with it. Instead, I'd go for an entirely functional approach. stream-proposal.pdf

Edit: updated diagram somewhat.

jasonrute commented 1 year ago

Thanks for looking into this everyone, especially @LasseBlaauwbroek!

Is the tfgnn model now superior in every way to the tf2 model? If so, let's get rid of the tf2 model (if we ever need it again, it is in git's history).

Mostly. There were still a few more comparison experiments I wanted to run, but I could do those on v13 or v14 in old branches. I personally have no intention of going back to the TF2 model. All are new features are in in the TFGNN model.

Let's merge tfgnn.dataset.Dataset and loader.py_data_server.DataServer into one class. This will save a lot of transformations in the pipeline and simplify the code. I don't really see any reason why we need two classes here.

I'm not sure I understand the proposal here, but I'm open to it.

Remove any calls to cache. There are currently multiple in the pipeline. At most, there should be one call, but ideally none at all.

Yes we probably don't need all the calls to cache we have. One is because we split the data in the dataset and we need a cache before the split to prevent recomputing all the test data for the training data pipeline. If we handle training and validation data better, then this will go away.

jasonrute commented 1 year ago

I'm not sure I understand all your proposals in detail yet, Lasse, but here are some high level things I'd like from a pipeline in order of preference:

Robust to tricky-to-debug errors (for example indexing errors in graphs).
Easy to experiment with ideas. Both you and I have ideas for other things we would like to add to training. If it is easy to add different data (and avoid bugs when doing so) that would be great and improve our scientific productivity.
Scalable to larger data (and remain runnable on at least 2 GPUs).
Fast. It doesn't have to be lightning fast, but it would be nice for it to be fast enough to get results in a few days of training. Similarly, for small test datasets, it should run in a few minutes.

LasseBlaauwbroek commented 1 year ago

I think we can easily do all of those things, maybe except for (1) because that is less related to the pipeline.

IBM / graph2tac