jasonrute opened this issue 1 year ago
How would we like the caching to behave in general? Either we can load all the data into memory in the format ready for the network (but then there can be a scaling issue if the data don't fit into memory), or we can reload all the data each epoch (which can be slow). Or are there other options, like saving a cache to the hard drive in TF format? I am not sure how feasible that would be. I would like to first properly understand our aim, and then we can try to figure out the technical details of whether we want to use `tf.Dataset` or not. Note that the loader basically offers random access to the datapoints (but it computes them every time they are accessed).
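For reference, one way the "save a cache to disk in TF format" option could look, as a minimal sketch (assuming TF >= 2.7, where `tf.data.Dataset.save`/`load` serialize a dataset in TensorFlow's own on-disk format):

```python
import tensorflow as tf

# Sketch: materialize a preprocessed dataset to disk once, then reload it
# in later runs instead of recomputing it. The path is a placeholder.
ds = tf.data.Dataset.range(10).map(lambda x: x * 2)

ds.save("/tmp/my_cached_dataset")                 # serialize in TF's format
restored = tf.data.Dataset.load("/tmp/my_cached_dataset")

for x in restored.take(3):
    print(x.numpy())                              # 0, 2, 4
```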
To me, the basic requirements of training are like this:
To me, it seems that the proper way of dealing with this is as follows (a rough sketch of steps (2) and (3) follows the list):

1. An index of all datapoints is built once, up front.
2. While epoch `n` is being trained, the next epoch `n+1` is being prepared: the index from (1) is randomly permuted and split into batches. This should be pretty quick, but if it is not, we have the entire epoch to calculate it.
3. While batch `i` of epoch `n` is being trained, batch `i+1` is fetched from the Cap'n Proto dataset. That is, we calculate the forward closures of all the root nodes in the batch and load them into a Numpy array or whatever other data structure.
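A minimal sketch of the permute-and-split part of this scheme in plain Python; `loader` and `batch_size` are hypothetical placeholders, not names from the codebase:

```python
import numpy as np

# Hypothetical sketch; `loader` stands for the random-access graph loader.
def epoch_batches(num_datapoints, batch_size, rng):
    index = rng.permutation(num_datapoints)       # step (2): permute the index
    return np.array_split(index, max(1, num_datapoints // batch_size))

rng = np.random.default_rng(0)
for batch_ids in epoch_batches(num_datapoints=10, batch_size=3, rng=rng):
    # step (3): compute forward closures of these roots via the loader, e.g.
    # batch = [loader[i] for i in batch_ids]
    print(batch_ids)
```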
I see, so we would like the `Dataset` to look ahead and prepare a batch it was not asked about yet... I would have to look more into `Dataset` to see if that happens by default, or how to do it.
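For what it's worth, to my knowledge this look-ahead is opt-in rather than the default: `prefetch` is the mechanism that prepares the next element(s) in the background. A minimal sketch:

```python
import tensorflow as tf

# `prefetch(1)` keeps one element ready in the background, so batch i+1
# is being prepared while batch i is consumed by training.
ds = (tf.data.Dataset.range(100)
      .map(lambda i: i + 1)   # stand-in for expensive batch construction
      .prefetch(1))

for x in ds.take(2):
    print(x.numpy())
```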
Also, at some point, we were considering moving the graph loader into Cython in case that would be a bottleneck (but I think we concluded there were more serious speed issues).
Note that if it is indeed the case that we need `tf.Dataset` in order to parallelize over multiple GPUs, then I propose this scheme: a `tf.Dataset` basically corresponds to a 'RAM-batch'. This is the largest amount of data that we are willing to load into RAM. The `tf.Dataset` can then split this batch into smaller 'GPU-batches' (sketched below).
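A minimal sketch of that scheme, assuming a hypothetical `load_ram_batch` helper that materializes one RAM-sized chunk of the dataset as arrays:

```python
import numpy as np
import tensorflow as tf

GPU_BATCH = 32        # placeholder sizes
NUM_RAM_BATCHES = 4

def load_ram_batch(k):
    # Hypothetical stand-in: materialize the k-th RAM-sized chunk.
    return np.random.rand(1024, 8).astype(np.float32)

for k in range(NUM_RAM_BATCHES):
    # Each tf.data.Dataset covers exactly one RAM-batch and slices it
    # into GPU-sized batches.
    ds = tf.data.Dataset.from_tensor_slices(load_ram_batch(k)).batch(GPU_BATCH)
    # model.fit(ds, ...)
```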
The only other alternative I see is to load the entire Cap'n Proto dataset into a `tf.Dataset`. But if we run out of memory, then the TensorFlow code will have to be responsible for swapping part of the `tf.Dataset` to disk. Does this functionality exist? (I would think so, because surely we are not the only ones with datasets that exceed RAM?)
Looking through the TensorFlow API, it looks to me like a lot of what I describe can be easily done using a combination of the `prefetch` functionality and the `from_generator` functionality.
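A minimal sketch of how the two could combine, with a hypothetical `load_datapoint` standing in for the graph loader:

```python
import tensorflow as tf

def load_datapoint(i):
    # Hypothetical stand-in for fetching/encoding one datapoint.
    return [float(i)] * 8

def gen():
    for i in range(1000):
        yield load_datapoint(i)

# The generator feeds datapoints lazily; `prefetch` overlaps data
# preparation with training.
ds = (tf.data.Dataset.from_generator(
          gen, output_signature=tf.TensorSpec(shape=(8,), dtype=tf.float32))
      .batch(32)
      .prefetch(tf.data.AUTOTUNE))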
I spent some time digging through the codebase and through the TensorFlow documentation. My impression is that using `tf.data.Dataset` is a good idea in general, and there is no reason why we can't have our cake and eat it too. Here is what I would suggest as a 'plan of attack' (a rough pipeline sketch follows the list):

1. Let's merge `tfgnn.dataset.Dataset` and `loader.py_data_server.DataServer` into one class. This will save a lot of transformations in the pipeline and simplify the code. I don't really see any reason why we need two classes here.
2. Move the shuffling to the start of the pipeline. The `shuffle` function requires a buffer, which scales linearly with the size of the data in the buffer. Hence, it is much cheaper to do this early in the pipeline. When the shuffling is the first step in the pipeline, it does not have to take any memory at all. Currently, the shuffling is happening way too late.
3. Remove any calls to `cache`. There are currently multiple in the pipeline. At most, there should be one call, but ideally none at all.
4. Experiment with explicit `prefetch` buffer sizes (instead of `AUTOTUNE`). Use the TF Profiler to see if we have any bottlenecks.
5. Add `num_parallel_calls` to `batch` and any remaining `map` calls in the pipeline.
6. Use the `interleave` function or plain old `multiprocessing` to feed data into the pipeline in parallel.
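To make points 2, 4, and 5 concrete, a rough sketch under assumed names (`load_fn` is a hypothetical stand-in, not the actual loader):

```python
import tensorflow as tf

NUM_POINTS = 100_000

def load_fn(i):
    # Hypothetical stand-in for fetching and encoding one datapoint.
    return tf.cast(tf.fill([8], i), tf.float32)

ds = (tf.data.Dataset.range(NUM_POINTS)       # lightweight indices only
      .shuffle(NUM_POINTS)                    # cheap: buffer holds ints, not graphs
      .map(load_fn, num_parallel_calls=tf.data.AUTOTUNE)
      .batch(32, num_parallel_calls=tf.data.AUTOTUNE)
      .prefetch(4))                           # explicit size instead of AUTOTUNE
```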
While I'm not very up-to-date on the code-level details of how the `tf.data.Dataset` is being used right now, let me just point out that `tf.data.Dataset.cache` supports on-disk caching too, and if I recall correctly this includes "fancy" features such as sharding (for when, e.g., you need to split the data into multiple files because otherwise it's too big or too slow to read a batch from a single disk). So in principle one should be able to mostly keep the same pipeline once the data becomes larger than the available RAM, except for some fine-tuning of the on-disk caching. My impression is that trying to achieve this from scratch (handling serialization and deserialization, prefetching, sharding, etc.) would be a lot of work and also mostly reinventing the wheel...
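For reference, the file-backed variant is `cache` with a filename argument; a minimal sketch (path is a placeholder):

```python
import tensorflow as tf

# Passing a filename to `cache` writes the elements to disk on the first
# pass and reads them back on later epochs, so the same pipeline keeps
# working when the data no longer fits in RAM.
ds = tf.data.Dataset.range(1000).map(lambda x: x * x)
ds = ds.cache("/tmp/pipeline_cache")   # on-disk instead of in-memory cache
```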
Attached is a diagram of my proposal. I suggest entirely ditching the idea of a data server; it brings way too much overhead and complexity with it. Instead, I'd go for an entirely functional approach: stream-proposal.pdf

Edit: updated the diagram somewhat.
Thanks for looking into this everyone, especially @LasseBlaauwbroek!
Is the TFGNN model now superior in every way to the TF2 model? If so, let's get rid of the TF2 model (if we ever need it again, it is in git's history).
Mostly. There were still a few more comparison experiments I wanted to run, but I could do those on v13 or v14 in old branches. I personally have no intention of going back to the TF2 model. All our new features are in the TFGNN model.
> Let's merge `tfgnn.dataset.Dataset` and `loader.py_data_server.DataServer` into one class. This will save a lot of transformations in the pipeline and simplify the code. I don't really see any reason why we need two classes here.
I'm not sure I understand the proposal here, but I'm open to it.
> Remove any calls to `cache`. There are currently multiple in the pipeline. At most, there should be one call, but ideally none at all.
Yes, we probably don't need all the calls to `cache` that we have. One is there because we split the data in the dataset, and we need a cache before the split to prevent recomputing all the test data for the training data pipeline. If we handle training and validation data better, then this will go away. (A sketch of that remaining use is below.)
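A minimal sketch of that cache-before-split pattern, with placeholder data and a hypothetical `VALID_SIZE`:

```python
import tensorflow as tf

# Cache once *before* the split, so materializing the validation pipeline
# does not recompute elements for the training pipeline (and vice versa).
VALID_SIZE = 100

full = tf.data.Dataset.range(1000).map(lambda x: x + 1).cache()
valid = full.take(VALID_SIZE)
train = full.skip(VALID_SIZE)
```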
I'm not sure I understand all your proposals in detail yet, Lasse, but here are some high level things I'd like from a pipeline in order of preference:
I think we can easily do all of those things, maybe except for (1) because that is less related to the pipeline.
In the TFGNN model we are using `tf.Dataset` to make a data processing pipeline. It has certain advantages and disadvantages that we are bumping up against. In this issue, I'm gathering details so that we can have a complete picture of the situation.

The good

The bad

Observations