NLeSC / Machine_Learning_SIG

The topics discussed in the Machine Learning SIG group.

Communication patterns of distributed machine learning #7

Open LourensVeen opened 4 years ago

LourensVeen commented 4 years ago

Hi all,

I'm not sure whether this is the right way to ask a question, and strictly speaking it's outside the scope of the SIG as defined in the README, but I'm hoping that someone can help me anyway, point me to someone who can, or tell me how to submit this better.

I'm working on the design of a distributed data-processing framework that can work with data that is stored in different locations, and run different parts of a workflow in different locations. One of the use cases is distributed machine learning, including the more traditional regression models as well as deep neural networks. Note that this is not about parallelising on an HPC cluster or a GPU, but about learning on data sets in different physical locations without moving the data.

It is my understanding that training algorithms for neural networks, and probably also iterative methods for fitting traditional ML models, will repeatedly read the training data and update the model weights/parameters until some equilibrium is reached. I also think (but am not at all sure) that in the distributed case, the current weights are broadcast to all sites, each site then calculates a weight update, the updates are sent back to a central location and combined into a new set of current weights, and this is repeated many times. Or perhaps the weights are passed from one site to the next, updated at each site, and sent around all sites in a loop many times.
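To make the pattern concrete, here's a toy sketch of the broadcast-and-aggregate scheme I have in mind. Everything here is made up for illustration (the function names, the one-weight "model", the averaging rule); it is not any real framework's API.

```python
# Toy broadcast-and-aggregate training: a coordinator broadcasts the
# current weights, each site computes an update on its own private data,
# and the coordinator averages the updated weights into the next round.

def local_update(weights, data, lr=0.1):
    """One site takes a gradient step on its own data only.
    The 'model' here is a single weight fitting y = w * x by least squares."""
    grad = sum(2 * x * (weights * x - y) for x, y in data) / len(data)
    return weights - lr * grad

def train_round(weights, sites):
    """Coordinator: broadcast weights, collect per-site updates, average."""
    updates = [local_update(weights, data) for data in sites]  # runs at each site
    return sum(updates) / len(updates)  # combined new set of current weights

# Three sites, each holding private (x, y) samples from roughly y = 3 * x.
sites = [
    [(1.0, 3.0), (2.0, 6.1)],
    [(1.5, 4.4), (3.0, 9.2)],
    [(0.5, 1.4), (2.5, 7.6)],
]

weights = 0.0
for round_nr in range(50):  # "repeated many times" until equilibrium
    weights = train_round(weights, sites)
print(round(weights, 1))  # → 3.0
```

The ring variant I mention would instead pass `weights` through the sites sequentially, calling `local_update` at each stop, with no central averaging step.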

Is the above correct? Or is this implemented differently? (I guess maybe people also do distributed matrix calculations for solving large regression problems in parallel? In the case of the model being passed around in a loop, do we need to visit data sets multiple times, or can we make do with a single go-around?)

If that's right, my second question is about performance. For typical data sizes that we see in our projects (let's exclude astronomy :)), how long does it take to calculate a weight update? I'm looking for a typical range of orders of magnitude here: are we talking milliseconds, seconds, minutes, or hours? And while we're at it, what's the typical order of magnitude for the number of updates needed to train a model?

The reason for wanting to know this is that some designs have much more overhead in starting to process an update request than others. That doesn't matter much if each update spends several minutes calculating anyway, or if there are only a few requests per training job, but it would ruin performance if we're talking milliseconds per weight update and thousands of updates or more. The slow designs are easier to implement, but I suspect milliseconds-and-thousands is the common case, in which case I'll need to pick something more advanced.
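To put rough numbers on that break-even point (all figures invented for illustration): total wall-clock time is roughly number-of-updates × (overhead + compute), so:

```python
# Back-of-envelope comparison of a slow design (say 1 s of startup
# overhead per update request) against a fast one (say 1 ms).
# All figures are illustrative, not measurements.

def total_time(n_updates, overhead_s, compute_s):
    # Total wall-clock time if update requests are handled one after another.
    return n_updates * (overhead_s + compute_s)

# Minutes-long updates, few requests: the slow design barely matters.
slowdown_long = total_time(10, 1.0, 300.0) / total_time(10, 0.001, 300.0)

# Millisecond-scale updates, thousands of requests: overhead dominates
# and the slow design is over a hundred times slower end to end.
slowdown_short = total_time(10_000, 1.0, 0.005) / total_time(10_000, 0.001, 0.005)

print(f"{slowdown_long:.2f}x vs {slowdown_short:.0f}x")
```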

Thanks in advance,

Lourens

bouweandela commented 4 years ago

Hi Lourens,

I'm no expert on this, but my guess would be that the answer to your questions rather depends on the model (and the size of your dataset). Maybe a good starting point could be to look at existing frameworks for distributed machine learning? E.g. https://ml.dask.org

LourensVeen commented 4 years ago

Hi Bouwe,

Thanks! That page has some useful information, and it seems like I'm not completely barking up the wrong tree at least. Scikit-learn also has an example for learning on streaming data which is interesting, at https://scikit-learn.org/0.15/modules/scaling_strategies.html.

This has made me realise that there's probably a dependency on how the data are distributed. In the example above, it seems that the documents are distributed in more or less random order, and reading two of the batches gives you a good enough sample of the whole distribution. In other words, there's more variance within the sets than between them. But if you have different data sets in different places, then it may be the other way around.
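A tiny fabricated example of that difference (pure Python, all numbers invented): if each site holds a random shuffle of the same pool, the site means sit close together, whereas if each site specialises, most of the variance sits between sites.

```python
# Compare variance of site means for shuffled vs specialised data.

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def between_site_variance(sites):
    # Variance of the per-site means: high when sites specialise.
    return variance([mean(s) for s in sites])

# Case 1: the same pool of values shuffled randomly across three sites.
shuffled = [[1, 9, 4, 6], [2, 8, 5, 7], [3, 9, 1, 6]]
# Case 2: each site specialises in its own range of values.
specialised = [[1, 2, 2, 3], [5, 6, 5, 6], [8, 9, 9, 8]]

print(between_site_variance(shuffled) < between_site_variance(specialised))  # → True
```

In the first case, reading a couple of batches from any one site gives a fair sample of the whole distribution; in the second it doesn't, which is the situation I'd expect with data sets in different physical locations.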

One scenario would be a collaboration of biotech companies who want to collaborate on drug discovery. These companies typically specialise in a particular set of molecules, and have databases with protein binding affinities for lots of different but similar molecules. These data are highly confidential. So you could try to train a model on all the data to maybe discover a whole new group of interesting molecules, but you'd have to use distributed learning and send the model to the data. In this case, you'd probably have to iterate over all the databases many times, or your model would end up skewed towards the first(?) data set visited.

I think I've managed to find a design that's not too hard to implement and can be cleanly extended to low-overhead execution in case it's needed, so I'm going to prototype that and then we'll see.

Thanks again,

Lourens

sonjageorgievska commented 4 years ago

Just saw this (it was on my to-do list). Why not discuss it live instead? I think we already talked about it a bit last year?

LourensVeen commented 4 years ago

Hi Sonja,

Well, if you have any input I'd be very happy to hear! We did indeed discuss distributed machine learning last year, when I was asked for advice on a proposal involving the drug discovery example I mentioned above. I'm now looking at applications in personalised medicine, but the use case is essentially the same from an infrastructure point of view.

I closed the issue because I've found a design that's easy enough to implement and can still deliver low-overhead updates if needed. If my system can do both, then it doesn't matter any more what people are likely to want to use :).

I'd be happy to discuss some more though if you or others want, it's an interesting topic.

bouweandela commented 4 years ago

Indeed I think this could be an interesting discussion topic for a machine learning SIG meeting, if you would like to present it there?

Regarding the confidentiality issue, I think that there must be some level of trust between the participating parties, because a cleverly chosen model (e.g. a deep generative model) could probably be used to partly 'reverse engineer' the training data.

LourensVeen commented 4 years ago

Yes, there are many trust and security issues in these kinds of systems :). I have done a bit of security analysis already, and there are various things that can be done to mitigate this (limit what software/models can run to ones that have been audited, limit the amount of data/entropy that can come out, refuse jobs that don't anonymise the data before putting it into the model, etc.). But that's a topic for another day and location.

About a presentation: yes in principle, but it will be a while before I can manage. I don't have much time for this project, and I'm still fighting Fortran and trying to get a paper out the door on my other project. Also it would probably be good to get a bit more information from the new projects, so that we have a clearer use case.