apache / datasketches-cpp

Core C++ Sketch Library
https://datasketches.apache.org
Apache License 2.0

Determinism #361

Open RAMitchell opened 1 year ago

RAMitchell commented 1 year ago

Hi, I am interested in using this amazing project for a distributed machine learning application. My only blocker to doing this is the ability to reproduce results for a given seed. I am wondering if there is any interest in adding such a feature, or any way to use the existing library to get a deterministic result. It looks to me like it would require adding a random engine member to the sketches themselves and then modifying the appropriate serialisation code.

I also see that I could compile with the KLL_VALIDATION option to get a deterministic KLL sketch, although this is not thread safe.

Any suggestions?

AlexanderSaydakov commented 1 year ago

KLL_VALIDATION is for testing only. We are quite reluctant to provide a mechanism to defeat randomness. Perhaps we might reconsider if you explain your use case better. Why would a slight difference within the statistically guaranteed bounds be a problem?

RAMitchell commented 1 year ago

It's often necessary to repeat experiments exactly in ML applications. Specifically I want to use it for quantising data in a distributed setting for an xgboost/lightgbm type algorithm. The default could still be to use a device random source but it would be nice to have the option to set the random state. Maybe as a template option to the kll sketch class?
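Something hypothetical along these lines is what I have in mind (this is not the real `kll_sketch` interface, just the rough shape of the suggestion, with all names and parameters made up for illustration):

```cpp
#include <cstdint>
#include <random>

// Hypothetical illustration of an RNG-parameterized KLL-style class.
// Not the datasketches::kll_sketch API; just the shape of the suggestion.
template<typename T, typename RandomEngine = std::mt19937_64>
class seedable_kll_sketch {
public:
  // default: seed from a device random source, as today
  explicit seedable_kll_sketch(uint64_t seed = std::random_device{}()) : engine_(seed) {}

  void update(const T& item) {
    // the engine would drive the randomized compaction decisions
    const bool coin = engine_() & 1;
    (void) coin; (void) item; // elided: actual KLL compaction logic
  }

private:
  RandomEngine engine_;
};
```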

AlexanderSaydakov commented 1 year ago

I am afraid you did not quite explain why it is necessary to repeat exactly. What happens if it is a bit different?

jmalkin commented 1 year ago

We've debated this before since it seems like a reasonable thing to do. The challenge is that in most cases it will break the sketch error distribution -- and still not actually be deterministic!

Determinism would depend on feeding data to the sketch in a deterministic order, and if merging across nodes, also merging in a deterministic order. Presenting data in sorted order, however, tends to degrade the results. It would also be a Very Bad thing for every sketch to use the same seed, since errors would be correlated across nodes, so you'd need to ensure that each node processing data has a deterministic but unique seed.
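One generic way to get per-node seeds that are reproducible but not shared is to derive each worker's seed from a single base seed plus a stable worker rank (not a thread id). A minimal illustration, not library code:

```cpp
#include <cstdint>
#include <random>

// Generic sketch (not part of the library): derive a distinct, reproducible
// engine per worker from one base seed and a stable worker rank, so errors
// are not correlated across nodes but each run with the same inputs repeats.
inline std::mt19937_64 make_worker_engine(uint32_t base_seed, uint32_t worker_rank) {
  std::seed_seq seq{base_seed, worker_rank}; // mixes the two values
  return std::mt19937_64(seq);
}

// usage: each worker passes its own rank, assigned deterministically by the scheduler
// auto engine = make_worker_engine(/*base_seed=*/42, /*worker_rank=*/3);
```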

There might be other gotchas, but that's what comes to mind off the top of my head. We've been quite reluctant to provide "functionality" that makes things so fragile, especially if it's in ways that most library consumers won't be aware of.

RAMitchell commented 1 year ago

> I am afraid you did not quite explain why it is necessary to repeat exactly. What happens if it is a bit different?

Here is an article that discusses reproducibility issues in ml: https://blog.ml.cmu.edu/2020/08/31/5-reproducibility/

Also, I think many users find it difficult to evaluate the relative impact of changes to training hyper-parameters on eval set performance when changes are also occurring due to randomness. Having a seed allows them to deal with these changes separately.

I think you could argue ML folks are doing things wrong, but it does seem to be a common requirement we get. So much so that we rebuilt the xgboost library's GPU algorithms to be deterministic even with floating point atomics.

I am prepared to carefully seed each sketch instance differently. I agree this is a potential pitfall, but HPC developers are used to dealing with this issue for things like random number generation or Monte Carlo simulations.

It's relatively easy to ensure the data is fed to the sketches in the same order. The sketches would be merged in a tree-reduce pattern, which is also deterministic. We would be running this code on the Legate framework, so as long as it partitions the data across workers in the same way on repeated runs, these conditions will be met.
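Roughly, the merge step would be a fixed-order pairwise reduction like the following (a sketch assuming the existing `kll_sketch::merge` API; partners are chosen purely by index, so the result repeats given the same partitioning and per-worker seeds):

```cpp
#include <stdexcept>
#include <vector>
#include <kll_sketch.hpp>  // datasketches-cpp

// Deterministic pairwise (tree) reduction of per-worker sketches.
// The pairing order depends only on indices, never on timing or thread ids.
datasketches::kll_sketch<float> tree_reduce(std::vector<datasketches::kll_sketch<float>> parts) {
  if (parts.empty()) throw std::invalid_argument("tree_reduce: no sketches");
  for (size_t stride = 1; stride < parts.size(); stride *= 2) {
    for (size_t i = 0; i + stride < parts.size(); i += 2 * stride) {
      parts[i].merge(parts[i + stride]); // fixed pairing: (0,1), (2,3), then (0,2), ...
    }
  }
  return parts[0];
}
```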

We will never be able to guarantee determinism on different partitionings but that is fine.

RAMitchell commented 1 year ago

If you think I would be better off using another type of sketch I would also be interested in suggestions.

jmalkin commented 1 year ago

Just an update on where we are. While still skeptical of whether we really want to add a "shoot yourself in the foot" button in general, it's not an unreasonable request in a limited set of circumstances.

The next question is how to do this sanely. We recently made a change to provide thread-local randomness because the previous implementation wasn't thread-safe. That's good in general but problematic when setting the seed: a thread id isn't a stable thing, at least not across OSes. We also want to make sure that enabling deterministic behavior doesn't kill performance or lead to a substantial increase in memory use. Any such state also wouldn't be persisted during serialization.

One possibility is that setting the seed would preclude proper multithreaded behavior. That seems...not ideal, but perhaps a necessary trade-off given other considerations. But that opens up the problem of determining when we're trying to use the sketch in a multithreaded context.

At least in C++ (as opposed to our Java library) we do have the option to use #ifdefs, although we don't want to be too aggressive with them.

Any suggestions here are welcome!

RAMitchell commented 1 year ago

One problem here is that the Mersenne Twister PRNG state is quite large, I think, and the alternatives are not strong PRNGs.

At the moment I have moved to another solution so I don't immediately need this, however I think the discussion is very valuable.

I might want to use it later for things like calculating medians of distributed data; the median is the minimiser of L1 error.
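For that use the existing API already looks sufficient; a small example reading the approximate median with `get_quantile` at rank 0.5 (exact signatures may vary by library version):

```cpp
#include <iostream>
#include <kll_sketch.hpp>  // datasketches-cpp

int main() {
  datasketches::kll_sketch<double> sk;
  for (int i = 1; i <= 1000; ++i) sk.update(static_cast<double>(i));
  // approximate median of the stream, within the sketch's rank error bounds
  std::cout << sk.get_quantile(0.5) << std::endl;
  return 0;
}
```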

Perhaps a new deterministic algorithm could be added, like a mergeable GK sketch?

jmalkin commented 6 months ago

Well, it's not serialized as part of the sketch and I guess that limits the utility, but at some point we moved the random_bit into common_defs.hpp where it's directly user-facing. As a result, I believe you can set the seed as well as get/set the state of the underlying engine.

And you're quite right that the engine state is large, as I observed when testing to confirm that operations on datasketches::random_utils::random_bit work as expected.
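If so, something like the following should make single-threaded runs repeatable. This is a sketch under the assumption that `random_bit` is a standard `<random>` engine exposed in `datasketches::random_utils` (via `common_defs.hpp`) as described above; names and placement may differ by version.

```cpp
#include <kll_sketch.hpp>  // pulls in common_defs.hpp, which exposes random_utils

int main() {
  // Assumption: random_bit is a standard engine, so it can be re-seeded
  // (and its state streamed). This state is not persisted when the sketch
  // is serialized, and each thread would need its own seeding.
  datasketches::random_utils::random_bit.seed(12345);

  datasketches::kll_sketch<float> sk;
  for (int i = 0; i < 100000; ++i) sk.update(static_cast<float>(i));

  // With the same seed, the same input order, and a single thread,
  // repeated runs should produce identical sketches.
  return 0;
}
```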

If this is still an interesting problem, please let me know whether this is sufficient.

Also, apologies about the absurdly long delay getting back to this. It occurred to me this evening that we had likely (partially) addressed this ask without realizing it.

jmalkin commented 3 months ago

@RAMitchell Going to close this for now since we have a solution, even if perhaps not an ideal one. We'd be reluctant to change the serialization format even if we had an RNG with a smaller state, since that breaks compatibility. And we're not experts in RNGs, so I don't think we know how to make it cross-platform using native language sources of randomness.

Please feel free to re-open as needed.

leerho commented 3 months ago

@RAMitchell, I have posted a discussion topic where I would welcome your comments. Thanks!