dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
8.99k stars 1.88k forks source link

Should Random.Utilities be thread safe and more random? #168

Closed veikkoeeva closed 5 years ago

veikkoeeva commented 6 years ago

Coming from https://github.com/dotnet/machinelearning/issues/166.

  1. The various random number generators are not thread safe. It might make sense to make them such. Are there cases against not making them such?
  2. Should the generators be seeded with cryptographic random seed to make sure they're random when no seed is given?
  3. The MachineLearning.Net has tendency to put many classes in one file (and have them uncommented). Would there be opposition to put the classes into separate files?
  4. The library has also a tendency to put functions not strictly needed for the functionality in the domain classes. Such as saving, like at https://github.com/dotnet/machinelearning/blob/436700aadf615e7f05a22925476cc441c63a919d/src/Microsoft.ML.Core/Utilities/Random.cs#L186. Would there be opposition to provide these as extension methods or in some other fashion? Since it might be one might want to serialize things differently and organizing differently would avoid baking in dependencies. A related idea: https://github.com/dotnet/machinelearning/issues/126#issuecomment-388485566.

At a glance point, there can be other issues too. Would there be opposition to refactor points 1 through 3? Possibly 4?

The source: https://github.com/dotnet/machinelearning/blob/436700aadf615e7f05a22925476cc441c63a919d/src/Microsoft.ML.Core/Utilities/Random.cs

TomFinley commented 6 years ago

Hi @veikkoeeva , to answer:

  1. Generally, as a rule stateful objects should never attempt to make themselves thread safe, not just in ML.NET, but everywhere. The exception for some classes whose multi-threaded usage is possible to anticipate and is part of the design, e.g., concurrent classes. Otherwise it is a bad idea, since what constitutes "thread safety" is highly context dependent. Indeed, in the case of PRNGs specifically, if there is a case in the code where there is uncontrolled access from multiple threads to a single PRNG this would mean we could not count on deterministic results from whatever code was utilizing that PRNG, so that usage would be considered a bug and fixed, so that the threads were accessing their own private instances of PRNGs.

  2. For the purposes used in ML specifically, I have not heard that utilizing cryptographically secure PRNG (whether for seeding or any other purpose) has any utility. Let me know if you have heard differently.

  3. This is a matter of personal taste, so consider my answer for myself only. There are two schools of thought on this: first that it's best to have each non-nested type in its own file (indeed, in some languages like Java you have no choice), the other that logical grouping of classes in a single file, especially something meant to be understood logically as a unit anyway, helps aid comprehensibility of the code as a whole. As you've correctly observed the code was written under the second style of thinking. I'd personally rather not change it; while as a general rule I think types in their own files is a fine thing to do 95% of the time, in those remaining 5% of cases here I consider the current arrangement an aid rather than impediment to my understanding of this codebase. But my opinion is on a case-by-case basis, since what constitutes "more comprehensible" is incredibly subjective. :)

  4. We write model files containing serialization of the behavior of loaders, transforms, predictors, and other things. This means certain objects (including this one) need to be serialized in some consistent fashion. (Indeed, this is half the reason why TauswortheHybrid exists in the first place, since certainly we would have preferred to just rely on System.Random if such a thing were possible.) So, I would say this is strictly needed functionality. But even if the ability to load/save itself hypothetically wasn't necessary core functionality, if something (not just this object) needs to serialize itself, generally the only code that "understands" the internal states of a class sufficiently well to serialize it is the class code itself. Just by the principle of encapsulation I'd consider any logic to do this that lived outside the class to be rather suspect.

Now, as it happens, TauswortheHybrid, for totally separate reasons, had to expose its guts anyway. So in principle you could write a separate utility class for serializing/deserialing the state. But I'm not certain I would be motivated to make that change.

veikkoeeva commented 6 years ago

@TomFinley

  1. Sure. :) It looks like many projects define SafeRandom of some sort to make programmers fall into pit of success. It may not easy to know in which thread the variables are executed in asynchronous workflows or that not notice Random instances created closely together (I suppose about 15 ms currently) are seeded with the same value and hence produce the same output also. This is very old, but shows some problems. There are other articles too. I have not checked how randomness is used in this project, but this leads me to think one "needs to know how to use the library" not to introduce inadvert bugs (that could perhaps be avoided by playing safe with these).
  2. Reading here, for instance (or other sources for that matter) makes me believe one might get "odd results" when writing var rand1 = new Random(); var rand2 = new Random(); and ending up with the same random sequence. :) See the link in 1..
  3. I'm fine with either, though prefer separe files. You might've noticed already I'd go with leaner classes and more functionally too. I can live, poking around and pestering here. I likely stop soon enough. :)
  4. I saw the other issue (number of which eludes me currently) that had this dependecy discussion going on. I'm currently poking around a bit on the wee hours to get a feeling of the library. To make a bit concrete: I think this in terms:

-

  1. Readonly data in from constructor call to classes. Parameters checked so invariants holds.
  2. Function calls should have all data as arguments, everything in return values. Parameters checked so invariants hold.

Particularly 2. should be used also for internal calls so one doesn't end up in a situation like this:

public class Stuff
{
    private int stuffData0 = 0;
    /* ...*/
    private int stuffDataN = n;

   public int DoStuff1(int x)
   {
       /* How to refactor this? Can the functions be rearranged? Variabless assigned in different order? */
       DoStuff2(x);
       DoStuff3();
       /* Do something with internal variables. */
       DoStuff4();
       /* Do something with internal variables. */
       DoStuffN();   
   }
}

Then when the state of the class gets corrupt and leaks out of the public function to elsewhere in the program, it's difficult to track down. I obviously make this a pointed example, but this happens easily in inheritance chains and it leads easily to thinking that something can be added to base classes even if the inheritors do not use it. Or that a reason to use base classes is to "share functions", even if that could be done with just function calls (or extension methods) But this is tangetial to this discussion...

Why would TausworthHybrid need to expose its internals? I'd be curious if it shed some light and maybe be something to refactor in the future.

TomFinley commented 6 years ago

Hi @veikkoeeva , thanks for writing back.

It looks like many projects define SafeRandom of some sort to make programmers fall into pit of success.

I think though what I'm saying is, this is not a pit of success. ML has lots of stochastic behavior, but simultaneously reproducibility of results is also critical. If multiple threads are accessing the same random number generator at once, then determinism and so reproducibility is impossible since there is a race condition. That is a bug.

It is certainly true that using var r1=new Random();var r2=new Random() could yield bad results. This and the desire for determinism is one of the reasons why we have IHost.Rand throughout the codebase. We instantiate a random only once, and all further random number generators should be "forked" from there. (And, if you trace through the code, you will indeed find that the only time a System.Random is instantiated outside of test code, is exactly once, during the creation of the host environment.)

TomFinley commented 6 years ago

Regarding question of state, again reprodicibility. All IDataViews have the contract that if you open a cursor on them, you will get exactly the same results, and results are consistent. We have lots of code that depends on this. Simultaneously: we have some IDataViews, specifically some transforms, with stochastic behavior... e.g., if you open a cursor on a GenerateNumberTransform, and it yields the numbers [5,2,3,10] or something, when you open a cursor on it again it will yield the same sequence. Now then: there are different ways to solve this problem, but the one the author (not me, forget who) went with was to make the state extractable, and then be able to put it back.

That seems like a mistake to me. Even if I'd gone with the State based solution, why expose U1, through U4? But anyway, since it did, hypothetically one could enact your proposal of writing the serialization logic (which again is necessary, for serialization reasons) outside of the class.

veikkoeeva commented 6 years ago

@TomFinley Maybe the original author thought that the initial state could be exposed as a readonly public properties (maybe public uint InitialState1 { get; } were better IntelliSense-wise, but there's a slight difference in IL there) so that the construction could take the initial parameters as constructor parameters (checked, readonly).

Indeed, this design would make it so that one could serialize random parameters out and read them back, even in the middle of computation. Then if it were possible to define a canonical interface (just as a rough example to bring concretia: https://github.com/dotnet/machinelearning/issues/126#issuecomment-388485566), one could bring any (de)serializer and load them dynamically. Without touching the core types. There are interesting things one could do here then, such as pausing/persisting arbitrary computation on the disk and continuing later (one such library is https://github.com/polytypic/Recalled/blob/master/Recalled.paket.template and it's rather performant, sadly the video showing it off had disappeard from Vimeo).

As for SafeRandom, it doesn't need to be complicated. On example is here, could be put into MachineLearningNet.Random (or what have you, note that without giving a seed a random one is generated and it might make sense to expose it in public readonly property and have a constructor that can take that as a parameter). I get a feeling IHost acts like an ambient context storing things, but alternatively the relevant functions could take Random as a (constructor) parameter too (exposed as readonly property in the receiving type). Then when they get the result and write them to a data structure (DataView?), it can checkpoint (or whatever action) all that with the parameters in the format someone thinks is appropriate. If Random were to have a field such as Tag that could be user-labelled, it'd be easy to check persisted files, debugging etc. too things match.

I think it's plausible people might use their own random instances (if this library gets popular, there is a lot of value being able to make it "just work") and have trouble, so it'd be easier to point them to use safe random operators. Something like the linked SafeRandom which is difficult to use erroneously. More generally, I also think that if the purpose is to interface with other libraries, it's in general easier to favour passing parameters instead of trying to fit things in "rules of the library". Take a pinch of salt, though. I see somewhat duplicating .NET code and idioms and "rules of the library instead of the framework" thinking and it makes me a bit unease if I need to learn a lot of new concepts to be able to grok the system.

codemzs commented 5 years ago

Seems like Tom has answered your questions. Closing this but if you have more questions please feel free to open a specific issue on it.