Closed veikkoeeva closed 5 years ago
Hi @veikkoeeva , to answer:
Generally, as a rule stateful objects should never attempt to make themselves thread safe, not just in ML.NET, but everywhere. The exception for some classes whose multi-threaded usage is possible to anticipate and is part of the design, e.g., concurrent classes. Otherwise it is a bad idea, since what constitutes "thread safety" is highly context dependent. Indeed, in the case of PRNGs specifically, if there is a case in the code where there is uncontrolled access from multiple threads to a single PRNG this would mean we could not count on deterministic results from whatever code was utilizing that PRNG, so that usage would be considered a bug and fixed, so that the threads were accessing their own private instances of PRNGs.
For the purposes used in ML specifically, I have not heard that utilizing cryptographically secure PRNG (whether for seeding or any other purpose) has any utility. Let me know if you have heard differently.
This is a matter of personal taste, so consider my answer for myself only. There are two schools of thought on this: first that it's best to have each non-nested type in its own file (indeed, in some languages like Java you have no choice), the other that logical grouping of classes in a single file, especially something meant to be understood logically as a unit anyway, helps aid comprehensibility of the code as a whole. As you've correctly observed the code was written under the second style of thinking. I'd personally rather not change it; while as a general rule I think types in their own files is a fine thing to do 95% of the time, in those remaining 5% of cases here I consider the current arrangement an aid rather than impediment to my understanding of this codebase. But my opinion is on a case-by-case basis, since what constitutes "more comprehensible" is incredibly subjective. :)
We write model files containing serialization of the behavior of loaders, transforms, predictors, and other things. This means certain objects (including this one) need to be serialized in some consistent fashion. (Indeed, this is half the reason why TauswortheHybrid
exists in the first place, since certainly we would have preferred to just rely on System.Random
if such a thing were possible.) So, I would say this is strictly needed functionality. But even if the ability to load/save itself hypothetically wasn't necessary core functionality, if something (not just this object) needs to serialize itself, generally the only code that "understands" the internal states of a class sufficiently well to serialize it is the class code itself. Just by the principle of encapsulation I'd consider any logic to do this that lived outside the class to be rather suspect.
Now, as it happens, TauswortheHybrid
, for totally separate reasons, had to expose its guts anyway. So in principle you could write a separate utility class for serializing/deserialing the state. But I'm not certain I would be motivated to make that change.
@TomFinley
SafeRandom
of some sort to make programmers fall into pit of success. It may not easy to know in which thread the variables are executed in asynchronous workflows or that not notice Random
instances created closely together (I suppose about 15 ms currently) are seeded with the same value and hence produce the same output also. This is very old, but shows some problems. There are other articles too. I have not checked how randomness is used in this project, but this leads me to think one "needs to know how to use the library" not to introduce inadvert bugs (that could perhaps be avoided by playing safe with these).var rand1 = new Random(); var rand2 = new Random();
and ending up with the same random sequence. :) See the link in 1.
.-
Particularly 2. should be used also for internal calls so one doesn't end up in a situation like this:
public class Stuff
{
private int stuffData0 = 0;
/* ...*/
private int stuffDataN = n;
public int DoStuff1(int x)
{
/* How to refactor this? Can the functions be rearranged? Variabless assigned in different order? */
DoStuff2(x);
DoStuff3();
/* Do something with internal variables. */
DoStuff4();
/* Do something with internal variables. */
DoStuffN();
}
}
Then when the state of the class gets corrupt and leaks out of the public function to elsewhere in the program, it's difficult to track down. I obviously make this a pointed example, but this happens easily in inheritance chains and it leads easily to thinking that something can be added to base classes even if the inheritors do not use it. Or that a reason to use base classes is to "share functions", even if that could be done with just function calls (or extension methods) But this is tangetial to this discussion...
Why would TausworthHybrid
need to expose its internals? I'd be curious if it shed some light and maybe be something to refactor in the future.
Hi @veikkoeeva , thanks for writing back.
It looks like many projects define SafeRandom of some sort to make programmers fall into pit of success.
I think though what I'm saying is, this is not a pit of success. ML has lots of stochastic behavior, but simultaneously reproducibility of results is also critical. If multiple threads are accessing the same random number generator at once, then determinism and so reproducibility is impossible since there is a race condition. That is a bug.
It is certainly true that using var r1=new Random();var r2=new Random()
could yield bad results. This and the desire for determinism is one of the reasons why we have IHost.Rand
throughout the codebase. We instantiate a random only once, and all further random number generators should be "forked" from there. (And, if you trace through the code, you will indeed find that the only time a System.Random
is instantiated outside of test code, is exactly once, during the creation of the host environment.)
Regarding question of state, again reprodicibility. All IDataView
s have the contract that if you open a cursor on them, you will get exactly the same results, and results are consistent. We have lots of code that depends on this. Simultaneously: we have some IDataView
s, specifically some transforms, with stochastic behavior... e.g., if you open a cursor on a GenerateNumberTransform
, and it yields the numbers [5,2,3,10] or something, when you open a cursor on it again it will yield the same sequence. Now then: there are different ways to solve this problem, but the one the author (not me, forget who) went with was to make the state extractable, and then be able to put it back.
That seems like a mistake to me. Even if I'd gone with the State
based solution, why expose U1
, through U4
? But anyway, since it did, hypothetically one could enact your proposal of writing the serialization logic (which again is necessary, for serialization reasons) outside of the class.
@TomFinley Maybe the original author thought that the initial state could be exposed as a readonly public properties (maybe public uint InitialState1 { get; }
were better IntelliSense-wise, but there's a slight difference in IL there) so that the construction could take the initial parameters as constructor parameters (checked, readonly).
Indeed, this design would make it so that one could serialize random parameters out and read them back, even in the middle of computation. Then if it were possible to define a canonical interface (just as a rough example to bring concretia: https://github.com/dotnet/machinelearning/issues/126#issuecomment-388485566), one could bring any (de)serializer and load them dynamically. Without touching the core types. There are interesting things one could do here then, such as pausing/persisting arbitrary computation on the disk and continuing later (one such library is https://github.com/polytypic/Recalled/blob/master/Recalled.paket.template and it's rather performant, sadly the video showing it off had disappeard from Vimeo).
As for SafeRandom
, it doesn't need to be complicated. On example is here, could be put into MachineLearningNet.Random
(or what have you, note that without giving a seed a random one is generated and it might make sense to expose it in public readonly property and have a constructor that can take that as a parameter). I get a feeling IHost
acts like an ambient context storing things, but alternatively the relevant functions could take Random
as a (constructor) parameter too (exposed as readonly property in the receiving type). Then when they get the result and write them to a data structure (DataView
?), it can checkpoint (or whatever action) all that with the parameters in the format someone thinks is appropriate. If Random
were to have a field such as Tag
that could be user-labelled, it'd be easy to check persisted files, debugging etc. too things match.
I think it's plausible people might use their own random instances (if this library gets popular, there is a lot of value being able to make it "just work") and have trouble, so it'd be easier to point them to use safe random operators. Something like the linked SafeRandom
which is difficult to use erroneously. More generally, I also think that if the purpose is to interface with other libraries, it's in general easier to favour passing parameters instead of trying to fit things in "rules of the library". Take a pinch of salt, though. I see somewhat duplicating .NET code and idioms and "rules of the library instead of the framework" thinking and it makes me a bit unease if I need to learn a lot of new concepts to be able to grok the system.
Seems like Tom has answered your questions. Closing this but if you have more questions please feel free to open a specific issue on it.
Coming from https://github.com/dotnet/machinelearning/issues/166.
At a glance point, there can be other issues too. Would there be opposition to refactor points 1 through 3? Possibly 4?
The source: https://github.com/dotnet/machinelearning/blob/436700aadf615e7f05a22925476cc441c63a919d/src/Microsoft.ML.Core/Utilities/Random.cs