AggregateException/InvalidOperationException when training from IEnumerable backed by EF Core Context

endintiers commented 5 years ago

System information

.Net Core 2.2, EF Core 2.2.1, ML.NET 0.9

Issue

Read data from SQL DB via EF Core as IEnumerable, pass to pipeline via streaming IDataView
Get InvalidOperationException from Microsoft.ML.Transforms.RowShufflingTransformer.Cursor.LoopProducerWorker()
It worked fine in ML.NET 0.8. I can work-around by forcing the trainer to be single-threaded but this makes training in 0.9 much slower than in 0.8 (on my machine).

The EF Core problem could be addressed by allowing multiple dbcontexts to be provided but this is likely to be an issue with any non thread-safe IDataView source...

Source code / logs

Source at: https://github.com/endintiers/Unearth.Demo.ML.FromDB shows this working in 0.8, failing in 0.9 and a work-around for 0.9 (single-threading the training).

Ivanidzo4ka commented 5 years ago

Thank you for reporting this. We need to investigate why we try to shuffle data even if it's comes from StreamingDataView which shouldn't be shuffle-able. If you need immediate patch for current problem, i would suggest you to add AppendCacheCheckpoint to the end of your dataProcessPipeline.

var dataProcessPipeline = mlContext.Transforms.Conversion.MapValueToKey("IATACode", "Label")
                    .Append(mlContext.Transforms.Text.FeaturizeText("FlightCode", "FlightCodeFeaturized"))
                    .Append(mlContext.Transforms.Concatenate("Features", "FlightCodeFeaturized")).
                    .AppendCacheCheckpoint();

this way we would read all cache all data in memory, which would speed up training, and also allow us properly shuffle data during training.

endintiers commented 5 years ago

OK, I see what AppendCacheCheckpoint does - it caches the whole Enumerable (gradually) so the cache is used for re-reads/shuffles.

The 'Airlines' data is just some public test data I can use for demos. I have more than a billion rows of real data in an Azure SQL DB that I am using in my day job. Even if I'm training on a small subset of this (I'm using 1%) I don't think caching it in memory is a good idea... I'd like to be able to give the trainer a 100K rows at a time or somesuch and dispose of it before moving on to the next tranche.

I have been trying to figure out a fix for myself but there is a lot of code to understand... Could it be that the IDataView needs to determine if CanShuffle is true for this source? But since I used 'Where' the underlying Enumerable is seen as a Linq object? So therefore (for now) ensure that CanShuffle is false for all Streaming IDataViews… Maybe later [insert clever code here] make more nuanced decisions...

endintiers commented 5 years ago

The glass is still cloudy, but I have noticed that the LinearTrainerBase.PrepareDataFromTrainingExamples method in SdcaBinary.cs always sets ForceShuffle to be true when constructing it's RowShufflingTransformer. It ignores the original streaming dataview's CanShuffle flag (which is always false for any streaming dataview).

That might make sense - the RowShufflingTransformer is supposed to abstract away the implementation of shuffles. And it works for other streaming sources and the trainer WANTS shuffled data to work good.

The actual problem is that EF Core DBContexts aren't thread-safe, so if cursors from multiple threads exhaust the pool and more than one tries to access the Enumerable to get more: 'boom'. Maybe the answer is to add another flag to the source IDataView ('IsNotThreadSafe') and then implement a lock when fetching more pool data (in that specific case)?

CESARDELATORRE commented 5 years ago

Adding @eerhardt - Please, take a look to this issue. Not directly related to the current DatabaseLoader we're implementing, but we should fix LoadFromEnumerable() when possible, as well.

Adding also @divega who mentioned this bug in a related discussion, as well.

LoadFromEnumerable doesn't always work properly because it assumes that the Enumerable-backed IDataView is thread-safe - this is not always the case, like when using EF Core DbContext.

@endintiers proposal is the following: So I propose starting by adding a switch to LoadFromEnumerable 'singleThreaded=true' which would intercept IDataView cache update requests and force them onto a single thread. This may slow down the trainers a bit but nowhere near as much as single-threading them (where possible) and we can improve performance with eager loading. Doing that first would remove a lot of complexity from the DatabaseLoader task (threading no longer an issue).

Thoughts?

endintiers commented 5 years ago

@CESARDELATORRE I wouldn't say it's totally unrelated to DatabaseLoader. In my mind the obvious thing is for DatabaseLoader to leverage the existing stack under LoadFromEnumerable and various parts of IDataView and the cache loader... Unless there is some reason that generic database as a source is fundamentally different? But, yes, depending on what the spec said, assuming that the IEnumerable implementation is thread-safe is a bug.

eerhardt commented 5 years ago

The exception I get with the original code (using EF Core 2.2) says:

A second operation started on this context before a previous operation completed. This is usually caused by different threads using the same instance of DbContext, however instance members are not guaranteed to be thread safe. This could also be caused by a nested query being evaluated on the client, if this is the case rewrite the query avoiding nested invocations.

See https://github.com/aspnet/EntityFramework.Docs/blob/master/entity-framework/core/miscellaneous/configuring-dbcontext.md#avoiding-dbcontext-threading-issues for more details.

In ML.NET you are using the SDCA algorithm, which uses multiple threads by default. This is in direct conflict with the data source that you are reading from which is prohibiting multi-threaded access.

I don't see how something in ML.NET could solve this, there is an inherent mismatch between the algorithm and the data source. You will have to handle it on your own by doing one or more of the following:

Create a new DbContext for each enumeration. See our sample of how to do this here: https://github.com/dotnet/machinelearning-samples/blob/874a1fec13bbd51c543755048a4da6cfdd618d3a/samples/csharp/getting-started/DatabaseIntegration/DatabaseIntegration/Program.cs#L44-L60
Use in-memory caching, as @Ivanidzo4ka pointed out.
You can cache the data from the DB to an .idv file locally, and then perform training on the local data. (see mlContext.Data.SaveAsBinary().)
Change the SDCA options to only use a single thread.
Use a different algorithm that isn't multi-threaded by default. For example, LightGbm.

As for it working in EF Core 3.0, I see it working as well, but I am not certain it is intended given the above avoiding-dbcontext-threading-issues link. It looks like there is a new query pipeline in EF Core 3.0. See:

https://github.com/aspnet/EntityFrameworkCore/pull/14455 https://github.com/aspnet/EntityFrameworkCore/pull/15913

@divega @ajcvickers - any comment here? Should EF Core 3.0 be throwing an exception when multiple threads are accessing the same DbContext, like it did in earlier versions?

divega commented 5 years ago

@eerhardt, @CESARDELATORRE we had a very long discussion with Tom and others about this when we were preparing the sample at https://github.com/dotnet/machinelearning-samples/tree/master/samples/csharp/getting-started/DatabaseIntegration.

Notice that in https://github.com/dotnet/machinelearning-samples/blob/874a1fec13bbd51c543755048a4da6cfdd618d3a/samples/csharp/getting-started/DatabaseIntegration/DatabaseIntegration/Program.cs#L44-L60 we make sure each new enumeration of the data happen on a separate DbContext and DbConnection.

I would recommend that we point customers to this sample when questions like this arise, since we already made the investment to work out all the major kinks there.

Re the fact that ML.NET makes the assumption that IEnumerable<T> instances are thread safe, the more I think about it, the more I am convinced that it is just a bug. It is only true for a subset of implementations (e.g. in-memory collections that are not being modified) but as @ajcvickers likes to remind me, all abstractions in .NET are not thread-safe unless explicitly stated.

I think besides pointing people to the existing sample, other things the ML.NET team could consider are:

Document the special assumptions made in the LoadFromEnumerable() API and point to a fwlink.
Deprecate the current LoadFromEnumerable() and create a new overload that takes a factory argument of type Func<IEnumerable<T>>. This gives the user a hint (although maybe not too obvious) that a new IEnumerable<T> should be created every time, although it is simple enough to just pass always the same IEnumerable<T> instance if the user knows that the data is already cached in memory.
You could even have a LoadFromInMemoryEnumerable() method with the current signature of LoadFromEnumerable() for the simple scenarios.

divega commented 5 years ago

Re how this impacts DatabaseLoader, although it shouldn't need to use LoadFromEnumerable() because going directly from a DbDataReader to an IDataView should be more efficient, we will run into the same fundamental problem because DbConnection implementations are not thread-safe, and even worse, many of them don't even support MARS (multiple active result sets).

ajcvickers commented 5 years ago

100% agree with @divega

endintiers commented 5 years ago

@eerhardt >> I don't see how something in ML.NET could solve this The same way that WinForms solved the UI update thread problem - marshalling. If an IEnumerable is marked as single-threaded then all cache update requests should be intercepted, the cache requesting thread goes into Wait, the request is marshalled to the original thread that dispatched the IEnumerable (which has been waiting for this), the data is retrieved, the requesting thread is woken. This will result in slightly slower training for multi-threaded models, but the flag in LoadFromEnumerable need only be set for cases where this is an issue.

Something like that anyway :-) There has to be a queue for cache update requests.

endintiers commented 5 years ago

@divega @ajcvickers >> need to use LoadFromEnumerable() DatabaseLoader doesn't need to use the top level LoadFromEnumerable code but will quickly start to depend on code further down (which is multi-threaded). The issue will be the same. A decision needs to be made: allow single-threaded cache sources and code for that or specifically disallow them (maybe for V1?). Developers will still use LoadFromEnumerable to access database-backed/single-threaded IEnumerables when DatabaseLoader is available, so the same can be said for LoadFromEnumerable (fix or document)

Although (thinking) since all that DatabaseLoader is doing is trying to make it easier for devs to create an IEnumerable (the end game here) from some database, adding an additional method at the IDataView level could be seen as needlessly making that interface more complex. The 'database wizard' stuff could easily sit on top of LoadFromEnumerable? (am I missing some subtlety here?)

divega commented 5 years ago

DatabaseLoader doesn't need to use the top level LoadFromEnumerable code but will quickly start to depend on code further down (which is multi-threaded). The issue will be the same.

@endintiers Maybe I wasn't clear, but this is exactly what I meant when I said:

we will run into the same fundamental problem because DbConnection implementations are not thread-safe...

The thing is that with most databases using multiple connections can be a cheaper approach than marshalling, at least from the perspective the number of changes that user code or ML.NET may need (finding out if it is ultimately more efficient at runtime would require some testing, but database drivers are usually optimized to be used this way).

The 'database wizard' stuff could easily sit on top of LoadFromEnumerable?

Possibly what you are missing is why we want to skip IEnumerable<T> and LoadFromEnumerable in the database loader: DbDataReader and IDataView are similar in that they work like sliding windows over the data. Neither has the concept of a .Current object as IEnumerator<T> does, hence when reading data there is no need to allocate and initialize an object to be returned by it on every iteration. Going from DbDataReader to IDataView directly can save a lot of runtime work and allocations.

eerhardt commented 5 years ago

Thanks for the pointer to the sample, @divega. I didn't know we already had this. I've updated my list above to include the link to this sample.

I've opened https://github.com/dotnet/machinelearning/issues/4036 on docs.microsoft.com to explicitly document this as a limitation/concern of LoadFromEnumerable.

CESARDELATORRE commented 5 years ago

@eerhardt - But I think that documenting it and pointing to a sample is just a short term workaround.

The fact that ML.NET makes the assumption that IEnumerable instances are thread safe feels completely like a bug. The way to fix a bug is by fixing the API implementation not making possible that users will get into trouble. Documenting a workaround for a bug is only a short term workaround.

As suggested by Diego, I also believe we need to change/fix the API implementation so it won't be possible for the user to get into this trouble, either by:

(Pointed by @divega) Deprecate the current LoadFromEnumerable() and create a new overload that takes a factory argument of type Func<IEnumerable>. This gives the user a hint (although maybe not too obvious) that a new IEnumerable should be created every time, although it is simple enough to just pass always the same IEnumerable instance if the user knows that the data is already cached in memory.
(Pointed by @divega) You could even have a LoadFromInMemoryEnumerable() method with the current signature of LoadFromEnumerable() for the simple scenarios.
(Pointed by @endintiers ) Marking and using the IEnumerable as single-threaded. This might slow down the training process multi-threaded models, but the flag in LoadFromEnumerable need only be set for cases where this is an issue.

Also, for better performance we're going to have the new DatabaseLoader (for relational databases). However, for non-SQL databases or other data sources we still need the LoadFromEnumerable() approach. It is not something we can simply deprecate.

My point is that it is better if we create "fences/guards" in the API itself than just documenting it and show a sample because the chances that developers will get into this trouble (like when using EF if not following the example we created) are pretty high...

CESARDELATORRE commented 5 years ago

@endintiers - Question related to SQL databases but in a different page researching possible customer scenarios and differences in features and issues depending on that:

When you use a database to train ML.NET models, what RDBMS do you use? SQL Server? Other? Do you use that database in Azure such as using Azure SQL Database? Or do you use SQL Server or any other RDBMS on-prem?

This question is not directly related to this issue so we can follow up by email if you'd like. Can you send me an email to 'cesardl at microsoft.com' or through Twitter to @CESARDELATORRE when possible? - Thanks! :)

I'll remove this comment as soon as we're talking offline. :)

frank-dong-ms-zz commented 4 years ago

This issue is already documented and we suggest that IEnumerable needs to be thread-safe, I would like to mark this as enhancement and mark this as P2 CC @harishsk

dotnet / machinelearning