DatabaseSource requires new SqlConnection even if an open Connection exists!

FranzBl commented 5 years ago

I already posted this issue in the thread about DatabaseLoader (that is now closed). Since no one gave a comment to this, perhaps it was overseen. In my view it is a design problem in the implementation of the ML-SQL-Interface:

DatabaseLoader requires its own database connection:

DatabaseSource dbSource = new DatabaseSource(SqlClientFactory.Instance, connectionString, sqlCommand); DatabaseLoader loader = mlContext.Data.CreateDatabaseLoader();

I suppose the idea behind this is that software developers will train their models in standalone programs that are not delivered with their software, to which only the trained models are passed. I would like to integrate model training in a larger Windows forms solution. There the model will not be static, but the training data will grow with time, when new values are detected and stored in the database. So a regular training of the model will be required to enhance performance. This should be done within the Windows program, whenever a new chunk of data is stored in the database. In this Windows program a SqlConnection is open all the time, from beginning to the end. So wouldn't if make sense if DatabaseSource would have an additional constructor to pass this existing SqlConnection instead of opening a new one? It is clear that during the time when DatabaseLoader reads the data table, there should be no new data to be stored in the database, to avoid interlock effects. But knowing this, it might be implemented without problems. New data for the database are produced only in certain time windows during day, and the model training could take place at night. Will there be no problem at all to open a new SqlConnection within a program that already has another one open?

CESARDELATORRE commented 5 years ago

Adding @tannergooding and @eerhardt who implemented that class. What's the reason you want to have a "a SqlConnection is open all the time, from beginning to the end of the WinForms app"?

When using a DbConnection the best practice is to minimize the time the connection is open by closing it as soon as possible. Since commonly you can use 'Connection Pooling' you shouldn't worry about the time for re-opening additional connections.

But the main reason we're currently not allowing the user to provide an existing DbConnection I think it is because we need to make sure each new dataset of the data happens on a separate DbContext and DbConnection. Otherwise, a user could potentially try to use the same DbConnection for multiple IDataViews and different trainings. That's an issue since the fundamental problem is that DbConnection implementations (different RDBMS) are not thread-safe, and even worse, many of them don't even support MARS (multiple active result sets).

See the following related issue for more context: https://github.com/dotnet/machinelearning/issues/2159#issuecomment-513394402

Also this one although in that case it is talking about the .LoadFromEnumerable() instead of the DatabaseLoader: https://github.com/dotnet/machinelearning/issues/2159#issuecomment-513380218

Additionally, if you maintain a DbConnection open for a long time, chances of having that connection broken by any transient network issue is higher.

Therefore, what's the reason you want to have a "a SqlConnection is open all the time, from beginning to the end of the WinForms app"?

In any case, @tannergooding and @eerhardt, do you see any additional downsize/risk if we overload the constructor and allow to provide an existing DbConnection?

FranzBl commented 5 years ago

--> "What's the reason you want to have a "a SqlConnection is open all the time, from beginning to the end of the WinForms app"?"

The program inquires stock values from the web in real time, and every few seconds asks for new values that are stored straightforward in the database. I suppose it would reduce performance if I would always open and close SqlConnection every time a new value is detected. This may be necessary probably several times for each inquiry of the stock list from the web since several values might have changed. Of course, within this event Queue any model Training is deactivated for the user. Training should occur only when such a "watch loop" has come to an end. Therefore I might change the program so that the SqlConnection is open only during the watch loop and is closed afterwards. If the day is over, the user might use the new values for additional Training and open a new database Connection to do this. Ok.

Up to now, during any Tests, Breaking of the database Connection did not occur during the watch loop, even if it is running from morning to evening. The database is stored on the local Computer, no network accesses are necessary.

But I understand that you have to provide the packages so that they are safe for any Kind of possibly "silly" user. Opening and Closing the DB Connection during start and end of watch loop is a Little tricky, for there are more of them - "watch only" and "watch and predict" (using my own prediction algorithm that I want to enhance with ML) and "predict only". Additionally the user might add new values to a running watch loop (in this case DB Connection must not be opened for it already is open) and so on. But this is my problem, not yours, and Solutions of course are possible.

CESARDELATORRE commented 5 years ago

Closing the issue if it's been clarified, ok? Thanks! 👍

FranzBl commented 5 years ago

Ok

dotnet / machinelearning-samples

DatabaseSource requires new SqlConnection even if an open Connection exists! #727