APSIMInitiative / ApsimX

ApsimX is the next generation of APSIM
http://www.apsim.info

SQLKata integration for storing simulation outputs with different relational databases #8159

Open JBris opened 11 months ago

JBris commented 11 months ago

Describe the new feature

APSIM makes use of SQLite for storing simulation output data.

We currently have two concrete implementations of IDatabaseConnection: one for SQLite and one for Firebird.

By using SQLKata as an SQL query builder, we can dynamically construct SQL queries for different relational databases using a common class - i.e. a single concrete implementation of IDatabaseConnection.

SQLKata currently supports SqlServer, MySql, PostgreSql, Oracle, Sqlite and Firebird.
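
For illustration, here is a minimal sketch (not APSIM code; the table and column names are invented) of how a single SQLKata query definition can be compiled to dialect-specific SQL for two different backends:

using SqlKata;
using SqlKata.Compilers;

// One query definition, compiled for two different database engines.
var query = new Query("Report")
    .Select("SimulationID", "Yield")
    .Where("SimulationID", 1);

SqlResult sqlite = new SqliteCompiler().Compile(query);
SqlResult postgres = new PostgresCompiler().Compile(query);

// sqlite.Sql and postgres.Sql now hold parameterised SQL in each dialect,
// with the parameter values in the corresponding Bindings lists.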

hol353 commented 11 months ago

@zur003 : FYI

jbrider commented 11 months ago

We would like the option to use the existing interface to save data on distributed systems using network calls. We're already tied closely enough to DataTable to make that use case difficult.

I think such a change would be useful for the UI to be able to read from a db, but it should be a front end implementation (reading), not necessarily part of the writing interfaces.

In that instance, is there a need to use a wrapper around a Dapper implementation rather than using an ORM directly, i.e. Dapper vs EFCore? I don't know that much about Dapper, but I'm pretty sure it hasn't been around as long as EFCore, which is directly supported by Microsoft. Looking at the SQLKata github page, it looks like it gives you a syntax similar to the LINQ that EFCore provides out of the box.

her123 commented 11 months ago

The Firebird implementation was done with the idea of having a distributed system with a remote db server.

jbrider commented 11 months ago

Thanks @her123, I didn't actually realise that was why it was added.

We weren't looking at using a db at all - we were intending to look at other file formats at the end point, such as NetCDF, for performance reasons. We haven't looked closely at it as yet though, and it may end up being a non-issue.

Do we actually need to read data from the db during the model runtime? Even if we do, it shouldn't be that difficult to have a separate interface for writing, should it?

her123 commented 11 months ago

My experience with NetCDF is that it would probably not be a great format for storing runtime information because of performance. There are other good alternatives, such as time series databases that support spatial data.

jbrider commented 11 months ago

We have used NetCDF with data sets larger than 16 GB and found it to be excellent for use with visualisation tools. NetCDF is not really the point though - I don't think our interfaces should restrict alternative implementation methods.

I don't have any issue with making them more tightly coupled from a front end perspective - but Apsim is run a lot without the GUI and the backend model should be kept as loosely coupled from the front end as possible.

JBris commented 11 months ago

@jbrider

I've discussed the idea of using the Entity Framework (and, more generally, an ORM) with @hol353, which I think makes a lot of sense if we want to incorporate an abstraction layer over different relational database backends.

My understanding is that the schema for APSIM's simulation output data is relatively flexible, and is defined at run time. So an ORM would bring about some complications if we wanted to map our output data to a POCO (i.e. you would need to account for this flexibility when designing the schema of the POCO). Hence the use of an ADO.NET DataTable, which provides a lot more flexibility. Having said that, it looks as though it's possible to dynamically build the schema of an Entity using EF's Fluent API.
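
For example, a run-time-defined schema is trivial to express with a DataTable (an illustrative sketch; the column names are invented):

using System;
using System.Data;

// Columns are added dynamically, reflecting whatever the user put in REPORT;
// nothing about the schema needs to be known at compile time.
var table = new DataTable("Report");
table.Columns.Add("SimulationID", typeof(int));
table.Columns.Add("Yield", typeof(double));

DataRow row = table.NewRow();
row["SimulationID"] = 1;
row["Yield"] = 2450.3;
table.Rows.Add(row);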

The rationale for adopting the use of an SQL query builder for both read/write operations is that the schema of APSIM's output data can be defined directly using a DataTable object. If it's possible to use LINQ as a query builder, I think it makes sense to go with that instead.

jbrider commented 11 months ago

From the 10 minutes of reading I did about Dapper vs EFCore, it looks like Dapper would probably suit Apsim better given the nature of the output (mixed, unknown datatypes). I prefer EFCore, but more because I've been using it for a fair while.

I would stress that I don't think we should be adding any of that in the interfaces for the model outputting data - but it could make sense for the read/write functionality used in the front end.

I don't know the exact nature of the relationship between Report -> IDatabaseConnection, but would it be possible to have a simpler interface there, and then a more coupled interface for interacting (not just writing) with the DataStore?

hol353 commented 11 months ago

I think the plan is to simply implement IDatabaseConnection using SQLKata so that different database backends can be used with APSIM. It won't be any more coupled than the current SQLite implementation of IDatabaseConnection is.

Re: Dapper and EFCore: I can't see how we can use object mapping technologies to store data into a database backend. They require classes to be defined at compile time, whereas APSIM allows the user to define variables (database fields) in REPORT at runtime. Maybe I'm missing something.

I would stress that I don't think we should be adding any of that in the interfaces for the model outputting data - but it could make sense for the read/write functionality used in the front end.

Not sure what this means. Also not sure what you mean by 'front end' - the GUI?

I don't know the exact nature of the relationship between Report -> IDatabaseConnection, but would it be possible to have a simpler interface there, and then a more coupled interface for interacting (not just writing) with the DataStore?

Yes it would be good to simplify IDatabaseConnection. Remember that lots of models in APSIM use this interface (report, graph, predictedobserved, excel ...) so changing the interface isn't trivial.

jbrider commented 11 months ago

@hol353 From what I read about SQLKata on their github page they are implementing a wrapper around Dapper - so you will in essence be implementing an ORM.
Dapper appears to be able to handle using names rather than types - so it seems to be possible to use an ORM for that type of data - but I'm not sure what the performance cost might be. The general consensus (from the first few hits of a Google search) seems to be that Dapper is faster and more lightweight than EFCore - but that doesn't give us much insight into whether it will be slower than what we currently use.
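
For what it's worth, a minimal sketch of Dapper's untyped API shows what "names rather than types" looks like (the Microsoft.Data.Sqlite connection and the Report table name are assumptions, not APSIM code):

using System.Collections.Generic;
using Dapper;
using Microsoft.Data.Sqlite;

using var connection = new SqliteConnection("Data Source=output.db");

// Dapper's non-generic Query returns rows addressable by column name,
// so no POCO is declared at compile time.
foreach (IDictionary<string, object> row in connection.Query("SELECT * FROM Report"))
{
    object yield = row["Yield"]; // column resolved by its run-time name
}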

I was pretty sure that IDatabaseConnection was tied into lots of the program - but I think there could be a distinction between writing output from the model, and accessing/updating data in the DataStore by the GUI. Other than Report, all of those you mention are dealing with the GUI requirements - so maybe changing Report's usage of the interface is not all that difficult - I'm not sure. At its core though, isn't Report just outputting a row of data at a time - an array of objects?

The DataStore is essential for the GUI, but we don't need to use it at all on our cluster runs, and the code we're using to extract the data is pretty clumsy at present - and it would have to be faster if we were to avoid the DataStore completely. It's not our main bottleneck at present so we haven't spent much time looking into it, but given we're running hundreds of millions of simulations at a time, every bit will help. I'm sure other people are using it in similar environments.

I'm not suggesting the current implementation of saving to the data store needs to change - just creating a separate interface for outputting from the model - that would allow other possible implementations to be more easily added. It shouldn't be difficult for IDatabaseConnection to implement that interface in order to retain the current functionality without needing a lot of changes?

I am aware it's not trivial, but if someone is going to be changing IDatabaseConnection then it seems like the appropriate time to consider it.

I don't have a problem with SQLKata - it looks interesting and I'm keen to have a closer look at it.

zur003 commented 11 months ago

It would clearly be good to have a single API wrapping a range of database engines. There would be no need to know whether "quoting" of strings is done with single quotes, double quotes, or square brackets, and different database engines could be readily swapped. So it's a great idea in principle. However, I can think of several reasons why we would want to proceed with caution:

  1. Limitations: All databases have limitations. For example, earlier versions of Firebird limited the length of column names to 31 characters, but we regularly have column names like "Barley.Phenology.CurrentStageName" which exceed that limit. I worked around that with Firebird by using a separate table to keep track of "long" column names, but that was a nuisance, and hard to get right. The current version of Firebird has increased that limit to 63 characters (more specifically, I think it's 63 UTF-8 encoded bytes so sometimes less than 63 characters), which is probably long enough for our purposes. But the point remains that there may be limitations of this sort which might constrain us in ways that are not immediately obvious.
  2. Efficiency: A lot of effort has gone into making database access as fast as possible in APSIM, but much of that effort is specific to SQLite. For example, PagedDataProvider.cs makes use of SQLite's "rowid" field, which is specific to SQLite (see the sketch after this list). We are likely to lose some performance if we move to an environment where optimisations of this sort cannot readily be made.
  3. (or maybe 2a). Prepared queries: I'm not sure whether SQLKata makes provision for prepared queries, which should generally be faster than non-prepared counterparts for repeated operations (like insertion of new rows of data).
  4. (or maybe 2b). Concurrency: Is SQLKata thread safe? If so, can it readily make use of multiple threads, or is safety obtained by blocking so that only one thread is doing useful work at a time? If multi-threaded, does each thread require a separate connection?
  5. Transactions: Does SQLKata support transactions? Well obviously it must underneath, but does it provide user access to transaction logic?
  6. Schema querying: Does SQLKata provide a mechanism for getting information like the names of tables within a database, and the names of columns within a table? (I suspect that it does, but I don't know for sure.) That is essential for us.
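
To make point 2 concrete, this is the sort of SQLite-specific query being referred to (illustrative only, not the actual PagedDataProvider.cs code):

// "rowid" is an implicit SQLite column with no portable equivalent,
// so a generic query builder cannot express this paging directly.
string sql = "SELECT * FROM Report WHERE rowid >= @start LIMIT @pageSize";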

None of these points are "show-stoppers", but we do need to make sure we don't sacrifice performance for coding efficiency.

par456 commented 11 months ago

Ignore that reference, the pull request attached the wrong issue number.

hol353 commented 11 months ago

Thanks @zur003 - all good points. I think I understand @jbrider now as well. He is proposing that we need to be able to read/write data through a very simple software interface so that we can easily implement writing to say a .csv file or NetCDF or anything. In theory the 'writing' interface could just be a simple Write method that takes columns/rows of data as arguments. e.g.

interface IDataWriting
{
    void WriteTable(DataTable table);
}

This would be super simple to implement. In practice there are many nuances that make it complicated - some of which @zur003 outlines above. We should also have a simple 'reading' interface as well that supports paging of data. Quite a few models read their data via SQL that is quite specific to SQLite. This needs to change.
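
A matching 'reading' interface might be equally small; the names here are hypothetical, just to illustrate the shape:

using System.Data;

// Hypothetical paged-read counterpart to IDataWriting: fetch rowCount rows
// of the named table starting at startRow, rather than the whole table.
interface IDataReading
{
    DataTable ReadTable(string tableName, int startRow, int rowCount);
}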

Yes, we need to proceed with caution and yes we need to make sure we don't sacrifice performance.

JBris commented 11 months ago

@jbrider

SQLKata doesn't have to make use of any ORM features - i.e. defining the schema of a data model at compile time, as @hol353 alluded to. It can be used purely as an SQL builder with anonymous objects or IEnumerables. A simple example is linked over here - this code can be altered to map APSIM output data to the columns and rows of a DataTable.
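
To make that concrete, here is a hedged sketch (assuming SQLKata's multi-row AsInsert overload; the BuildInsert helper is hypothetical, not APSIM code) of building an INSERT directly from a DataTable, with no compile-time POCO:

using System.Data;
using System.Linq;
using SqlKata;
using SqlKata.Compilers;

// Table and column names come from the DataTable itself, which APSIM
// defines at run time; the compiler emits parameterised, dialect-specific SQL.
static SqlResult BuildInsert(DataTable table, Compiler compiler)
{
    var columns = table.Columns.Cast<DataColumn>().Select(c => c.ColumnName);
    var rows = table.Rows.Cast<DataRow>().Select(r => r.ItemArray.AsEnumerable());
    return compiler.Compile(new Query(table.TableName).AsInsert(columns, rows));
}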

@jbrider @hol353 @zur003 I agree with your other points.