Open JBris opened 11 months ago
@zur003 : FYI
We would like the option of using the existing interface to save data on distributed systems via network calls. We're already tied closely enough to DataTable to make that use case difficult.
I think such a change would be useful for the UI to be able to read from a db, but it should be a front end implementation (reading), not necessarily part of the writing interfaces.
In that instance, is there a need to use a wrapper around a Dapper implementation rather than using an ORM directly, i.e. Dapper vs EFCore? I don't know much about Dapper, but I'm pretty sure it hasn't been around as long as EFCore, which is directly supported by Microsoft. Looking at the SQLKata GitHub page, it looks like it gives you a syntax similar to the LINQ that EFCore provides out of the box.
The Firebird implementation was done with the idea of having a distributed system with a remote db server.
Thanks @her123 I didn't actually realise that was why it was added.
We weren't looking at using a db at all - we were intending to look at other file formats at the end point, such as NetCDF, for performance reasons. We haven't looked closely at it as yet though, and it may end up being a non-issue.
Do we actually need to read data from the db during the model runtime? Even if we do, it shouldn't be that difficult to have a separate interface for writing, should it?
My experience with NetCDF is that it would probably not be a great format for storing runtime information because of performance. There are other good alternatives, such as time series databases that support spatial data.
We have used NetCDF with data sets larger than 16 GB and found it to be excellent for use with visualisation tools. NetCDF is not really the point though - I don't think our interfaces should be restrictive to alternative implementation methods.
I don't have any issue with making them more tightly coupled from a front end perspective - but Apsim is run a lot without the GUI and the backend model should be kept as loosely coupled from the front end as possible.
@jbrider
I've discussed the idea of using the Entity Framework (and the idea of using an ORM) with @hol353, which I think makes a lot of sense if we want to incorporate an abstraction layer for different relational database backends.
My understanding is that the schema for APSIM's simulation output data is relatively flexible, and is defined at run-time. So an ORM would bring about some complications if we wanted to map our output data to a POCO (i.e. you would need to account for this flexibility when designing the schema of the POCO). Hence, the use of an ADO.NET DataTable, which provides a lot more flexibility. Having said that, it looks as though it's possible to dynamically build the schema of an Entity using the EF's Fluent API.
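To make the flexibility argument concrete, here's a minimal sketch (not APSIM code) of how a DataTable accommodates a schema defined at run-time - something a compile-time POCO can't do without contortions. The table and column names are hypothetical:

```csharp
using System;
using System.Data;

class Program
{
    static void Main()
    {
        // Report variables are only known at runtime, so the schema is
        // built dynamically rather than mapped to a compile-time class.
        var table = new DataTable("Report");

        // Hypothetical user-chosen report variables:
        table.Columns.Add("Clock.Today", typeof(DateTime));
        table.Columns.Add("Wheat.Grain.Wt", typeof(double));

        table.Rows.Add(new DateTime(1990, 1, 1), 0.42);

        foreach (DataColumn col in table.Columns)
            Console.WriteLine($"{col.ColumnName}: {col.DataType.Name}");
    }
}
```

Nothing about the columns is known until the user configures their report, which is exactly the case an ORM's static entity mapping struggles with.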
The rationale for adopting the use of an SQL query builder for both read/write operations is that the schema of APSIM's output data can be defined directly using a DataTable object. If it's possible to use LINQ as a query builder, I think it makes sense to go with that instead.
From the 10 minutes of reading I did about Dapper vs EFCore, it looks like Dapper would probably suit Apsim better given the nature of the output (mixed, unknown datatypes). I prefer EFCore, but more because I've been using it for a fair while.
I would stress that I don't think we should be adding any of that in the interfaces for the model outputting data - but it could make sense for the read/write functionality used in the front end.
I don't know the exact nature of the relationship between Report -> IDatabaseConnection, but would it be possible to have a simpler interface there, and then a more coupled interface for interacting (not just writing) with the DataStore?
I think the plan is to simply implement IDatabaseConnection using SQLKata so that different database backends can be used with APSIM. It won't be any more coupled than the current SQLite implementation of IDatabaseConnection is.
r.e. Dapper and EFCore: I can't see how we can use object mapping technologies to store data into a database backend. They require classes to be defined at compile time whereas APSIM allows the user to define variables (database fields) in REPORT at runtime. Maybe I'm missing something.
I would stress that I don't think we should be adding any of that in the interfaces for the model outputting data - but it could make sense for the read/write functionality used in the front end.
Not sure what this means. Also not sure what you mean by 'front end' - the GUI?
I don't know the exact nature of the relationship between Report -> IDatabaseConnection, but would it be possible to have a simpler interface there, and then a more coupled interface for interacting (not just writing) with the DataStore?
Yes it would be good to simplify IDatabaseConnection. Remember that lots of models in APSIM use this interface (report, graph, predictedobserved, excel ...) so changing the interface isn't trivial.
@hol353 From what I read about SQLKata on their GitHub page, they are implementing a wrapper around Dapper - so you will, in essence, be implementing an ORM.
Dapper appears to be able to handle using names rather than types - so it seems to be possible to use an ORM for that type of data - but I'm not sure what the performance cost might be. The general consensus (first few hits of a Google search) seems to be that Dapper is faster and more lightweight than EFCore - but that doesn't give us much insight into whether it will be slower than what we currently use.
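As a sketch of "names rather than types": Dapper's DynamicParameters lets you bind values by name without a mapped class, so a row with runtime-defined columns can be written generically. The helper, table name, and quoting scheme below are hypothetical, not APSIM code:

```csharp
using System.Collections.Generic;
using System.Data;
using Dapper;  // NuGet: Dapper

class ReportWriter
{
    // Insert a row whose column names are only known at runtime,
    // using DynamicParameters instead of a compile-time POCO.
    public static void WriteRow(IDbConnection db, string tableName,
                                IDictionary<string, object> row)
    {
        var parameters = new DynamicParameters();
        var columns = new List<string>();
        foreach (var kvp in row)
        {
            columns.Add(kvp.Key);
            parameters.Add(kvp.Key, kvp.Value);
        }

        // Note: quoting with [] is SQL Server/SQLite style; other
        // backends differ - which is the problem a query builder solves.
        string sql = $"INSERT INTO [{tableName}] ([{string.Join("], [", columns)}]) "
                   + $"VALUES (@{string.Join(", @", columns)})";
        db.Execute(sql, parameters);
    }
}
```

On the read side, Dapper's non-generic Query(sql) similarly returns dynamic rows keyed by column name, so no compile-time type is required in either direction.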
I was pretty sure that IDatabaseConnection was tied into lots of the program - but I think there could be a distinction between writing output from the model, and accessing/updating data in the DataStore by the GUI. Other than Report, all of those you mention are dealing with GUI requirements - so maybe changing Report's usage of the interface is not all that difficult - I'm not sure. At its core though, isn't Report just outputting a row of data at a time - an array of objects?
The DataStore is essential for the GUI, but we don't need to use it at all on our cluster runs, and the code we're using to extract the data is pretty clumsy at present - it would have to be faster if we were to avoid the DataStore completely. It's not our main bottleneck at present so we haven't spent much time looking into it, but given we're running hundreds of millions of simulations at a time, every bit will help. I'm sure other people are using it in similar environments.
I'm not suggesting the current implementation of saving to the data store needs to change - just creating a separate interface for outputting from the model - that would allow other possible implementations to be more easily added. It shouldn't be difficult for IDatabaseConnection to implement that interface in order to retain the current functionality without needing a lot of changes?
I am aware it's not trivial, but if someone is going to be changing IDatabaseConnection then it seems like the appropriate time to consider it.
I don't have a problem with SQLKata - it looks interesting and am keen to have a closer look at it.
It would clearly be good to have a single API wrapping a range of database engines. There would be no need to know whether "quoting" of strings is done with single quotes, double quotes, or square brackets, and different database engines could be readily swapped. So it's a great idea in principle. However, I can think of several reasons why we would want to proceed with caution:
None of these points are "show-stoppers", but we do need to make sure we don't sacrifice performance for coding efficiency.
Ignore that reference, the pull request attached the wrong issue number.
Thanks @zur003 - all good points. I think I understand @jbrider now as well. He is proposing that we need to be able to read/write data through a very simple software interface so that we can easily implement writing to say a .csv file or NetCDF or anything. In theory the 'writing' interface could just be a simple Write method that takes columns/rows of data as arguments. e.g.
interface IDataWriting
{
    void WriteTable(DataTable table);
}
This would be super simple to implement. In practice there are many nuances that make it complicated - some of which @zur003 outlines above. We should also have a simple 'reading' interface as well that supports paging of data. Quite a few models read their data via SQL that is quite specific to SQLite. This needs to change.
Yes, we need to proceed with caution and yes we need to make sure we don't sacrifice performance.
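To illustrate how small such a writing interface could be, here's a hypothetical CSV implementation of the IDataWriting sketch (quoting/escaping of values is deliberately omitted - one of the nuances mentioned above):

```csharp
using System.Data;
using System.IO;
using System.Linq;

// The simple writing interface sketched above.
interface IDataWriting
{
    void WriteTable(DataTable table);
}

// A hypothetical CSV implementation, showing how little the model
// would need to know about the storage backend.
class CsvWriter : IDataWriting
{
    private readonly string path;

    public CsvWriter(string path) { this.path = path; }

    public void WriteTable(DataTable table)
    {
        using (var writer = new StreamWriter(path))
        {
            // Header row from the runtime-defined columns.
            writer.WriteLine(string.Join(",",
                table.Columns.Cast<DataColumn>().Select(c => c.ColumnName)));

            foreach (DataRow row in table.Rows)
                writer.WriteLine(string.Join(",", row.ItemArray));
        }
    }
}
```

A NetCDF or network-backed implementation would plug in the same way, which is the loose coupling @jbrider is asking for.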
@jbrider
SQLKata doesn't have to make use of any ORM features - i.e. defining the schema of a data model at compile time, as @hol353 alluded to. It can be used purely as an SQL builder with anonymous objects or IEnumerables. A simple example is linked over here - this code can be altered to map APSIM output data to the columns and rows of a DataTable
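To sketch the "pure SQL builder" usage: SQLKata's Query.AsInsert accepts key/value pairs, so an insert can be built from a dictionary of runtime-defined columns with no entity class. The table and column names below are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using SqlKata;            // NuGet: SqlKata
using SqlKata.Compilers;

class Example
{
    static void Main()
    {
        // Columns are only known at runtime, so build the insert
        // from a dictionary rather than a compile-time class.
        var row = new Dictionary<string, object>
        {
            ["SimulationID"] = 1,   // hypothetical column names
            ["Yield"] = 3.21,
        };

        var query = new Query("Report").AsInsert(row);
        SqlResult result = new SqliteCompiler().Compile(query);

        Console.WriteLine(result.Sql);  // parameterised INSERT statement
        Console.WriteLine(string.Join(", ", result.Bindings));
    }
}
```

The same dictionary could just as easily be populated from the columns of a DataTable row, which is the mapping described above.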
@jbrider @hol353 @zur003 I agree with your other points
Describe the new feature
APSIM makes use of SQLite for storing simulation output data.
We currently have two concrete implementations of IDatabaseConnection: one for SQLite and one for Firebird.
By using SQLKata as an SQL query builder, we can dynamically construct SQL queries for different relational databases using a common class - i.e. a single concrete implementation of IDatabaseConnection.
SQLKata currently supports SqlServer, MySql, PostgreSql, Oracle, Sqlite and Firebird.
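As a sketch of the backend-swapping this enables, the same query definition can be compiled for different engines simply by changing the compiler, with each producing its own dialect and identifier quoting (the table and column names are hypothetical):

```csharp
using System;
using SqlKata;            // NuGet: SqlKata
using SqlKata.Compilers;

class Example
{
    static void Main()
    {
        // One query definition...
        var query = new Query("Report").Where("SimulationID", 1);

        // ...compiled for different backends. Each compiler emits
        // that engine's quoting and dialect.
        Console.WriteLine(new SqliteCompiler().Compile(query).Sql);
        Console.WriteLine(new PostgresCompiler().Compile(query).Sql);
        Console.WriteLine(new SqlServerCompiler().Compile(query).Sql);
    }
}
```

This is what allows a single concrete IDatabaseConnection implementation to target all of the listed backends.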