NLeSC / guide

Software Development Guide
https://guide.esciencecenter.nl
Creative Commons Attribution 4.0 International
47 stars 30 forks source link

(Python) database tools / chapter? #316

Closed egpbos closed 1 month ago

egpbos commented 1 year ago

The Python chapter contains a section "Database Interface" that doesn't seem helpful. As far as I know, what we use from Python to access databases is typically SQLAlchemy, as @bouweandela suggested. I have used this personally in the past, but it's been a long time and I don't feel confident enough on this topic to write more than just "SQLAlchemy good".

Does anyone with more recent experience want to write something about this? @HannoSpreeuw maybe? Needn't be more than a single sentence, but if you have good tips ("ALWAYS rtfm first" or "NEVER rtfm, it's horrible, try this instead" or...), that would be great as well.

Also, if anyone disagrees with the above assessment (that the currently listed database interface tools are essentially useless to our typical usecases), please correct me!

egpbos commented 1 year ago

Btw, for the historians among you, it seems like the original contributor of this list is lost in the mists of time ;) https://github.com/NLeSC/guide/blame/52cf8b2fe97f571cff745f3d8a60d915deb5b4dd/software/language_specific_info.md

HannoSpreeuw commented 1 year ago

Sorry, I do have some experience with SQLAlchemy from PADRE, but that was mainly about getting it to work after a Python 2 --> 3 conversion. What you need is an opinion about whether SA is the preferred Object Relational Mapper. I truly don't know.

egpbos commented 1 year ago

How about @suvayu?

I think @LourensVeen has the theoretical background on ORMs, but is not familiar with SA, or at least that's the last I heard... Perhaps still enough to make a sensible recommendation on this?

LourensVeen commented 1 year ago

I have extensive but outdated experience with PostgreSQL, and I'm using sqlite3 from Python directly through its SQL-based API. I could probably do some research and write a reasonable recommendation, but I'm also swamped again, so at the earliest in 2024...

suvayu commented 1 year ago

I'm guessing this is the section you are referring to? My experience with both databases and SQLAlchemy is only at the level "I know enough to debug, and spot obvious mistakes".

That said, one (somewhat) low effort approach could be someone moderately experienced go through related awesome lists and select a few recommendations.

Another point would be, not to stop at just interfaces, but mention other related tools like db documentation, query optimisers/explainers, database modelling hints, and convenience utilities like DBCli.

I'll be back next week. I can help as long as someone else also joins me.

egpbos commented 1 year ago

@LourensVeen so what you are saying is that sqlite3 actually is a useful one to keep there?

@suvayu ok, this is an interesting direction. It's sounding more like a chapter on databases, though. Perhaps actually something like that is better than listing random libraries that nobody may ever use? I also remember from my brief time doing MySQL + SQLAlchemy that we ended up spending a lot of time just doing raw SQL and indeed interfacing directly with the MySQL CLI tool or e.g. through good old phpMyAdmin.

I would say then that the database interfaces list can simply be removed from this chapter. If people know the general way of working with databases and they need either an ORM or some direct interface between the DB they chose and the language they use, then Google will deliver it to them. Ok, maybe it makes sense to list popular ORMs in such a chapter, among which would be SQLAlchemy, but that would be the extent of the Python-/language-specific needs then.

Agree? Disagree? Things I missed?

egpbos commented 1 year ago

@wrvhage What does the Data SIG think, is this something that makes sense?

Note btw that new Guide contribution guidelines will soon be online, so please don't write a database chapter yet ;)

LourensVeen commented 1 year ago

sqlite is excellent, and definitely recommended if you have a use case that fits it. For lots of data or complex queries or if you need a separate server, then PostgreSQL is better, but it does have a separate server so it makes your application context more complex. And then of course there are column stores and NoSQL databases that may or may not be a better match than SQL. SQL often gets overlooked these days, which is a shame because there are definitely still lots of use cases where it's just the right solution.

suvayu commented 1 year ago

My impression is for research use cases, these days duckdb is probably a good choice, since the data is usually static and it's more about analytics.

f-hafner commented 10 months ago

I came across this thread by chance, and thought I added my two cents from one of my previous research projects. There, we've been using sqlite extensively and over time built a database of almost 1TB with many tables and relatively complex queries.

What we have been struggling is aggregation queries -- they are quite slow in sqlite for this size of db. I recently experimented with DuckDB, and I could connect to the sqlite existing database and perform some aggregation queries 10x faster than through sqlite.

In short, I think combining DuckDB and sqlite could be promising in many use-cases. I guess @suvayu is right that the more static the data, the more useful is DuckDB. But if you are building a db from scratch or adding a lot of tables (which at least in the social sciences is not uncommon), I think sqlite is better because it can persist row indices which I think DuckDB cannot. For analytics, it seems one can then still use DuckDB, but perhaps one could explore the combination of the two a bit more.

suvayu commented 10 months ago

On Thu, 25 Jan, 2024, 11:10 f-hafner, @.***> wrote:

In short, I think combining DuckDB and sqlite could be promising in many use-cases. I guess @suvayu https://github.com/suvayu is right that the more static the data, the more useful is DuckDB. But if you are building a db from scratch or adding a lot of tables (which at least in the social sciences is not uncommon), I think sqlite is better because it can persist row indices which I think DuckDB cannot. For analytics, it seems one can then still use DuckDB, but perhaps one could explore the combination of the two a bit more.

Interesting point about row indices. AFAIU, DuckDB doesn't support indices because it doesn't need it. Since it's an OLAP db, the query execution engine is designed to query any column without relying on indices like traditional DBs (this is based on a Q&A with one of the devs at a PyData talk). It depends a lot more on column types instead.

@f-hafner, if you are up for it, we could do a test with your example db. I think instead of just connecting to the sqlite db from duckdb, we can import the data and compare query performance.

Unfortunately all public comparisons are too synthetic to base general recommendations for real world use.

-- Suvayu

f-hafner commented 10 months ago

I think I tried to replace the entire sqlite db with DuckDB in the past, but then abandoned the idea because I could not persist the row indices, but perhaps I should have tried a little more.

so yes, @suvayu , let's have a look together. I can't share the data publicly, but can give you access or prepare an extract that I can share.

f-hafner commented 7 months ago

@suvayu and I will work on the following

Issues to address for db comparisons duckdb vs sqlite

egpbos commented 1 month ago

With the datasets chapter merged and with #339 opened (whether or not it is merged doesn't really matter, we will decide in the PR), this issue can be closed, thanks all for the discussion! For new additions, or if you want to revisit some particular points from this thread, feel free to open a new dedicated issue (this one has become a bit tangled).