ckan / ideas

[DEPRECATED] Use the main CKAN repo Discussions instead:
https://github.com/ckan/ckan/discussions
40 stars 2 forks source link

Switch to MongoDB #82

Closed davidread closed 9 years ago

davidread commented 10 years ago

Why not replace our database with a JSON-oriented store like MongoDB? By removing the dictization code surrounding most db calls, it would remove a bit of code complexity and really speed up lots of operations which are currently not scaling well. And making the dataset structure more flexible will help mapping properly to hierarchical metadata standards found in DCAT and ISO19119.

It would be great to get thoughts on this. I guess it's weighing up whether the effort is worth the advantages. BTW I have no current experience of Mongo, but hopefully the gist is right.

Advantages:

To discuss:

Work:

Disadvantages:

adamamyl commented 10 years ago

mumbles something about "maybe one for CKAN 3.x, especially if CKAN3 gets written from scratch" (personal view, not an OKF one).

wardi commented 10 years ago

More use of postgres JSON features sounds good, like a single JSON field for extras instead of package_extras etc. I quite like postgres, we just need to be using it better.

rossjones commented 10 years ago

I think the idea is worthy of some experimentation. I agree with @adamamyl that fitting it into the current release cycle would be prohibitively hard/time-consuming/expensive, so perhaps we should have a prototype as the basis for discussion on what 3.0 looks like. I'm not overly keen on the idea of having to install postgres and mongo (some other document store) side-by-side. Surely one of the targets for 3.0 should be simpler install?

Overall I'm a +1 on trying it out. Not necessarily because I think performance improvements will come from using mongo (I'm sure by the time journalling and fsync etc are turned on it won't be) but as David points out, from the amount of code that could be deleted.

Aside: How much of spatial actually requires postgis? How much could be done with Solr or ES geo filters?

teajaymars commented 10 years ago

Thanks for writing this up David. My perspective: If I were to start writing CKAN from scratch today, I would without question use a Document Database rather than something relational. Because in my experience CKAN is mostly about storing and retrieving a few kinds of documents, and executing Solr search across them. High-performance querying or clever join queries aren't a common use case; in fact the most common thing people do with CKAN is to populate package_extras and resource_extras because each new deployment requires a slightly different data model.

In fact, the existence of those *_extras fields (and their extreme usefulness) really hints that this is a flexible, extensible document storage engine with an awesome frontend and schemas and validation and all the other valuable features. Very few features take advantage of the relational data model, and none of them require it.

Switching to Mongo would make the application stack much more consistent and much easier to extend. (ie. easier for everybody who isn't core CKAN team!). Also to echo @rossjones we could delete a lot of code :-)

amercader commented 10 years ago

Even if MongoDB fitted the JSON side of things nicely there is still an awful lot of relational things on CKAN: orgs membership, ...

I think we could gain a lot by exploring Postgres JSON field, without having to start a refactoring that is unlikely to happen soon.

Aside: How much of spatial actually requires postgis? How much could be done with Solr or ES geo filters?

Not much, in fact if you are using the newest Solr based search backends (which you should if you have a significant number of datasets) you could perfectly avoid completely the need for PostGIS. I'd be happy to implement this if it helped / was useful

teajaymars commented 10 years ago

@amercader I'd need more persuading that those relationships really constitute a hard requirement for a relational database. You can just store an array of organisation IDs on a user document. The less common and more complex case is where you ask for "all the users who are members of this organisation", but MongoDB supports this -- you'd ask for all user documents where the "membership" array contains the given organisation ID.

A disclaimer; I've not yet built a project on a document database. But they seem to be targeted at the problem of extensible data models, projects which get really messy when you try to force them into a relational database. I agree the refactor would be so hard that it's basically academic, but -- academically -- I'd say CKAN's architecture would be much better served by a document DB than a relational DB. Unless there's something genuinely high-performance or infeasible which requires Postgres.

wardi commented 10 years ago

-1 from me, even for a hypothetical CKAN 3.0

I have much more faith in the skill of the postgres developers. If I lose data I'd like it to be my fault, not my data store's fault. Also it seems any time someone tries an apples-to-apples comparison of postgres and mongo, postgres ends up faster regardless.

Postgres already supports JSON text (faster read/write) and binary (faster select operations) and general-purpose indexes are rapidly improving (indexes on individual json elements already supported).

On the python side, SQLAlchemy is just awesome. We could use it better to get real atomicity when updating multiple complex related objects. With a small amount of work we can also get some really big performance gains.

Lastly, for system tools and developer expertise I doubt any data store will be able to match what's available for postgres.

jpmckinney commented 9 years ago

In my experience, I've deleted more code switching from Mongo to Postgres than the other way around. Mongo is terrible at ensuring data quality - you have do to it all yourself. With Postgres, you have foreign keys, for example, which are easy to take for granted until you start writing all sorts of code to ensure your Mongo database stays consistent. (Also, have you ever tried migrating Mongo? Remember it has no transactions.) @wardi is right - if you use Postgres correctly, you can cut code and have a good quality DB.

I think there are also too many extensions to CKAN to change something as fundamental as the DB.

I'll add that if the plan is to have both Mongo and Posgres - well, now you have two problems.

davidread commented 9 years ago

This has been a worthwhile look at the issues and I have come to some conclusions in favour of keeping Postgresql:

So I'm closing this. Thanks all.