cr0hn / golismero-legacy

THIS IS A LEGACY VERSION PRESERVED FOR BACKUP, DO NOT USE
http://golismero-project.com
15 stars 4 forks source link

Migration to a NoSQL database #197

Open MarioVilas opened 10 years ago

MarioVilas commented 10 years ago

We are currently using a SQLite database with a wrapper to simulate an object-oriented graph database on top of it. We would greatly benefit from using a NoSQL database instead.

MarioVilas commented 10 years ago

This is what CouchDB looks like from Python: http://wiki.apache.org/couchdb/Getting_started_with_Python

But this greatly improves things: http://nicolaisi.github.io/couchquery/

Still, no graph support it seems, and the IDs are completely independent of our IDs. At some point we'd have to use JavaScript for some stuff.

MarioVilas commented 10 years ago

MongoDB seems pretty much identical: https://api.mongodb.org/python/current/tutorial.html

Apparently no JavaScript would be needed, all we need can be done from PyMongo.

There are also ORM-like layers on top of it, but I'm not sure they'd be too useful for us: http://api.mongodb.org/python/current/tools.html#orm-like-layers

MarioVilas commented 10 years ago

Bulbflow is a graph database API. Seems fairly simply to use, but requires a fixed model to work with, so it may be more complicated to use for us since having no schema would be a major advantage.

http://bulbflow.com/overview/

Underneath there's Neo4j (requires Java, that's a big no-no) or alternatively Rexster, which is also an API to other databases: https://github.com/tinkerpop/rexster/wiki

Skipping the ORM-like Bulbflow it probably not a good idea. This is what Rexster looks like on Python (yuck!): https://github.com/bdeggleston/rexpro-python

MarioVilas commented 10 years ago

Redis could be interesting. It's key-value based. http://redis.io/

There's a client for Python: https://github.com/andymccurdy/redis-py

A wrapper to implement graphs on top of it: https://pypi.python.org/pypi/redis_graph

And it also works as a backend to Bulbflow.

It can apparently be used as a communication channel too, but I need to look more into it: http://redis.io/commands/subscribe

MarioVilas commented 10 years ago

I haven't looked much into Titan, but it seems like an overkill: http://thinkaurelius.github.io/titan/

MarioVilas commented 10 years ago

FlockDB seems interesting, but there are two major drawbacks: 1) it's still unstable and 2) it's use case are shallow graphs only, and I'm not sure if we want to keep our graphs shallow forever. Having said that, it's also unlikely we'll often need graph operations past the first edge.

https://en.wikipedia.org/wiki/FlockDB

MarioVilas commented 10 years ago

OrientDB has an extension of the SQL syntax, so it may be easier to port our code: http://www.orientdb.org/

It also supports a REST API just like Mongo and Couch.

Unfortunately it's implemented in Java, and I'm not sure we want that. I've also found little documentation on how to access it from Python.

MarioVilas commented 10 years ago

This presentation goes through several graph databases for Python: http://www.slideshare.net/versae/graph-databases-in-python-pycon-canada-2012#btnNext

MarioVilas commented 10 years ago

Buzhug could be useful, albeit limited. The main advantage is it's pure Python. http://buzhug.sourceforge.net/

MarioVilas commented 10 years ago

Another pure-Python database, PyDBLite: http://www.pydblite.net/en/PyDbLite.html

Seems to be identical to Buzhug.

MarioVilas commented 10 years ago

MetaKit seems completely uninteresting, it's like SQLite but without SQL language support. I don't even understand why it exists. http://equi4.com/metakit/python.html

MarioVilas commented 10 years ago

CodernityDB is fast and pure-Python, but I can't quite grasp the design. Seems like a weird offspring of BerkeleyDB, and it exposes way too many implementation details. I think it may be more suitable for other tasks. It's also still in beta. http://labs.codernity.com/codernitydb/quick.html

MarioVilas commented 10 years ago

HTD, another sqlite-like database. Not quite sure what's the point of this either. http://www.hayzentech.co.uk/htd/doc/3.1/htd-core/examples.html

MarioVilas commented 10 years ago

Kineta is a document oriented database similar to Mongo or Couch, but it's implemented using BerkeleyDB under it. Seems interesting. No idea how well it performs, though. https://code.google.com/p/kineta/wiki/GettingStarted

MarioVilas commented 10 years ago

Ok, summing up. The best alternative seems to be Bulbflow. However it requires a more complicated setup, so it's not really a replacement for the current SQLite implementation. The ones that could replace SQLite don't offer enough in return to justify the migration. So the best solution is to keep both the current implementation and a new one based either on Bulbflow or on a standard SQL database using an ORM.

To reduce the costs of maintenance we could drop the in-memory database implementation, and just use SQLite in-memory databases instead. This would involve mostly removing code, not adding new one, since we're explicitly avoiding the SQLite in-memory implementation out of performance concerns - but those concerns haven't proven justified.

This also means one of the main gains of migrating out of SQLite (consistency checks) will have to be implemented in the SQLite DAO now. This probably involves getting the links out of the Data objects and into a many-to-many table.

Additionally (and this had to be done anyway) we need to make sure all database writes are done in batches, so there are not dangling references. To ease the transition we could consider a broken reference a warning, not an error, and fix it later when we find the reference.

jekkay commented 10 years ago

The installation of mogodb is very simple. I write a simple plugin that support mongodb with pymongo, and works well. As my local code is much different from develop repo, and i have not much time to test on latest code, so i have not commit the code. I will be glad to share it by email if needed.

MarioVilas commented 10 years ago

One of the things we aim at is making GoLismero work without installation, if possible, even if some features are missing - that's why I'm still not convinced of dropping SQLite.

Mongo and similar databases seem quite interesting! I was hoping for something with built-in support for graphs though :) but at some point we may want to integrate MongoDB as an alternative to SQLite.

So far the one I liked the most was Bulbflow, but it seems to require a lot of setup. It may be interesting for making a Linux distro with everything built in and configured, but again, not as a replacement of SQLite.

cr0hn commented 10 years ago

Hi guys,

I think we should make an abstraction layer for NoSQL databases using a DAO pattern. The most common operations are shared between all nosql databases. Using these DAO pattern we can change between databases.

As Mario says, one of the goal of the project is to no have binary dependences.

After read the options, I think the CodernityDB can be better option:

What do you think?

El 06/12/2013, a las 15:30, Mario Vilas notifications@github.com escribió:

One of the things we aim at is making GoLismero work without installation, if possible, even if some features are missing - that's why I'm still not convinced of dropping SQLite.

Mongo and similar databases seem quite interesting! I was hoping for something with built-in support for graphs though :) but at some point we may want to integrate MongoDB as an alternative to SQLite.

So far the one I liked the most was Bulbflow, but it seems to require a lot of setup. It may be interesting for making a Linux distro with everything built in and configured, but again, not as a replacement of SQLite.

— Reply to this email directly or view it on GitHub.

MarioVilas commented 10 years ago

There's already a DAO :)

https://github.com/cr0hn/golismero/blob/master/golismero/database/auditdb.py

I didn't like Codernity, the design seems weird and it's still a beta. Plus I found no benefit over the current implementation (does the same things only faster, because SQLite is in C) and SQLite can be used with Django or anything else, Codernity doesn't use any standards that I know of (please correct me if I'm wrong).

cr0hn commented 10 years ago

Really I not tested CodernityDB. Maybe we must test it before.

A nosql database can benefit that we can search in database without deserialize each data before.

Add an ORM (I like this option so much), like django ORM, may be the great option, because a change of the database engine is trivial. Other benefits of use the ORM is that, all the plugins that uses databases (currently only OpenVAS, and it uses sqlite3) can use the same database server.

El 06/12/2013, a las 18:12, Mario Vilas notifications@github.com escribió:

There's already a DAO :)

https://github.com/cr0hn/golismero/blob/master/golismero/database/auditdb.py

I didn't like Codernity, the design seems weird and it's still a beta. Plus I found no benefit over the current implementation (does the same things only faster, because SQLite is in C) and SQLite can be used with Django or anything else, Codernity doesn't use any standards that I know of (please correct me if I'm wrong).

— Reply to this email directly or view it on GitHub.

MarioVilas commented 10 years ago

Yes, but the point isn't just to persist the data, but to have the database do the integrity checks as well. Graph traversal isn't very efficient with SQL, but we're managing so far. The biggest drawback of SQL is having to use an ORM - they don't help us at all! We don't have a fixed schema, and in order to use the Django SQL we'd need something like what's described in ticket #99, which is a lot of work.

On the other hand, most of the document based databases (mongo, couch) are so extremely similar we won't have problems switching between them, and there's no need for an ORM - objects would be stored directly. Django can also operate with those databases (not so with Codernity as far as I know). The drawback is we'll need to implement the integrity checks and the graph traversal ourselves, like we do with SQL, see ticket #167.

Graph databases behind Bulbflow also have a single interface so we would be able to switch them with no effort too, and they also don't impose a fixed schema. SQL is the big loser here.

MarioVilas commented 10 years ago

Another thing to consider: eventual consistency. Right now we're serializing everything on the Orchestrator, so all RPC calls and database accesses are sequential. We could think about how to work with eventual consistency instead (that is, plugins would truly work in parallel for most things, and the database would solve conflicts in parallel too).

In theory nothing should change for us if we just send the output of plugins to a background process that solves the conflicts (since it has to be done in Python, we can't just offload that to the database engine). To make it scale when we add support for nodes, it should also be possible to solve some conflicts within a node before sending the data to the master node. All this would make some of our merge strategies meaningless, since we wouldn't have "before" and "after" anymore, so merges would have to be based entirely on the contents of the data and not on the order it arrives.