In several places we need a large amount of constant data to correlate vulnerabilities, etc. For example the OpenVAS plugin uses a database to match OpenVAS plugins to GoLismero vulnerabilities. We'll also need a database for other features like reverse CPE lookups, or matching MAC addresses to hardware vendors, or correlating the different vulnerability IDs (CVE, OSVDB, etc.).
My proposal is to create a single mechanism to access all of this information in one place. The exact programming interface and implementation are up for discussion of course :) but here are my thoughts on it:
The /data folder can be used to store all this data as binary files. Our code could detect the presence of database files there and load them more or less automatically.
SQLite3 can be used for this, as it already ships with Python, it's fast, and it can pack everything into a single file if needed. AnyDBM is also fast, though not as fast, and definitely not as flexible. CodernityDB could also be used if it proves to be faster (unlike the Audit database, compatibility here won't matter, since we control the database, not the user). Server databases would be a bad idea, IMHO.
Updates would then consist of downloading new versions of this (or these) file(s). We could use Git for this, but it's probably best to have GoLismero download the file(s) from our web page instead.
In principle we could put everything into a single file for performance reasons, but that could make updates slower and the whole thing harder to extend for plugins. On the other hand, using multiple files means keeping several files open all the time, which could be wasteful in memory. I guess some benchmarks are needed.
I think it's best not to use an ORM for this one. We wouldn't benefit much from it, since we can't pass the ORM objects around to the plugins anyway (they're not serializable), and if the plugins load the database locally then we'd use a lot more memory and resources. Let's keep it as lightweight as we can, the queries should be stupid simple anyway. :)
Also we already have code for handling SQLite3 transactions automatically.
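Just to illustrate, the pattern is roughly this. This is a made-up minimal sketch, not the actual code from common.py, and the class and method names are invented:

```python
import sqlite3

class TransactionalDB(object):
    """Sketch of automatic SQLite3 transaction handling:
    commit when the query succeeds, roll back when it doesn't."""

    def __init__(self, filename):
        self.__db = sqlite3.connect(filename)

    def query(self, sql, params=()):
        cursor = self.__db.cursor()
        try:
            rows = cursor.execute(sql, params).fetchall()
            self.__db.commit()       # everything went well
            return rows
        except:
            self.__db.rollback()     # undo on any error, then re-raise
            raise
        finally:
            cursor.close()
```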
From GoLismero's point of view, the database would be strictly read-only. And when updating we should just download files, without having to understand what's in them. (Using SQL scripts for updates, for example, seems to me like a bad idea in the long run.)
So far our use cases would be (add as needed):
MAC address to vendor lookup: this is so simple it can even be done with AnyDBM (see the sketch after this list). About 1 MB of data.
CPE lookups: this is more complex, and probably needs a database capable of "fuzzy" searches. Maybe we can find a shortcut for specific cases (for nmap, for instance, we could probably extrapolate all possible output values from the nmap sources).
Vuln ID correlation: easy to query, the complicated part is creating the database, not using it. Large dataset, very frequent updates.
OpenVAS correlation: easy to query, we have the script to create the database already (it uses Django's ORM, but it should be simple to remove that if needed). Small dataset for now, but it could grow as we add support for multiple versions of OpenVAS. Rarely updated (when new versions are added, or new OpenVAS plugins are written). Unlike the above cases, this one belongs strictly to a plugin, not to the core.
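As an example of how simple the MAC case is, here's what an AnyDBM-backed lookup could look like. The file name and key format below are made up for the example:

```python
import anydbm

def lookup_mac_vendor(mac):
    """Map a MAC address to its vendor using the OUI (first 3 octets).
    Assumes a "data/mac_vendors.db" file with keys like "001A2B"."""
    db = anydbm.open("data/mac_vendors.db", "r")   # strictly read-only
    try:
        oui = mac.upper().replace(":", "").replace("-", "")[:6]
        try:
            return db[oui]
        except KeyError:
            return None                            # unknown vendor
    finally:
        db.close()
```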
So, here's a proposed implementation:
A base class to handle a common interface for all databases, and automatic handling of SQLite3 transactions. We already have this in /golismero/database/common.py so 0 effort here! ;)
A function to look up database files in /data (it can be done in 5 lines of Python, no biggie either; see the first sketch after this list).
Somewhere on our web page, a place to host the database files so they can be downloaded. (I suppose GitHub could be used instead, I don't really mind, whatever works best.) The download location should be configurable, both to make it more flexible for the user, and easier for us when testing the stable/testing/devel versions without mixing things up.
Changes to the UPDATE command in golismero.py to download the database files after updating the code. A trivial implementation to get something running quickly should be easy, it's only a few lines with urllib2 (see the second sketch after this list). The tricky part is making the download secure - using SSL certificates and checking them correctly, handling errors gracefully, retrying interrupted downloads efficiently, dealing with shitty HTTP proxies that cache too much, etc. Since the update functionality is rather crucial, this one is going to take a while - not so much time spent coding, but testing in a variety of environments.
Each individual database would require a new class, deriving from the base class (which does most of the work). The new class would only contain the DAO methods and the queries; like I said earlier, all the work of finding the database file, opening it only when the first query is made, keeping it open only as long as needed, managing transactions, etc. would be done by the base class transparently (see the sketch after this list).
Really simple databases, like the MAC address database, could simply imitate a Python dictionary, for example.
Plugins would be able to create database classes like this too. For example the OpenVAS plugin would define its own class for the database access. That way everything stays encapsulated. (Maybe for the update mechanism we'll want plugins to provide their own download locations too. We'll see how to do that when we design the plugin repository).
Important: these databases should be strictly optional. Not all users will want to download a lot of megs of database files. This again connects with the idea of a plugin repository - maybe we'll want to think of individual databases as some sort of packages the user can install or uninstall. But for a first implementation, this only means that if the database files are removed, nothing should break, and GoLismero shouldn't be obnoxious, try to "help" you, and download them again (users would say "hey, I deleted that for a reason, dammit!").
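To make the class design above more concrete, here's roughly what I'm picturing. All the class names, file names, tables and queries below are invented for the example, and the real base class would of course reuse the transaction handling code we already have in common.py:

```python
import os.path
import sqlite3

DATA_FOLDER = "data"

def find_database(filename):
    """Look up a database file in /data.
    Returns None when it's missing, so optional databases
    can be silently skipped instead of breaking anything."""
    pathname = os.path.join(DATA_FOLDER, filename)
    if os.path.isfile(pathname):
        return pathname
    return None

class BaseDB(object):
    """Finds the database file and opens it lazily, on the
    first query. Subclasses only define the DAO methods."""

    FILENAME = None   # subclasses set this

    def __init__(self):
        self.__db = None

    def _query(self, sql, params=()):
        if self.__db is None:
            pathname = find_database(self.FILENAME)
            if pathname is None:
                raise RuntimeError(
                    "Optional database %r is not installed" % self.FILENAME)
            self.__db = sqlite3.connect(pathname)
        cursor = self.__db.cursor()
        try:
            return cursor.execute(sql, params).fetchall()
        finally:
            cursor.close()

class OpenVASDB(BaseDB):
    """Example DAO: only the queries live here."""

    FILENAME = "openvas.db"

    def get_vulnerability(self, plugin_oid):
        rows = self._query(
            "SELECT vuln_type FROM openvas_map WHERE oid = ?",
            (plugin_oid,))
        return rows[0][0] if rows else None

class MacVendorDB(BaseDB):
    """Example of a really simple database imitating a dictionary."""

    FILENAME = "mac_vendors.db"

    def __getitem__(self, oui):
        rows = self._query(
            "SELECT vendor FROM oui WHERE prefix = ?", (oui,))
        if not rows:
            raise KeyError(oui)
        return rows[0][0]
```

Usage would then be as simple as OpenVASDB().get_vulnerability(oid) or MacVendorDB()["001A2B"], with the file lookup and lazy opening hidden away from the plugin code.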
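And the trivial first cut of the download for the UPDATE command could really be just a few lines. The URL here is made up, and note this sketch has none of the SSL validation, error handling or retries I mentioned above, which is where the real work is:

```python
import os.path
import urllib2

def download_database(filename):
    # Hypothetical download location; this would really
    # come from a configurable setting, as said above.
    url = "https://example.com/golismero/data/" + filename
    remote = urllib2.urlopen(url)
    try:
        with open(os.path.join("data", filename), "wb") as local:
            local.write(remote.read())
    finally:
        remote.close()
```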
I'm happy with all of the above except for the part about using SQLite3. It seems to me like the easiest choice given our existing code base, but there may be better ways to do it. (Now's a good chance to invest time in learning the benefits of CodernityDB, I guess!)
So, there's the idea. Discuss. :)
(@cr0hn after reading through this whole brick you're learning English FOR SURE, you'll be off to Harvard and everything, kid xDDD)