Yet Another Debate: Should we use a database instead of flat-files?

ArtOfCode- commented 7 years ago

This could apply to either Classic or NG, so I'm leaving it here.

Context: Smokey currently saves most of its state data in local pickle files:

$ ls *.p *.pickle
apiCalls.pickle     bodyfetcherMaxIds.p        falsePositives.p  notifications.p     whyData.p
autoIgnoredPosts.p  bodyfetcherQueue.p         ignoredPosts.p    whitelistedUsers.p
blacklistedUsers.p  bodyfetcherQueueTimings.p  latestMessages.p  whyDataAllspam.p

Proposal: Should we use a database for storing all (or most) of this state data, or should we stick to flat files? Why/not? What do we gain/lose?

(If you were there for the Great Database Debate, assume this is SQLite, because of lack of install/maintenance.)

ArtOfCode- commented 7 years ago

I personally am pro-database.

We're assuming SQLite, which comes bundled with Python and has a good ORM available for it. No extra maintenance is involved.
We gain the ability to modify things like privilege levels and blacklists on-the-fly, without code edits, and without restarting. Those kind of things shouldn't technically require code edits, because they're not technically code, so that's a win.
This is exactly the kind of thing databases are intended for - we have a bunch of well-structured, serializable and normalizable data, which is a perfect candidate for storing in a database.
We have less 'clutter' between code files and data files in a running Smokey's directory - we have one database file that contains all the data, and everything else is code or configuration.

NobodyNada commented 7 years ago

TL;DR Considering the amount of data we're storing and the structure of the data, I'd say yes.

As a developer about FireAlarm, I've thought about this before, but I haven't really given it much consideration.

FireAlarm needs to store:

A list of all the users it knows about, including their user IDs, notification options, and privileges.
A list of all reported post IDs and the threshold for each one.
The probabilities for the naive Bayes filter.

Right now, FireAlarm keeps everything in memory, periodically saving to the files. This was convenient to code, but it has the following disadvantages:

I have to worry about saving and loading the files periodically, and handling errors when it fails (and sometimes the errors are expected because the files have not been created yet).
Lack of structure. JSON doesn't have any sort of "schema," so I have to validate everything when loading it.
It's painful to change the format. I either have to make new code able to read older formats, manually migrate the files, or just drop everything and start over.
It's slow. This doesn't matter for the user or report information, or on a powerful computer, but the filter takes over 30 seconds to load on the Raspberry Pi.

A database solves these problems:

The database will take care of saving and loading, and error-handling should be easier.
The database is highly structured. It's not possible to store anything which doesn't fit the schema.
I can use a simple migration system to change the format, like Rails does, because of the database's efficiency, capabilities, and structure.
It won't be slow anymore. Databases are magic.

Those points are also valid for SmokeDetector, but on a larger scale. Smokey needs to store a lot more data, and everything which is difficult with FireAlarm seems much more difficult on Smokey's scale.

quartata commented 7 years ago

I'm in favor of this, but it probably wouldn't hurt to attempt this on NG first and then try to move it over to classic SmokeDetector.

Undo1 commented 7 years ago

The main thing is synchronization between instances for things like privileges. Might be hard to come up with a way to pull that off without just committing them to GitHub.

ArtOfCode- commented 7 years ago

metasmoke could support synchronization with some additions to the API and a new model - I came up with an outline of how that would work earlier, and it wouldn't be that difficult

Undo1 commented 7 years ago

Just needs to be tolerant of when metasmoke disappears because its negligent owner is gone for eight hours. You'll probably also need to handle conflicts in some way; if metasmoke disappears and you have instances switch during that time you'll have diverging changes.

NobodyNada commented 7 years ago

I know you're probably sick of me suggesting this...but would Redunda work for that? IMO it'd work well for the existing pickles, and it may be good enough for privileges too. It should work just fine if we just sync the sqlite database file...shouldn't it?

quartata commented 7 years ago

I'd rather not synchronize the whole giant file.

Undo1 commented 7 years ago

Same set of issues to address as outlined above - it can disappear (and will at the same times metasmoke does), and then you can have conflicts.

quartata commented 7 years ago

@Undo1 I believe if Metasmoke disappeared the easiest solution would be just to pull Metasmoke's master copy when it comes back.

ArtOfCode- commented 7 years ago

We can design to be tolerant of downtime by caching on disk (which a database does anyway - obviously, it's a local database), and if metasmoke is down when we want to switch instances... shrug. It ain't going to be that common.

Undo1 commented 7 years ago

Yeah, you just have to make a conscious decision to always clobber local changes with metasmoke changes. Not sure how to do that without syncing the whole database file (which shouldn't be that big, really. S3/Paperclip is awesome for that kind of stuff).

quartata commented 7 years ago

Syncing the whole database file in the exceptional case would not be a big deal.

Undo1 commented 7 years ago

We'd need a way to detect that exceptional case. May require building a whole change tracking system, or possibly could be handled with a couple of timestamps.

quartata commented 7 years ago

Surely Smokey would know when MS went down; the sockets would close.

ArtOfCode- commented 7 years ago

I get the feeling we're bikeshedding here. I don't see any major objections to using a database so far that can't be solved in the implementation?

NobodyNada commented 7 years ago

I'm not sure that conflicts are a huge problem; if MS goes down, we're back to what we have now as far as synchronization.

Undo1 commented 7 years ago

That's true. I don't have any major objections, presuming it'll be painted orange.

ArtOfCode- commented 7 years ago

Orange? But surely dark teal is the best colour it could be?

angussidney commented 7 years ago

SQLite sounds like a good idea. No additional setup required which means that there won't be any difficulty setting it up on Windows etc for testing.

As mentioned above however, we can't rely on MS to be up at all times. We really need to decide which things we need to have up to date on all instances, and which things are simply nice-to-haves. For instance, privileges need to be synced at all time, but notifications are simply nice-to-haves. We may need to use separate database files/separate syncing techniques for these.

And I agree with @Undo1 - it has to be painted orange.

AWegnerGitHub commented 7 years ago

I think we are in agreement that utilizing SQLite to locally cache config data is appropriate.

I disagree with @angussidney on one point: When we build the synchronization, I think ever aspect should synchronize the same way and to the same location. This keeps code management easy and reduces confusion (why did X sync but Y didn't?)

The discussion about how to synchronize and that architecture should take place in another issue.

ArtOfCode- commented 7 years ago

Query: assuming we don't store any sensitive information in this database, is there any reason we can't commit it to the repo on some sort of trigger (like an !!/upload-data command, perhaps)?

NobodyNada commented 7 years ago

I think automatic synchronization is a better option, simply because it's automatic (and we can't forget to sync if it's automatic). With manual synchronization, if one instance dies and MS starts a failover, we'll end up with a conflict.

I also have thought about conflicts with FireAlarm. During normal operation, there won't be two instances running simultaneously for more than 30 seconds, so conflicts aren't an issue. If I'm going to run a development copy simultaneously, I just rename redunda_key.txt to disable synchronization and prevent the prod instance from going to standby.

AWegnerGitHub commented 7 years ago

@ArtOfCode- If we are using the database for configuration, I assume that also means things like "location", whether or not to run in CHQ only mode and other specific configuration options. In that case, we don't want to use the whole DB because those specific fields are for individual instances.

quartata commented 7 years ago

I am against using the database for config. It is not human readable or easily editable nor is the configure something that should be shared across instances.

teward commented 7 years ago

I would rather have a DB for storing all the information that the Pickles hold.

I am, however, against storing the actual SmokeDetector configuration data that the person running Smokey would need to change in a database. Since the location and MS API keys will vary instance to instance it makes sense to only store that data in a per-instance configuration file, with certain hard-coded items being in the DB or in the code itself (such as central 'sync' locations, etc)

AWegnerGitHub commented 7 years ago

I posted this in CHQ, but I'll repost it here:

Everything we are talking about storing is configuration data. Blacklists, patterns, privileges, all of it is telling Smokey how to run. Without it, the code can't do anything. Adding the "configuration" settings that are instance specific should be stored with the rest of the configuration data.

Changing the location of an instance or adding a key shouldn't require a code change. I should be able to open up either the database (or another way of modifying settings), make the change and it work (maybe after a restart or way to refresh settings)

The concern everyone is having is around synchronizing instances. This is valid, but is out one step to far. Focus on a single instance first. Pull as much "configuration" out of the code as you can. The next step - keeping the instances on the same page - needs to happen, but it needs to happen at a "how do we design this to be robust and scalable" level, not a "I want this config value to always be updated this config value to only update every 15 minutes and this one can update whenever" level.

ferrybig commented 7 years ago

We also have to consider corruption of the setting files

With flat files, we can write the updated data to a new file, then rename it the old one, and rename the new file back in the original place, this allows us to always have a guarantee that 1 of those files will be non-corrupt (try to load the original file, and if that fails load the old file).

With databases, we should choose a proper backend that tries to prevent corruption in most cases, and most databases won't use the multiple file approach to handle the possibility of a write being interrupted mid write

ArtOfCode- commented 7 years ago

@teward, @quartata - in reality, how much do those config values change? On a single machine, nothing but standby or CHQ only mode changes, and we can keep the command line arguments for enabling those. Other than that, it's still only two commands to update a config value, and you're not going to be doing that often. I think that's a valid trade off for consistency.

angussidney commented 7 years ago

@ArtOfCode- @AWegnerGitHub I really don't see the point of using a database for config data. It doesn't make sense - databases should store data that is read and updated by machines, not humans.

How many other projects use a database for human-editable configuration? When you want to change the secret token for GH on metasmoke, you don't have to rails console, load the database and edit it - you simply edit the config file. Using a database for config info isn't 'consistent' when it's a weird and annoying thing to do. I think Undo's comments on this pretty much sum up what I think (emphasis mine):

Here are my priorities for most any new change: (1) is this a real solution to a material problem, and (2) is this the simplest way to solve the problem, including factors like resource use and ease of setup. When setting up a Smokey instance now includes SQL anything, that's a whole new level of resource requirements. Both technically and human-ly.

Then what happens if we decide to start syncing databases? Depending on how we design it, that would mean location and other instance-specific data gets shared when we don't want it to, which is bad.

Yes, we should be taking stuff like privileges, blacklists, etc out of the code and into a database. But that doesn't mean that we should be taking stuff that wasn't already in the code - configuration files - and putting that into a database.

ArtOfCode- commented 7 years ago

So I'm gonna go with "undecided" on whether we should put config in the database or not. In one way, it's more consistent across the Smokey project, because everything is in a database. In another, angus is right that it's not the done thing.

@angussidney Andy's screenshot was of his comment-flagging app, which does use a database for config. As we do for autoflagging config on metasmoke, actually - that's the FlagSettings table.

angussidney commented 7 years ago

So we've all decided that we want to use a database for the information which we currently use pickels, as well as blacklists, privileges, etc?

As you say, the only thing remaining undecided is whether or not we use the database for config. I think that should probably discussed separately in another GH issue.

magisch commented 7 years ago

I'm in favor of using a database. Benefits include:

Easy to setup and to port. As @ArtOfCode- mentioned, SQLite comes easily installed and in packages and with good ORM
Uniform way of data storage and access in such a way that can easily be collated and manipulated for graphs and other uses
Databases solve exactly this kind of problem, so I don't see any performance hit.
The entire solution is cleaner and more well-differentiated from the code
It makes it vastly easier to carry blacklists/permissions/etc from instance to instance.

As for the config, I think config should be a file.

AWegnerGitHub commented 7 years ago

How many other projects use a database for human-editable configuration?

I'm really not sure what type of response you are looking for on this. I can point to almost every application I maintain, support or use at work as using a database. I can point to countless vendor applications that use a database. I can point to my own projects that use a database. All of these store both transactional data and configuration data. The idea is that when users or administrators want to change a setting, they do so in a single place.

What is being proposed with the database/"config" split is complicating the solution. Data is now stored in multiple places. Do you go to the flatfiles or the database to solve an issue? Using Undo's quote, I can emphasize the exact same areas and argue that putting everything in the database is much easier.

Ease of set up: Everything is in one spot. It's in the database, and probably in a table named "Settings".
Both technically and human-ly: As a developer, I expect you to be able to read an SQLite database. As a python developer, you already have the tools to do so, since it's built in to the language. If you can't, then we have an issue on the "technically" side of things, not the "human-ly" side. However, if we are concerned that someone can't open the database, we have a couple easy solutions for that too.
- Option 1: Make a setup script that updates the database with appropriate values. Ask the user for a location, MS keys, etc and then put those in the database for them. The user never even needs to open the database and it follows the same installation process of many other applications (which ask what features you want to install, where to install it, etc)
- Option 2: Make a small local dashboard. This adds complexity because it'd require a temporary web server, but it's an option.

Depending on how we design it, that would mean location and other instance-specific data gets shared when we don't want it to, which is bad.

You are arguing against a solution because of a design problem we haven't even made yet.

You're right. We need to design the solution correctly. But, a lack of a design does not mean we shouldn't put all of the configuration data into the database.

quartata commented 7 years ago

Interactive config tools are awful.

Furthermore, the point of everything that's going into the database so far is that it's internal data shared across instances that doesn't need to be edited by a human. Config is literally none of those things.

quartata commented 7 years ago

Additionally, there are no "code changes" requred for config modifications. NG stores config in a JSON file.

quartata commented 7 years ago

(and classic in an INI file)

quartata commented 7 years ago

@ArtOfCode- But it's making it inconsistent. As I pointed out above a config table would be completely different from everything else in the DB and would need special treatment (at the minimum no syncing which is a massive pain for when new instances come online)

AWegnerGitHub commented 7 years ago

Pulling in comments made in CHQ

1.) Configuration is not code. It shouldn't require a change to a file to adjust the location an instance reports it's running in. It shouldn't require a code change to add an MS key. Just like it shouldn't require a change to blacklist, add a user. Those are config values too.

2.) Splitting between a DB and a config file complicates maintenance. Does a new thing go in the config or the database? Why'd you put it there and not the other place?

Everyone is getting hung up on step two in this convert to a database - how to keep everything in sync. That's a valid concern, but it's not a reason to keep things out of the database.

If you are against putting things like location, MS key, CHQ only mode in the database, why do you want to put blacklists and privileges in the database? What is the difference between the two other than one setting returns a single value and the other setting returns a list of values?

[Local vs global and keeping instances in sync], to me, seems to be what everyone is getting stuck on. They like the idea of the transactional data (black lists, privs, etc) in the database because then they can simply copy an SQLite DB around. But...that can't occur if instance specific settings also exist. So, the final design needs to account for both global settings and instance settings.

ArtOfCode- commented 7 years ago

Okay. I'm closing this with a resolution of "yes, use a database", with further implementation details to be worked out. (That doesn't preclude further discussions here, of course, it's just to show that the original question has been answered.)

ArtOfCode- commented 7 years ago

As far as config goes, Andy's convinced me - I think we should use the database. It's trivial to write a script to dump some tables but not others and output to SQL, the results of which can be shared or uploaded to Redunda or Drive or S3 or Dropbox and used to sync, without causing issues with instance settings.

NobodyNada commented 7 years ago

IMO we should not be storing per-instance configuration in the database.

The configuration data isn't really improved by the relational model. If we stored config in the database, we'd most likely end up with a "configuration" table containing only "key" and "value" columns. While a relational database would be able to store that....we don't really gain anything.
The configuration data is going to be edited solely by humans. Sure, we can edit it if it's in a database, but plain text is significantly easier.
Synchronization. Config is per-instance, while the pickles are global. If we store config in the database, we'll have to worry about synchronizing the global data but not the config data. If we only store the global settings, we can simply sync the .sqlite3 file.

While these aren't huge issues, they make our current text storage for per-instance configuration easier to work with. We can use a database for configuration, but I can't see any advantages of doing so. Simple is better, unless there's a reason to make it complex.

ArtOfCode- commented 7 years ago

Okay, re: storing config in database, both sides have excellent arguments. Either we put it to a vote, or someone makes an executive decision, and I'm thinking a vote is gonna garner less animosity.

Thumbs up on this comment for using the database for config. Thumbs down for continuing to use flat files/ini files/json files.

Undo1 commented 7 years ago

After reviewing stuff, I vote to use a database for everything except instance configuration options. @NobodyNada's Synchronization point is good. I think (hope?) we take the simplest path and just sync the whole .sqlite file. Of course it'd be possible to create and maintain tools to manage config in the database, but that's dev time we could use for other things (-ng, perhaps).

Charcoal-SE / SmokeDetector

Yet Another Debate: Should we use a database instead of flat-files? #657