Database backend - Githubissues

grillazz commented 7 years ago

@zen , @lechup , @ajakubo1 Please let me know which database backed you prefer. I want to add it as separate docker container ?

lechup commented 7 years ago

I think we should go with Postgis. We will have some data binded to coordinates and probably would want to add sort of nearest_sensor endpoint?

And yes definately it should go to second service in our docker-compose file.

PS: I've successfully tested https://hub.docker.com/r/mdillon/postgis/

grillazz commented 7 years ago

I also put my vote for postgress with postgis but would like to know what thinking all parties.

ajakubo1 commented 7 years ago

oh, neat!

I add my vote to it :).

Is there a postgres plugin, which handles time-series data efficiently, or does postgres in itself does it properly? I presume we'll have more problems with time-series data, then geolocation data.

lechup commented 7 years ago

@ajakubo1 good point, I've started looking around in google on the topic and found: http://stackoverflow.com/a/4884384/479931

Which basically means that postgis is enought - for now.

I would postpone any furher optimization for time when we encounter bottlenecks.

@zen do we have idea how long we would like to store data and in what resolution?

PS: I do not have any experience with time series data. @grillazz ? Any ideas?

ajakubo1 commented 7 years ago

I did suggest influxdb for storing time-data :). But I do agree it's an overkill at this stage. I'm also pretty sure that postgres should be enough.

For data resolution - in case we want to have additional tables where we put data aggregates for larger time periods, I would suggest to do it with: minimum, maximum, mean and median - this should be enough to extract (and draw) all of the crucial information.

zen commented 7 years ago

Considering that we care a lot about simplicity of installation, I would avoid even PGSQL if possible, sticking to MySQL for individual installations. I don't want people to struggle with PGSQL administration. If we are talking about public instance, we could use PGSQL and Influx for time series data. Although I'm not sure if this is easily implementable to have a choice.

For time series data we need to keep couple of years of data. All official metrics are using day averages, and we will need this for calibration as well. This would suggest that 24h average would be enough. But in the future I would also like to introduce condition prediction based on historical data. Those predictions will need atmospheric data and air quality data with higher resolution. I think 1h average would be ok, assuming that default reporting will be done every 5 minutes.

So, to sum up, we would need:

real data for last 30 days
1h average for historical data

Does this make sense?

lechup commented 7 years ago

I would suggest sticking to one DB engine (I know that Django has an ORM but from my experience if we decide to support more than one engine we end up with conditional ORM code in some edge cases).

@zen what do You mean by I don't want people to struggle with PGSQL administration. exactly?

My idea how to tackle it is to prepare ansible provisioning scripts for one or more hosting service + step by step how to.

My pick would be DigitalOcean as a default but I'm open.

Regarding resolution and splitting data into historical and real - I think we need to redesign some models @grillazz ?

My first idea is to create background job (celery?) that would be run each hour and move real data into historical bucket but maybe there is different and more efficient way to do this? @grillazz @ajakubo1 ?

zen commented 7 years ago

So, just to explain some assumptions. We will provide global site for everyone to use (air-monitor.org) but also a choice to install entire system on their own. I would prefer all sensor users to report data to central system (so more people can benefit from it). I think we can encourage people to do so if we provide very flexible options to maintain their own, customised project backends, even under their own domains. By project I understand local, anti-smog group, that can run local events, promotion, educational stuff, etc. This way we can limit separate installations thus not to care about database administration and provisioning scripts.

Re celery job - this is how I usually do it, I prefer it over stored procedures.

grillazz commented 7 years ago

With MySQL I can see only one issue. It is no longer open source once together with sun was bought by Oracle. As first step Oracle removed Uni tests from MySQL source code. Also author of MySQL is no longer supporting this project and started with MariaDB which is open source copy of MySQL. For other side administrative impact onto PostgreSQL is qual to zero for this kind of project.

zen commented 7 years ago

We still have Galera and other forks. Actually, I stopped using pure MySQL in favor of Galera couple of yeas ago. I think it should be fine, considering 100% compatibility

zen commented 7 years ago

But I'm fine for PGSQL if that's your preference

grillazz commented 7 years ago

Yup Im voting for PostgreSQL

lechup commented 7 years ago

Just for keeping things clear, we had similar discussion regarding PGSQL/self hosting solution on Frontend repo, @grillazz make sure You are aware of this idea:

https://github.com/EnviroMonitor/EnviroMonitorFrontend/issues/5#issuecomment-269958543

And You are fine with that.

lechup commented 7 years ago

In #23 I've introduced postgis.

lechup commented 7 years ago

Closed by #23

EnviroMonitor / EnviroMonitorWeb

Database backend #15