Figure out Most Cost Effective Way to Retain Data > 30 Days

keenanjohnson commented 2 years ago

Currently, Ribbit Network uses the free tier of the InfluxDB Cloud service. This has the limitation that we can only retain data for 30 days, which obviously should be improved for the usefulness of the network.

keenanjohnson commented 2 years ago

@spestana had started some thinking on this in Discord:

"I am aware of some organizations that will host "research data" for free (up to some limit) in my field (hydrology/snow/ice research), but I'm not familiar with anything similar for air quality or climate, though I'm sure there are similar organizations out there. I emailed some of the air quality and atmospheric science folks here at UW asking about that and they said they're going to ask around about it. Then there are organizations like https://www.opensciencedatacloud.org/ which you can apply to in order to host your project's data. Alternatively, I know that Amazon/Microsoft/Google have all provided cloud compute and storage for various science projects (either free or lower cost) when special agreements were made. I tried to estimate data volumes needed, and it looks like ~0.3 MB/day/sensor, so ~110 MB per sensor per year, which isn't really all that much."
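For reference, the estimate in the quote can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the 0.3 MB/day/sensor figure comes from the quote, while the fleet sizes in the loop are hypothetical values picked for illustration.

```python
# Back-of-the-envelope storage estimate based on the figure quoted above.
MB_PER_SENSOR_PER_DAY = 0.3  # from the quote; not independently measured


def yearly_storage_mb(num_sensors: int, days: int = 365) -> float:
    """Approximate storage in MB for `num_sensors` over `days` days."""
    return MB_PER_SENSOR_PER_DAY * days * num_sensors


for n in (1, 10, 100):  # hypothetical fleet sizes
    print(f"{n:>3} sensors: ~{yearly_storage_mb(n):,.1f} MB/year")
```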

abhinavtripathy commented 2 years ago

@keenanjohnson I was wondering if Firebase would be a viable option. It allows 1 GB of database storage and plenty of reads and writes. At some point we will have to figure out how to archive data, since the volume can grow fairly quickly with a lot of sensors. For now our needs could fit within Firebase, and we can build functions to archive data before it grows beyond that threshold, perhaps storing it as CSV or whatever format we think is necessary. I do think the archive should be in an easily accessible format, though. https://firebase.google.com/pricing
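The archiving idea could be sketched roughly like this. It is a minimal illustration in plain Python rather than Firebase code: the `(sensor_id, timestamp, co2_ppm)` row layout, the column names, and the `archive_old_rows` helper are all assumptions made for the example.

```python
import csv
import io
from datetime import datetime, timezone


def archive_old_rows(rows, cutoff, out):
    """Write rows older than `cutoff` to `out` as CSV and return the rows to keep.

    `rows` is an iterable of (sensor_id, timestamp, co2_ppm) tuples
    (a hypothetical layout chosen for this sketch).
    """
    writer = csv.writer(out)
    writer.writerow(["sensor_id", "timestamp", "co2_ppm"])
    keep = []
    for sensor_id, ts, value in rows:
        if ts < cutoff:
            writer.writerow([sensor_id, ts.isoformat(), value])
        else:
            keep.append((sensor_id, ts, value))
    return keep


# Example with made-up readings: one older than the cutoff, one newer.
rows = [
    ("mock-frog-1", datetime(2022, 5, 1, tzinfo=timezone.utc), 415.2),
    ("mock-frog-1", datetime(2022, 6, 20, tzinfo=timezone.utc), 418.9),
]
cutoff = datetime(2022, 6, 1, tzinfo=timezone.utc)
archive = io.StringIO()  # a real archive would be a file or object store
remaining = archive_old_rows(rows, cutoff, archive)
print(archive.getvalue())
```

Writing to any file-like object keeps the sketch independent of where the archive ends up.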

keenanjohnson commented 2 years ago

It's possible, and I don't think anyone has investigated Firebase! I think what we should do is compare all the available options, including Firebase, InfluxDB, etc., with pros and cons. Is that something you can take the lead on, @abhinavtripathy?

abhinavtripathy commented 2 years ago

Happy to take lead on this!

keenanjohnson commented 2 years ago

Sounds good! @spestana had been looking into this a little bit as well, so make sure to keep this issue up to date and stay in sync with each other :).

@spestana, if you wanted to take the lead as well, I'm sure we can find other work for @abhinavtripathy, so just let us know :).

spestana commented 2 years ago

Go for it @abhinavtripathy !

There were a couple of research-oriented data stores I'd wanted to look into (Princeton has a guide here, and Mendeley); otherwise I was thinking of the open data services of the big cloud providers too (Google, AWS, Azure).

abhinavtripathy commented 2 years ago

Thank you @spestana @keenanjohnson, I will start working on this tomorrow, see what works best, and come up with a report.

SambhaviPD commented 2 years ago

@abhinavtripathy - I recently joined this repo and started to look at the open issues. Just wanted to know whether there are any updates on the comparison between databases. If you need an extra hand, I'm ready to jump in.

abhinavtripathy commented 2 years ago

@SambhaviPD all help is welcome on this! I am currently tackling some other issues, but it would be great if you could do further research. We were exploring whether InfluxDB can give us a free membership or free data storage as a non-profit, but I think we should have other options on hand! Please do the research and add your comments/thoughts on this thread! Thank you! Also feel free to reach out to me if you get stuck or need help! I am available here or via Discord (abhinavtripathy).

SambhaviPD commented 2 years ago

Sure @abhinavtripathy, I shall start to take a look at this one. I'll share my findings once they take shape.

SambhaviPD commented 2 years ago

@abhinavtripathy, @keenanjohnson - I need some quick help in understanding a few things. I started to look at various options, and have been capturing notes on and off in my Notion page.

I shortlisted a few open source time series databases based on quick reading, and then started to look at them from our use case point of view.

Where I need some help is this: how is the sensor data written to InfluxDB today? I have been reading a few general articles to understand how these things work. FYI, I have not gone through the frog sensor code, since I initially thought this issue was logically independent, and I was unsure how much of the hardware code I would understand :). On second thought, should I be doing that first? I thought I'd check with you first.

Also, it would help if you could give some feedback on the ones I shortlisted, in terms of whether they are at least fit to explore further. I wouldn't want to spend time going in the wrong direction. Thank you!

keenanjohnson commented 2 years ago

Hey @SambhaviPD! We use the official Python client for InfluxDB. You can see the Python source code of the sensor here: https://github.com/Ribbit-Network/ribbit-network-frog-sensor/blob/main/software/co2/co2.py

There is definitely a goal to abstract that via an API later: #64, but that's not implemented yet.

Thanks for your help! Keep the questions coming! I think your Notion page is a good start! Perhaps you could move it into a GitHub discussion thread or into this issue so others can contribute?

One important aspect is which tool or service will be the most cost efficient for us to retain and use that data, so understanding things like cost per read/write, cost of storage, etc. is critical as you continue your exploration.

We're a non-profit, open-source project, so spending money as frugally as we can on tools like this is quite important.

SambhaviPD commented 2 years ago

Hey @keenanjohnson, thanks for the reply. Sure, I shall take a look at co2.py.

And yes, I'm mainly looking at open source options only. I completely understand and accept your point on being frugal wherever we can.

I changed the properties of my Notion page to allow editing and comments, so now everyone can contribute. Is that sufficient, or is there another option that would make it easier for everyone to see and contribute?

I shall continue my exploration of databases, and update in a couple of days.

SambhaviPD commented 2 years ago

@keenanjohnson, I went through co2.py. Now I'm pretty clear about the part of the Ribbit flow that writes to the database.

FYI, I did some more reading on two more DBs on my list, and updated my Notion page.

I have yet to cover two more DBs before I share my final suggestion.

One question: since I do not have the sensor, if I go ahead and change the logic in co2.py to use another DB instead of Influx (after I complete my research), how do I test? Thoughts/Suggestions?

keenanjohnson commented 2 years ago

Hey @SambhaviPD! Thanks for the research! Probably the most effective way to test would be to create a "mock" sensor in software that just allows you to test the DB connections.
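A mock sensor along those lines might look something like the sketch below. Everything in it is hypothetical and for illustration only: the `Reading` fields, the `InMemoryWriter` stand-in, and the synthetic CO2 values are assumptions, not the structure used in co2.py. The idea is that a real database client could be wrapped in the same `write()` interface and swapped in.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Iterator, List, Protocol


@dataclass
class Reading:
    sensor_id: str
    timestamp: datetime
    co2_ppm: float  # hypothetical field name


class Writer(Protocol):
    def write(self, reading: Reading) -> None: ...


class InMemoryWriter:
    """Stand-in for a real database client; it only collects readings."""

    def __init__(self) -> None:
        self.rows: List[Reading] = []

    def write(self, reading: Reading) -> None:
        self.rows.append(reading)


def mock_sensor(sensor_id: str, start: datetime, interval: timedelta,
                count: int, seed: int = 0) -> Iterator[Reading]:
    """Yield `count` synthetic readings spaced `interval` apart."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    for i in range(count):
        value = round(rng.gauss(420.0, 5.0), 1)  # made-up CO2 values
        yield Reading(sensor_id, start + i * interval, value)


def run(writer: Writer, readings: Iterable[Reading]) -> int:
    """Push every reading through the writer and return how many were sent."""
    sent = 0
    for reading in readings:
        writer.write(reading)
        sent += 1
    return sent


writer = InMemoryWriter()
start = datetime(2022, 8, 1, tzinfo=timezone.utc)
sent = run(writer, mock_sensor("mock-frog-1", start, timedelta(minutes=5), 12))
print(f"wrote {sent} mock readings")
```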

I don't see any considerations for cost yet in your notion page. Is that something you still plan on considering?

Thanks!

fosteman commented 2 years ago

> Hey @SambhaviPD ! We use the official Python client for Influx DB. You can see the python source code of the Sensor here: https://github.com/Ribbit-Network/ribbit-network-frog-sensor/blob/main/software/co2/co2.py

Should we consider non-open-source solutions that are practically free? Like https://firebase.google.com/docs/database/

@abhinavtripathy

> I was wondering if firebase would be a viable option

Firestore does not have a very well-developed read query syntax; it's very simplistic. They also charge per transaction, so every read or write is billed (after a certain point; the pricing is very generous for startups and small projects like ours).

However, the Real-time DB offered by Firebase is something to look into.

fosteman commented 2 years ago

@spestana ,

> otherwise I was thinking of the open data services of big cloud providers too ([google](https://cloud.google.com/life-sciences/docs/resources/public-datasets), aws, azure).

Yeah, GCP offers $300 of credits for new accounts. We can make use of them.

SambhaviPD commented 2 years ago

@keenanjohnson - OK, I shall try to create a mock in software to test the DB connection part. On cost, the databases I'm researching are all open source, so there is no cost for the database itself.

Hence the criteria I'm looking at are things like an easy learning curve, good community support, easy deployment, a good user base, and not being overkill. I have 3 more DBs to cover in my list; once done, I shall share my recommendation. We just need to do the math on the rate at which data will grow, to make sure the VM/droplet that's there in Heroku will be able to hold it. I presume cost will only start to come from that end once data starts to grow.

@fosteman - Exactly, that's what I'm looking at. Here is my notion page where I'm capturing notes on the databases that I'm exploring. Shall keep this thread updated with my findings soon.

fosteman commented 2 years ago

@SambhaviPD ,

I've created a serverless Firebase function to serve the new frontend here =>

If you were to mock anything up, feel free to use that Node.js code: https://github.com/Ribbit-Network/ribbit-network-dashboard/pull/94/files#diff-e6a35fcf3c90f75b27c14eb559ddcf8cd259b4e1a2cf49f73d3ed9f91229fccd

fosteman commented 2 years ago

Left column - free: https://firebase.google.com/pricing?authuser=0&hl=en

SambhaviPD commented 2 years ago

@keenanjohnson, @abhinavtripathy, @fosteman, @spestana - From my reading on all 6 databases (see my notes here), IMHO we can go ahead with either QuestDB or GridDB, as they seem to fit the bill pretty well.

Both are open source, and the community edition is what we can go with. Throughput depends solely on the server where the DB is installed: number of cores, memory, and so on. In terms of storage, depending on how soon the node gets full, we may have to think of storage options like S3 buckets, DigitalOcean Spaces (S3-compatible), or similar products from other cloud providers.

I shall use my Heroku account (the free one), go ahead and install QuestDB, and see how it goes. Or, as @fosteman suggests, we can try GCP too; there is a free tier, and the configuration is quite similar to or slightly better than Heroku's. A word of caution: navigating the GCP portal itself is not too straightforward :)

Thoughts/Suggestions/Ideas?

fosteman commented 2 years ago

@SambhaviPD , good stuff.

I would actually suggest going with GridDB; it's NoSQL and has decent Node.js and Python libraries. NoSQL tends to be easier to maintain in the long run. SQL constantly requires migrations and query modifications every time you want to add a column, which is a lot of code to maintain.

Go ahead with your plan using Heroku!

SambhaviPD commented 2 years ago

@abhinavtripathy, @keenanjohnson, @fosteman - quick update: I went ahead and installed GridDB on a GCP Compute Engine VM (the free micro one), as I don't think Heroku supports either QuestDB or GridDB as an add-on data store. I will keep this thread updated as I proceed further.

fosteman commented 2 years ago

@SambhaviPD, awesome news! We need to get access to Ribbit's GCP account, then.

abhinavtripathy commented 2 years ago

Great work @fosteman and @SambhaviPD! Sorry for the late reply, not sure how I missed this thread! This all seems like great work, and we seem to be heading in the right direction! Thank you all!

keenanjohnson commented 2 years ago

Yes thanks for all the hard work folks! Ribbit doesn't currently have a GCP account as we haven't run anything there before.

Sounds like the consensus here is GridDB, but before we make a final call and start writing code, can someone do a quick pricing comparison of GridDB vs. InfluxDB? They seem very similar to me in that they are both open-source time series databases.

Perhaps pick a reasonable medium-term goal of 100 sensors writing at 5-minute intervals for 6 months, and work out which is the more cost-effective option, so that we can ensure we are spending Ribbit Network's small funds in the most responsible way.
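As a rough sizing aid for that scenario, the number of data points can be counted directly. This sketch assumes "6 months" is about 182 days, which is an approximation; it says nothing about bytes per point, since that depends on the database and schema.

```python
# Count the writes for the scenario above.
SENSORS = 100
INTERVAL_MINUTES = 5
DAYS = 182  # assumption: "6 months" taken as roughly 182 days

points_per_sensor_per_day = (24 * 60) // INTERVAL_MINUTES
total_points = SENSORS * points_per_sensor_per_day * DAYS

print(f"{points_per_sensor_per_day} points per sensor per day")
print(f"{total_points:,} points in total")
```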

Thank you all!

SambhaviPD commented 2 years ago

@keenanjohnson - For the GCP account, I'll use mine for now, as I already have one. No problem.

When you say pricing comparison, I assume you are referring to the cost of the cloud server where the data is stored, and not the database per se, is that right?

Let me simulate a 5-minute interval for 6 months for 100 sensors in GridDB, and see how much space it occupies. That will give a fair idea of how much space it will take when we retain data for a longer duration. Then we can decide on the time period for which we'd want to retain data on the main server, and probably think of moving the rest to an S3 bucket or an equivalent (if that's required).

keenanjohnson commented 2 years ago

For the cost, I mean an estimate of the full cost required to deploy that solution for Ribbit Network at the scale I mentioned. What that pricing entails is solution dependent. For example, some hosted services like Firebase have per-write costs, whereas an open source tool deployed to a server would not.
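To make the two pricing shapes concrete, here is a minimal sketch of how such a comparison could be set up. All prices below are placeholders, not real quotes from any provider, and the function names are hypothetical. It also simplifies by treating the stored volume as constant over the period.

```python
def hosted_cost(writes: int, price_per_million_writes: float,
                storage_gb: float, price_per_gb_month: float,
                months: int) -> float:
    """Usage-based service: pay per write plus per GB stored per month."""
    write_cost = writes / 1_000_000 * price_per_million_writes
    storage_cost = storage_gb * price_per_gb_month * months
    return write_cost + storage_cost


def self_hosted_cost(server_price_per_month: float, storage_gb: float,
                     disk_price_per_gb_month: float, months: int) -> float:
    """Open source DB on a rented server: flat server fee plus disk."""
    monthly = server_price_per_month + storage_gb * disk_price_per_gb_month
    return monthly * months


# PLACEHOLDER numbers only; substitute real quotes before drawing conclusions.
hosted = hosted_cost(writes=5_000_000, price_per_million_writes=1.0,
                     storage_gb=2.0, price_per_gb_month=0.5, months=6)
self_hosted = self_hosted_cost(server_price_per_month=10.0, storage_gb=2.0,
                               disk_price_per_gb_month=0.1, months=6)
print(f"hosted: {hosted:.2f}, self-hosted: {self_hosted:.2f}")
```

The point of the sketch is only that the hosted total scales with write volume while the self-hosted total scales with server size and time.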

Does that make sense?

SambhaviPD commented 2 years ago

Yep, got it. In Ribbit's dashboard solution, the only cost we'll incur would be the VM where the solution + DB will be deployed. Even for the VM, we can first see how much the free tier can handle.

Right now, it's on Heroku (free tier) + InfluxDB (free tier).

Let me come back with some metrics this week, once I complete a sample IoT simulation for the time period we discussed earlier.

grayjones commented 1 year ago

We have upgraded to a paid InfluxDB account, which allows us to retain data indefinitely. I will close this and monitor InfluxDB costs going forward. We are retaining data at the lowest grain, so if costs become an issue we can retain aggregate data instead.
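Retaining aggregate data could look something like the following. This is a minimal sketch in plain Python, not the project's implementation: the `(sensor_id, timestamp, co2_ppm)` tuple layout and the hourly window are assumptions, and in practice the aggregation would more likely run inside the database itself.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean


def downsample_hourly(readings):
    """Average readings per sensor per hour.

    `readings` is an iterable of (sensor_id, timestamp, co2_ppm) tuples
    (a hypothetical layout for this sketch). Returns a dict keyed by
    (sensor_id, hour_start) with the mean value for that hour.
    """
    buckets = defaultdict(list)
    for sensor_id, ts, value in readings:
        hour_start = ts.replace(minute=0, second=0, microsecond=0)
        buckets[(sensor_id, hour_start)].append(value)
    return {key: mean(values) for key, values in sorted(buckets.items())}


# Made-up raw readings at the finest grain.
raw = [
    ("mock-frog-1", datetime(2022, 8, 1, 10, 0, tzinfo=timezone.utc), 400.0),
    ("mock-frog-1", datetime(2022, 8, 1, 10, 5, tzinfo=timezone.utc), 410.0),
    ("mock-frog-1", datetime(2022, 8, 1, 11, 0, tzinfo=timezone.utc), 420.0),
]
hourly = downsample_hourly(raw)
for (sensor_id, hour_start), avg in hourly.items():
    print(sensor_id, hour_start.isoformat(), avg)
```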

Ribbit-Network / ribbit-network-dashboard

Figure out Most Cost Effective Way to Retain Data > 30 Days #81