Pirate-Weather / pirateweather

Code and documentation for the Pirate Weather API
Apache License 2.0

Open Source API Code to Allow for Self Hosting #11

goldbattle opened this issue 1 year ago

goldbattle commented 1 year ago

Many thanks for the project and for open sourcing your processing scripts. I tried to dabble a bit myself before finding this project and only got as far as extracting and looking at the data in the GRIB files. I was able to run the Docker image with your included scripts to download the data from S3 onto my machine.

Building

cd pirateweather/wgrib2/
docker build -t wgrib2 -f Dockerfile .
docker image list

Running

docker run --net=host \
    -v "D:\\WEATHER\\:/mnt" \
    -e bucket="noaa-hrrr-bdp-pds" \
    -e download_path="/mnt/data/efs1z" \
    -e temp_path="/mnt/data/tmp" \
    -e time="2022-12-30T15:45:00Z" \
    wgrib2 \
    /mnt/pirateweather/scripts/hrrrh_combined-fargate.py

I have a couple questions.

  1. The timestamp input seems to determine which times are downloaded. How do you normally specify this? The current time?
  2. Do I need to run all of the scripts? Or, since I am in the US, can I just run the HRRR ones, which I believe cover most of the data (you note in the docs that it doesn't seem to have UV, but that is OK for me currently)?
  3. If I need to run the other scripts, can this be done concurrently, or only sequentially?
  4. The example run command puts files inside an efs1z folder. Is this a specific folder name, or does it have some meaning here? Should the other scripts write to a different folder?
  5. I want to run the download on a cron job; could you explain how I should interpret the trigger table? Should I run HRRR every hour to get data, or would every 3 hours suffice?

The one piece of code I am still trying to find is whatever maps a lat/lon onto a query of these files. The docs say a Lambda function does this and describe the process in reasonable detail. Is that code public? If so, could you point me in the right direction? Many thanks!

alexander0042 commented 1 year ago

Hi,

Thanks for checking out this project, and I appreciate your detailed questions here! The "open" aspect of this project is really important to me, so I'm happy to see people digging into the source, but I know this side of things could be much (much!) clearer. I'll try to address things point by point here.

  1. The "time" parameter is designed to be the current time as a string, using the format "%Y-%m-%dT%H:%M:%S%z". This is how AWS reports when the function is run, and the processing script then counts back the number of hours in that table to find the right file.
  2. Nope! I call this as 4 separate step functions, just changing the run command (like you've done!).
  3. The docker image is designed to run one script at a time, but no reason you couldn't have multiple copies of the same image running.
  4. The efs1z is just my internal AWS structure coming out, so you could store it anywhere. If you're curious, the name comes from storage on the EFS file system (which is an incredibly flexible tool to get data to Lambda), set to use 1 zone.
  5. I sort of covered this in the first question, but to clarify: you want to run it on the "Ingest Times (UTC)" row and pass the current time to the container. So to run HRRR-Hourly (hrrrh), you'd set the cron schedule to 2:30, 8:30, 14:30, and 20:30 UTC and pass (using 2:30 as an example) "2023-01-01T02:30:00+0000". There's a short sketch of this after the list.
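To make the time handling concrete, here's a minimal sketch (not the actual ingest code; the 2-hour lag and the forecast-hour in the key are placeholders) of parsing the trigger time passed to the container and stepping back to a model run on the public HRRR bucket:

from datetime import datetime, timedelta

# Trigger time as passed to the container, e.g. from cron at 02:30 UTC.
trigger = datetime.strptime("2023-01-01T02:30:00+0000", "%Y-%m-%dT%H:%M:%S%z")

# Placeholder: step back a fixed lag to the model run that should be
# fully available on the NOAA bucket by the trigger time.
run = trigger - timedelta(hours=2)

# Typical HRRR object key layout on the noaa-hrrr-bdp-pds bucket.
key = f"hrrr.{run:%Y%m%d}/conus/hrrr.t{run:%H}z.wrfsfcf01.grib2"
print(key)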

With respect to the read script, you're right that it's not currently in this repository. There are two issues with it: it's an uncommented mess from me learning Python on the fly while building this, and it relies on a ton of assumptions about Lambda and AWS API Gateway. I think an easier solution is to ask what your ultimate goal is here and approach this from that direction, since with these scripts, all the data will be there. Something along the lines of this notebook (https://github.com/alexander0042/Pirate-Weather-SMSL/blob/main/Pirate_HRRR_SM_Notebook.ipynb) is what I have in mind, since it shows a Python script to extract a point time series from the NetCDF file.
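As a rough illustration of the kind of read I mean (a sketch using xarray against a hypothetical processed NetCDF file with 1-D lat/lon coordinates and a made-up variable name, not the actual Lambda code):

import xarray as xr

# Hypothetical processed output from one of the ingest scripts.
ds = xr.open_dataset("hrrrh_processed.nc")

# Time series for the grid cell nearest a point of interest.
point = ds.sel(latitude=45.52, longitude=-122.68, method="nearest")
print(point["TMP_2maboveground"].to_series().head())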

goldbattle commented 1 year ago

Thanks for the response! I will revisit your answers when I get time on the weekend, but wanted to respond to your question about the processing / query scripts.

I think what I am looking for is just a function that takes a lat/lon and returns the JSON structure with all the info filled out. I am not sure how easy it would be to have this repo and your production setup share code, but that could separate the platform-specific code from the query code.

Ideally, I want to create a small server that just calls this function so I can run everything on my local network, or do further processing. This is, of course, where I'd be interested in contributing back. Something like the stub below is roughly what I have in mind.
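(Purely illustrative; the function name and fields are made up, and the response shape only loosely follows the Dark Sky-style JSON the API returns.)

def forecast(lat: float, lon: float) -> dict:
    """Return a forecast dict for a point, reading locally ingested
    model files instead of calling the hosted API."""
    return {
        "latitude": lat,
        "longitude": lon,
        "currently": {},          # filled from the latest model run
        "hourly": {"data": []},   # hourly time series
        "daily": {"data": []},    # daily aggregates
    }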

SoulRaven commented 1 year ago

+1, the project is interesting and I will give it a spin and integrate it into an open-source project with the API written in Python, as a ready-to-go solution alongside my roundbox project. It is also a work in progress, but the idea is to integrate things as quickly as possible and have them ready to deploy. Can you share more about what you have written in the backend for the API?

github-actions[bot] commented 1 year ago

There has been no activity on this issue for ninety days and unless you comment on the issue it will automatically close in seven days.

goldbattle commented 1 year ago

Please leave this open, as the scripts that map the API onto the raw data are still not included in the open-sourced code.

alexander0042 commented 1 year ago

Happy to leave this open for now, and it is still on the roadmap; however, the issue remains that everything is very tightly integrated with AWS/Lambda at the moment, so it isn't usable outside of my specific environment. In order to speed up response times, I'm eventually migrating this to Docker, so it's very doable down the line! I'll also caution that the processing scripts download ~100 GB/day, so self hosting will require a pretty beefy internet connection.

fox91 commented 1 year ago

Self hosting on AWS is always an option 😉 Please release it as is; we don't mind if it isn't optimized or if we can't run it with one click. Open source doesn't mean "runs easily on your device with your custom config"...

lordbagel42 commented 1 year ago

That is exactly my thinking. I dislike subscriptions because my internet isn't the most stable; I would much rather donate some money and then run the server on my own hardware, so that if I want to make 50,000 API requests per month, I can. I personally want it for Home Assistant.

msft-jeelpatel commented 9 months ago

Hi, is there any detailed guide on how to self host this and run it on your own machine?

alexander0042 commented 6 months ago

Posting this here since I think it fits with this discussion: I'm looking into which license I should use for the open-source code. Currently, everything is licensed under Apache 2.0; however, since the V2.0 code is pretty much all new, there's an opportunity to take another look at this. My goal is to make it possible to self host and run the entire stack (which will require a pretty beefy computing setup, but is within the realm of possibility), while also avoiding what happened to Redis, where a provider comes along and replicates everything without contributing back to keep improving the project. Along these lines, I'm debating releasing V2 under the AGPL, and I'm curious what people think about this.

I know it's a pretty restrictive license; however, the current status quo is not having the source public at all, which certainly isn't ideal either! The flip side is that I think I'll have to add a contributor license agreement to make commercial use of the project possible with permission. Again, definitely not ideal, but in order for the free version of this to keep running, the AWS bill has to be paid somehow, so this seems like the way. I'm envisioning a MinIO sort of structure: not ideal, but a practical way to make this open while keeping the lights on for the project.

lordbagel42 commented 6 months ago

Personally, I’m a member of SlimeVR. We dual-license under Apache and MIT.

I dislike the GPL licenses for how "poisonous" they are. However, it's better than nothing, and I will support the project either way.

alexander0042 commented 4 months ago

Just wanted to say that this is still in progress, and I'm targeting the end of the month to have the code released. I'm trying to optimize the Dask ingest pipeline a bit more to get it running on machines with ~16 GB of RAM, which should be possible, but it's a lot of data, so it's tricky at times.

mbomb007 commented 2 months ago

What if someone only needed a self-hosted instance to be able to get data for a single location? That wouldn't require downloading 100GB or as much bandwidth, right? Would it be possible to self-host and fetch data for a limited/fixed number of locations?

alexander0042 commented 2 months ago

Yeah, this is something I'm thinking about. It's definitely possible, since the ingest scripts feed everything through wgrib2, which would allow specific grid points to be selected and saved. The issue would be restructuring the rest of the code to handle much smaller domains, since right now everything assumes constant grid sizes.

Zarr sparse arrays might be the answer here, since then the overall shape of the data would stay the same, but only a small segment would be written (roughly the idea sketched below).
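A rough sketch of that idea (hypothetical shapes, path, and indices; Zarr only materialises the chunks that actually get written, everything else stays at the fill value):

import numpy as np
import zarr

# Full HRRR-sized grid (time, y, x), chunked so that writing one
# location only touches a single small chunk on disk.
arr = zarr.open(
    "hrrr_subset.zarr",
    mode="w",
    shape=(48, 1059, 1799),
    chunks=(48, 32, 32),
    dtype="f4",
    fill_value=np.nan,
)

# Write just the column for one grid point of interest (hypothetical indices).
arr[:, 500, 900] = np.linspace(260.0, 275.0, 48, dtype="f4")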

cloneofghosts commented 1 month ago

Now that the Time Machine stuff has been sorted and things are stable, I've been thinking about this. I know you want to try and optimise the code more before releasing, but you can always release it as-is and then add the optimisations in afterwards.

I know another big sticking point is what license to release the code under. While I'm not a license expert, you should probably re-read https://github.com/Pirate-Weather/pirateweather/issues/14 just to make sure you comply with the licenses of any libraries/code you are using.

project-owner commented 1 month ago

Just stretching the idea from mbomb007 a bit further: maybe create a library that could handle just one location, so it would be possible to use it even on the client side, handling everything in memory without saving GBs of data? I came to this project searching for a replacement for OpenWeather, which now asks for a credit card even for a free account.

mbomb007 commented 3 weeks ago

I came to this project searching for a replacement for OpenWeather, which now asks for a credit card even for a free account.

Same. I was previously using the legacy OpenWeather API, but they "turned it off" without notifying us. Their API was also quite inaccurate in its visibility response: it was capped at about 6 miles, because they always returned it in kilometers, capped at 10 km.

cloneofghosts commented 3 weeks ago

Same. I was previously using the legacy OpenWeather API, but they "turned it off" without notifying us.

I don't want to get too off-topic here, but I got an email from OpenWeatherMap on September 16th that they would be removing access to the One Call 2.5 API on September 23rd, even though it wasn't removed until the 27th. The non-One Call V2.5 APIs should still work even after the shutdown, but they have much less info than One Call.

I'll ping @alexander0042 to see if there's an updated timeline on when the source code can be released.

mbomb007 commented 3 weeks ago

I don't want to get too off-topic here but I got an email from OpenWeatherMap on September 16th that they would be removing access to the One Call 2.5 API on September 23rd even though it wasn't removed until the 27th.

Our One Call 2.5 API key stopped working even before that, in late August, I think.

cloneofghosts commented 3 weeks ago

Getting things back on topic: I know the license is a major sticking point for releasing the code. @alexander0042, I took a look at Open-Meteo and it's licensed under the AGPL, and I haven't seen any issues with people contributing to the project. It also has a separate repository for the website and code, in case you were curious.

alexander0042 commented 3 weeks ago

After spending way too long worrying about it, I'm pretty comfortable just sticking with the same Apache 2.0 setup that the HA repo uses. I'm still a little worried that some corporate entity might scoop it all up and start charging for it outside of the project, but at the end of the day I'd rather have a more useful codebase that can be applied anywhere, with a bunch of community development, than a restricted one that's hard for people to add to.

cloneofghosts commented 3 weeks ago

Alright, that sounds good to me. So is the timeline now to hopefully have the code released by the end of the month, or will it take longer?