ranchodeluxe closed this issue 4 months ago.
The way Quinlan wrote this, the data is downloaded during task kickoff on the host. So we should rewrite it as I mentioned in this Slack thread: https://developmentseed.slack.com/archives/C04PY5QRHCM/p1698890912079009
It needs to be rewritten to be lazy anyhow, like this: https://github.com/ranchodeluxe/mursst-example/blob/main/feedstock/recipe.py
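For illustration, here is a minimal sketch (not the actual recipe; the URL template and function names are made up) of what "lazy" means here: the inputs are represented as URLs only, and nothing is downloaded until the pipeline workers open them.

```python
from datetime import date, timedelta

def make_url(day: date) -> str:
    # Illustrative URL template -- not the real PO.DAAC path
    return f"https://example.com/mursst/{day:%Y%m%d}090000-MUR.nc"

def input_urls(start: date, ndays: int):
    # A generator: nothing is fetched here. In the real recipe this role is
    # played by pangeo_forge_recipes.patterns.FilePattern, and the files are
    # opened lazily inside Beam transforms (e.g. OpenURLWithFSSpec) rather
    # than being downloaded on the host at task kickoff.
    for i in range(ndays):
        yield make_url(start + timedelta(days=i))

urls = list(input_urls(date(2002, 6, 1), 3))
```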
Local runs fail, and the hunch is that the parent process is being killed during rechunking b/c it's taking too much memory. Running on the Flink big box to gauge cost.
The Flink run fails here, and it seems there's something wrong with the data/logic: https://github.com/NASA-IMPACT/veda-pforge-job-runner/actions/runs/7469136161
@ranchodeluxe should I create a new issue in this repo for the kerchunk recipe? I modified the kerchunk recipe in https://github.com/pangeo-forge/staged-recipes/pull/259 to use `FilePattern` in place of `GranuleQuery`, and also the `WriteCombinedReference` from https://github.com/pangeo-forge/pangeo-forge-recipes/pull/660. It works locally for 30 timesteps.
However, it is only working locally because I am using HTTPS, so I would like to try running it on Flink so it can use direct S3 access. Right now the protocol is hard-coded as HTTPS, but I am thinking we should make it configurable, either as an environment variable or (I think more ideal) as part of the configuration. Do you know if there is any documentation on how to pass configuration parameters, either via the command line or a config file?
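For context (and as my reading of how the runner is wired, not official documentation): `pangeo-forge-runner` is configured via traitlets, so any configurable can be set either on the command line as `--Class.trait=value` or in the config file passed with `-f`. The values below are illustrative:

```python
# local-runner-config.py (traitlets config fragment): every
# `c.Class.trait = value` line is equivalent to `--Class.trait=value`
# on the command line.
c.Bake.recipe_id = "MUR-JPL-L4-GLOB-v4.1"  # same as --Bake.recipe_id=...
c.Bake.job_name = "local_test"             # same as --Bake.job_name=...
```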
> @ranchodeluxe should I create a new issue in this repo for the kerchunk recipe? I modified the kerchunk recipe in pangeo-forge/staged-recipes#259 to use `FilePattern` in place of `GranuleQuery`, and also the `WriteCombinedReference` from pangeo-forge/pangeo-forge-recipes#660. It works locally for 30 timesteps.
Nice! You could just edit this existing one above and point the examples to the right repo and ref, and that's about the same thing. But your call.
> However, it is working locally because I am using HTTPS so I think I would like to try running it on Flink so it can use direct S3 access. Right now, the protocol is hard-coded as HTTPS but I am thinking we should make it configurable as either an environment variable or (I think more ideal) part of the configuration. Do you know if there is any documentation on how to pass configuration parameters, either via the command line or config file?
For the time being we can just add a global input here and add it as an os env var here (like the secrets are being handled for now). Then your recipe can conditionally check `os.environ.get('WHATEVER')`. Let me know if that makes sense. There still might be a use case for a traitlet option to pump arbitrary key/values per recipe into the os environment, but we can sip on that idea for now like a fine wine.
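A minimal sketch of that pattern (all names are illustrative: neither the `PROTOCOL` variable nor the URL templates are from the actual recipe): the job runner exports a global input as an env var, and the recipe branches on it.

```python
import os

# Read the protocol from the environment, defaulting to HTTPS when the
# variable is not set (e.g. for local runs).
protocol = os.environ.get("PROTOCOL", "https")

def granule_url(granule_id: str) -> str:
    # Direct S3 access when running in-region on Flink; HTTPS otherwise.
    # Bucket and host names below are placeholders, not the real endpoints.
    if protocol == "s3":
        return f"s3://example-protected-bucket/MUR-JPL-L4-GLOB-v4.1/{granule_id}.nc"
    return f"https://archive.example.nasa.gov/MUR-JPL-L4-GLOB-v4.1/{granule_id}.nc"
```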
Thanks @ranchodeluxe I:

1. Updated the issue above to include the local run and future Flink run for kerchunk. One question I have: does it make sense to include "Runs on `LocalDirectBakery` for all timesteps"? Do we need to include this test if it's working for a few files? I feel like it makes sense to move to Flink at that point. Perhaps testing more than 2 files makes sense, but testing with the entire archive seems like what Flink is designed to handle.

As noted in the issue text, I think we need to merge #25 and #28 before we can run this with the Flink cluster hooked up to this repo, is that right?
> As noted in the issue text, I think we need to merge #25 and #28 before we can run this with the flink cluster hooked up to this repo, is that right?
Yep, merged these blocking PRs.
> 1. Updated the issue above to include the local run and future flink run for kerchunk. One question I have is does it make sense to include "Runs on `LocalDirectBakery` for all timesteps". Do we need to include this test if its working for a few files, I feel like it makes sense to move to flink at that point. Perhaps testing more than 2 files makes sense, but testing with the entire archive seems like that is what flink is designed to handle.
I mean you're right that we shouldn't need to be running all timesteps on the `LocalDirectBakery`. The reason I had that listed is b/c this whole stack is so new/flakey and `LocalDirectBakery` is really the only way to get accurate feedback about whether something is working or not. I'll probably keep running things in full on there (EC2 machine with 8 cores so it's real fast 😉) until I know that the current blockers (below) to ETL'ing full datasets are resolved. The first one is the most important blocker:
1. https://github.com/pangeo-forge/pangeo-forge-recipes/issues/667: `pangeo-forge-recipes` only has the changes we need on `main`, and a release needs to be cut for these changes. That cut should also include this pending bug fix, IMHO.
2. https://github.com/NASA-IMPACT/veda-pforge-job-runner/issues/27
Thanks @ranchodeluxe
> this whole stack is so new/flakey and `LocalDirectBakery` is really the only way to get accurate feedback about whether something is working or not
That makes sense, and it seems like something we need to resolve to make Flink useful; otherwise, why wouldn't we just use the output from "tests" for all timesteps on EC2? (Reuse and reproducibility, of course.) Thanks for opening https://github.com/NASA-IMPACT/veda-pforge-job-runner/issues/27 to address this.
> otherwise why wouldn't we just use the output from "tests" for all timesteps on EC2?
@abarciauskas-bgse: that's the backup plan 😉 Maybe we use AWS Batch. And when I say the "whole stack being new/flakey" I'm also talking about `pangeo-forge-recipes`, not just the Flink runner. That said, we still haven't seen anything fail on Flink that runs locally, so as long as we get good error feedback there's nothing wrong with Flink at this moment.
I did some testing on EC2 to try and figure out, if we did use a local runner, what type of resources we might need to generate the kerchunk reference in a reasonable amount of time and without error.
sudo apt-get update
sudo apt-get install python3-pip
python3 -m pip install --upgrade pip
pip install \
  fsspec \
  s3fs \
  boto3 \
  requests \
  apache-beam==2.52.0 \
  'pangeo-forge-runner>=0.9.1' \
  git+https://github.com/pangeo-forge/pangeo-forge-recipes.git@main
export PATH=$PATH:/home/ubuntu/.local/bin
# vi local-runner-config.py
c.Bake.bakery_class = "pangeo_forge_runner.bakery.local.LocalDirectBakery"
c.MetadataCacheStorage.fsspec_class = "fsspec.implementations.local.LocalFileSystem"
# Metadata cache should be per `{{job_name}}`, as kwargs changing can change metadata
c.MetadataCacheStorage.root_path = "./metadata"
c.TargetStorage.fsspec_class = "fsspec.implementations.local.LocalFileSystem"
c.TargetStorage.root_path = "./target"
c.InputCacheStorage.fsspec_class = "fsspec.implementations.local.LocalFileSystem"
c.InputCacheStorage.root_path = "./cache"
Modify the recipe in GitHub for different temporal ranges in order to evaluate the duration of the recipe run and the size of the output.
git clone https://github.com/developmentseed/pangeo-forge-staging
cd pangeo-forge-staging
git checkout mursst-kerchunk
cd ..
vi pangeo-forge-staging/recipes/mursst/recipe.py
export EDU=aimeeb
export EDP="XXX"
time EARTHDATA_USERNAME=$EDU EARTHDATA_PASSWORD=$EDP PROTOCOL=s3 pangeo-forge-runner bake \
--repo=./pangeo-forge-staging \
--ref="mursst-kerchunk" \
-f local-runner-config.py \
--feedstock-subdir="recipes/mursst" \
--Bake.recipe_id=MUR-JPL-L4-GLOB-v4.1 --Bake.job_name=local_test
notes:
days | time (seconds) | size (mb) |
---|---|---|
30 | 71 | 22 |
61 | 145 | 42 |
91 | 238 | 62 |
122 | 334 | "Channel closed prematurely" Error (see full traceback below) |
days | time (seconds) | size (mb) |
---|---|---|
30 | 37 | 14 |
61 | 68 | 25 |
91 | 101 | 34 |
122 | 137 | 44 |
153 | 170 | 54 |
183 | 161 | 64 |
days | time (seconds) | size (mb) |
---|---|---|
30 | 27 | 14 |
61 | 47 | 19 |
91 | 64 | 24 |
122 | 77 | 29 |
153 | 92 | 35 |
183 | 118 | 39 |
Every month adds at least 5 MB to the reference file data: roughly 70 MB per year, or about 1,400 MB (1.4 GB) for 20 years.
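As a sanity check on that arithmetic, a tiny projection (the per-month rate is an assumed average read off the tables above; the actual rate depends on recipe settings and chunking):

```python
# Back-of-envelope projection of kerchunk reference-file growth
MB_PER_MONTH = 5.8  # assumed average rate, roughly matching the tables above

def projected_mb(years: int) -> float:
    """Estimated reference size in MB for `years` of daily data."""
    return MB_PER_MONTH * 12 * years

print(round(projected_mb(1)))   # 70
print(round(projected_mb(20)))  # 1392, i.e. roughly 1.4 GB
```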
I saw quite a few of these:
/home/ubuntu/.local/lib/python3.10/site-packages/kerchunk/combine.py:269: UserWarning: Concatenated coordinate 'time' contains less than expected number of values across the datasets: [676285200]
which I'm concerned has to do with the different sizes
Metadata does not appear to be consolidated when I open the dataset.
@ranchodeluxe @norlandrhagen last Friday I mentioned I was running into this error: https://github.com/fsspec/kerchunk/blob/063684618c053e93e3f1f25c4688ec2765c0d962/kerchunk/combine.py#L501-L506
It does appear there are a few days in 2023 where the MUR SST netCDF data is chunked differently than all the other days. I started to go down a 🐇 🕳️ of kerchunk and pangeo-forge-recipes, which was a goose chase, because I could actually see the different chunk shapes if I just updated my version of xarray or used h5py or h5netcdf (see https://github.com/pydata/xarray/issues/8691).
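For anyone reproducing that check, a sketch of inspecting the on-disk chunk shape directly with h5py (the file path and variable name in the commented call are hypothetical):

```python
import h5py

def chunk_shape(path: str, varname: str):
    # h5py exposes the HDF5 chunk layout via Dataset.chunks
    # (a tuple, or None for contiguous storage)
    with h5py.File(path, "r") as f:
        return f[varname].chunks

# e.g. chunk_shape("some-mursst-granule.nc", "analysed_sst")
```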
Unfortunately, the different chunk shape is not an issue for just a few files. All of the data has chunk shape `(1, 1023, 2047)` except the following date ranges:
My understanding is we cannot create kerchunk references for data with variable chunk shapes, which I think is the reason for the Variable Chunking Zarr Enhancement Proposal (ZEP).
In lieu of support for variable chunking in Zarr, there are 2 resolutions I can think of:
Curious what you think, and also @sharkinsspatial
Thanks for the investigation @abarciauskas-bgse! Happy to chat about this tmrrw.
I wonder if there is a 3. of reaching out to the data provider and see if we can get any clarification on why this is happening.
@norlandrhagen Totally agree, I was thinking to reach out to PO.DAAC to see if there are plans to backprocess to complete the new chunk shape across the whole dataset or otherwise deliver a consistently chunked version.
Seems like that would be the best way forward, but having the variable chunking ZEP in place would be super helpful for cases like this.
To recap from our meeting today, I think the next steps will be:
kerchunk: https://github.com/developmentseed/pangeo-forge-staging/tree/mursst-kerchunk
- Runs on `LocalDirectBakery` using prune option (`local-runner-config.py` at bottom)
- Runs on `LocalDirectBakery` for all timesteps (same as above without `--prune`)
- Runs on `FlinkOperatorBakery` for prune option: this has not been attempted yet because it depends on https://github.com/NASA-IMPACT/veda-pforge-job-runner/pull/28 (for the protocol parameter) and https://github.com/NASA-IMPACT/veda-pforge-job-runner/pull/25 (for passing Earthdata username and password)
- Runs on `FlinkOperatorBakery` for all timesteps: not yet tested