NASA-IMPACT / veda-pforge-job-runner

Apache Beam + EMR Serverless Job Runner for Pangeo Forge Recipes

Passing: GPM MERGIR #17

Closed: ranchodeluxe closed this issue 3 months ago

ranchodeluxe commented 7 months ago

https://github.com/pangeo-forge/staged-recipes/pull/260

pangeo-forge-runner bake \
    --repo=https://github.com/developmentseed/pangeo-forge-staging \
    --ref="gpm_mergir_gcorradini" \
    --Bake.feedstock_subdir="recipes/gpm_mergir" \
    -f config.py 
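
For reference, a minimal sketch of what the config.py passed via -f might look like, using pangeo-forge-runner's traitlets-style configuration. The bucket names and paths below are placeholders (not the values used for this run), and the trait names should be checked against the pangeo-forge-runner docs:

# config.py (sketch only; placeholder buckets/paths)
c.Bake.prune = False  # True limits the recipe to a small test subset
c.TargetStorage.fsspec_class = "s3fs.S3FileSystem"
c.TargetStorage.root_path = "s3://my-output-bucket/gpm_mergir/output"
c.InputCacheStorage.fsspec_class = "s3fs.S3FileSystem"
c.InputCacheStorage.root_path = "s3://my-cache-bucket/gpm_mergir/cache"

The workflow-dispatch calls below trigger the same bake through GitHub Actions, first as a pruned test run ("prune": "1") and then against the full archive ("prune": "0"):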
curl -X POST \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
-H "Authorization: token blablah" \
https://api.github.com/repos/NASA-IMPACT/veda-pforge-job-runner/actions/workflows/job-runner.yaml/dispatches \
-d '{"ref":"main", "inputs":{"repo":"https://github.com/developmentseed/pangeo-forge-staging","ref":"gpm_mergir_gcorradini","prune":"1","feedstock_subdir": "recipes/gpm_mergir"}}'
curl -X POST \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
-H "Authorization: token blablah" \
https://api.github.com/repos/NASA-IMPACT/veda-pforge-job-runner/actions/workflows/job-runner.yaml/dispatches \
-d '{"ref":"main", "inputs":{"repo":"https://github.com/developmentseed/pangeo-forge-staging","ref":"gpm_mergir_gcorradini","prune":"0","feedstock_subdir": "recipes/gpm_mergir"}}'
ranchodeluxe commented 7 months ago

I'm able to run this locally and on Flink but here's a list of recipe problems that probably need to be fixed before we could run this for the whole ~225k archive (also applicable to https://github.com/NASA-IMPACT/veda-pforge-job-runner/issues/15):

abarciauskas-bgse commented 7 months ago

@ranchodeluxe the target dataset was intended to be GPM IMERG so I have opened a new (draft) PR https://github.com/pangeo-forge/staged-recipes/pull/264 which thankfully has fewer files (8,461). Should we update this issue with that collection or close this and open a new issue?

ranchodeluxe commented 7 months ago

@ranchodeluxe the target dataset was intended to be GPM IMERG so I have opened a new (draft) PR pangeo-forge/staged-recipes#264 which thankfully has fewer files (8,461). Should we update this issue with that collection or close this and open a new issue?

Thanks for doing that 🥳 I'll update this ticket to point to your new PR branch and test it out locally and on Flink

ranchodeluxe commented 7 months ago

@abarciauskas-bgse: I was creating a PR for https://github.com/pangeo-forge/staged-recipes/pull/264 to fold in the pangeo-forge-recipes changes that just merged. I thought I'd write a new validator/tester function to make sure the reference file that ConsolidateMetadata outputs works as expected. Here is my updated recipe

Outcomes:

  1. we can read the reference file fine with zarr.open_consolidated, but...
  2. xr.open_dataset(..., consolidated=True) should work too but doesn't

Filing a ticket
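
For concreteness, a minimal sketch of that kind of reference-file check, assuming a kerchunk-style JSON reference on S3; the path and remote options are placeholders:

import fsspec
import xarray as xr
import zarr

# Hypothetical validator for the reference file that ConsolidateMetadata writes.
# ref_path and the remote options are placeholders.
def validate_reference(ref_path: str) -> None:
    fs = fsspec.filesystem(
        "reference",
        fo=ref_path,
        remote_protocol="s3",
        remote_options={"anon": True},
    )
    mapper = fs.get_mapper("")

    # 1. zarr can open the consolidated store fine
    group = zarr.open_consolidated(mapper)
    print("zarr arrays:", list(group.array_keys()))

    # 2. xarray should be able to open the same store, but currently raises
    ds = xr.open_dataset(mapper, engine="zarr", consolidated=True)
    print(ds)

validate_reference("s3://my-output-bucket/gpm_mergir/reference.json")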

ranchodeluxe commented 7 months ago

@abarciauskas-bgse: I was creating a PR for pangeo-forge/staged-recipes#264 ...

I guess the good news is this mostly works

ranchodeluxe commented 7 months ago

ticket: https://github.com/pangeo-forge/pangeo-forge-recipes/issues/675

ranchodeluxe commented 7 months ago

Calling this a success and changing the status to passing, since I've run multiple years on Flink and locally to prove it works. There still seem to be holes in some datasets that result in 404s, so we'll have to figure out where the holes are incrementally, work around them, and tell folks upstream.

ranchodeluxe commented 7 months ago

Calling this a success and changing the status to passing, since I've run multiple years on Flink and locally to prove it works. There still seem to be holes in some datasets that result in 404s, so we'll have to figure out where the holes are incrementally, work around them, and tell folks upstream.

Seems we have holes after ~15 years of data (which makes sense based on history of IMERG and TRMM).

This run can do all 14 years in ~9 minutes with parallelism:5: https://github.com/NASA-IMPACT/veda-pforge-job-runner/actions/runs/7679542574
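
One hedged way to map those holes ahead of time is to HEAD-check candidate URLs and record the 404s; the URL template and date range below are placeholders, not the real GPM endpoints:

import datetime as dt
import requests

# Hypothetical hole-finder: report which daily files return HTTP 404.
# URL_TEMPLATE is a placeholder, not the actual archive layout.
URL_TEMPLATE = "https://example.com/gpm_mergir/{date:%Y/%m}/merg_{date:%Y%m%d}.nc4"

def find_holes(start: dt.date, end: dt.date) -> list[dt.date]:
    missing = []
    day = start
    while day <= end:
        resp = requests.head(URL_TEMPLATE.format(date=day), allow_redirects=True, timeout=30)
        if resp.status_code == 404:
            missing.append(day)
        day += dt.timedelta(days=1)
    return missing

print(len(find_holes(dt.date(2000, 6, 1), dt.date(2001, 5, 31))), "missing days")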

abarciauskas-bgse commented 7 months ago

This run can do all 14 years in ~9 minutes with parallelism:5: https://github.com/NASA-IMPACT/veda-pforge-job-runner/actions/runs/7679542574

That's awesome!!! 🥳

ranchodeluxe commented 6 months ago

@abarciauskas-bgse: profile of LocalDirectRunner.num_workers=1 on JH with this GPM IMERG recipe. So it's the same memory pattern locally (just not as drastic as TRMM), and it possibly only runs on Flink because the current per-worker resourcing is so large that it just works. More to look into. About to run LEAP (which is only StoreToZarr) and it will definitely show the same pattern. I have ideas on how to isolate the issue if it is a memory leak and will create a ticket.

(screenshot: memory profile of the LocalDirectRunner run, 2024-02-04)
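
For context, one simple way to capture this kind of memory profile is to sample the process RSS with psutil while the bake runs locally; this is a sketch, not the profiling setup actually used here, and the interval/output path are arbitrary:

import csv
import threading
import time

import psutil

# Sketch: log this process's resident memory every few seconds to a CSV,
# e.g. started just before kicking off a LocalDirectRunner bake in-process.
def start_rss_sampler(path: str = "rss_profile.csv", interval: float = 5.0) -> threading.Thread:
    proc = psutil.Process()

    def _loop() -> None:
        start = time.time()
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["elapsed_s", "rss_mb"])
            while True:
                writer.writerow([round(time.time() - start, 1),
                                 round(proc.memory_info().rss / 1e6, 1)])
                f.flush()
                time.sleep(interval)

    t = threading.Thread(target=_loop, daemon=True)
    t.start()
    return t
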
ranchodeluxe commented 6 months ago

ticket: https://github.com/NASA-IMPACT/veda-pforge-job-runner/issues/32