NASA-IMPACT / veda-pforge-job-runner

Apache Beam + EMR Serverless Job Runner for Pangeo Forge Recipes

Passing: GPM MERGIR #17

Closed: ranchodeluxe closed this issue 3 months ago

ranchodeluxe commented 7 months ago

https://github.com/pangeo-forge/staged-recipes/pull/260

pangeo-forge-runner bake \
    --repo=https://github.com/developmentseed/pangeo-forge-staging \
    --ref="gpm_mergir_gcorradini" \
    --Bake.feedstock_subdir="recipes/gpm_mergir" \
    -f config.py 
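
For reference, a minimal sketch of what the config.py passed via -f might look like, using pangeo-forge-runner's traitlets-style configuration. The bucket names and paths below are placeholders (not the values used for this run), and the trait names should be checked against the pangeo-forge-runner docs:

# config.py (sketch only; placeholder buckets/paths)
c.Bake.prune = False  # True limits the recipe to a small test subset
c.TargetStorage.fsspec_class = "s3fs.S3FileSystem"
c.TargetStorage.root_path = "s3://my-output-bucket/gpm_mergir/output"
c.InputCacheStorage.fsspec_class = "s3fs.S3FileSystem"
c.InputCacheStorage.root_path = "s3://my-cache-bucket/gpm_mergir/cache"

The workflow-dispatch calls below trigger the same bake through GitHub Actions, first as a pruned test run ("prune": "1") and then against the full archive ("prune": "0"):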
curl -X POST \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
-H "Authorization: token blablah" \
https://api.github.com/repos/NASA-IMPACT/veda-pforge-job-runner/actions/workflows/job-runner.yaml/dispatches \
-d '{"ref":"main", "inputs":{"repo":"https://github.com/developmentseed/pangeo-forge-staging","ref":"gpm_mergir_gcorradini","prune":"1","feedstock_subdir": "recipes/gpm_mergir"}}'
curl -X POST \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
-H "Authorization: token blablah" \
https://api.github.com/repos/NASA-IMPACT/veda-pforge-job-runner/actions/workflows/job-runner.yaml/dispatches \
-d '{"ref":"main", "inputs":{"repo":"https://github.com/developmentseed/pangeo-forge-staging","ref":"gpm_mergir_gcorradini","prune":"0","feedstock_subdir": "recipes/gpm_mergir"}}'
ranchodeluxe commented 7 months ago

I'm able to run this locally and on Flink but here's a list of recipe problems that probably need to be fixed before we could run this for the whole ~225k archive (also applicable to https://github.com/NASA-IMPACT/veda-pforge-job-runner/issues/15):

abarciauskas-bgse commented 7 months ago

@ranchodeluxe the target dataset was intended to be GPM IMERG so I have opened a new (draft) PR https://github.com/pangeo-forge/staged-recipes/pull/264 which thankfully has fewer files (8,461). Should we update this issue with that collection or close this and open a new issue?

ranchodeluxe commented 7 months ago

@ranchodeluxe the target dataset was intended to be GPM IMERG so I have opened a new (draft) PR pangeo-forge/staged-recipes#264 which thankfully has fewer files (8,461). Should we update this issue with that collection or close this and open a new issue?

Thanks for doing that 🥳 I'll update this ticket to point to your new PR branch and test it out locally and on Flink

ranchodeluxe commented 7 months ago

@abarciauskas-bgse: I was creating a PR for https://github.com/pangeo-forge/staged-recipes/pull/264 to fold in the pangeo-forge-recipes changes that just merged. I thought I'd write a new validator/tester function to make sure the reference file that ConsolidateMetadata outputs works as expected. Here is my updated recipe

Outcomes:

  1. we can read the reference file fine with zarr.open_consolidated, but...
  2. xr.open_dataset(..., consolidated=True) should work too but doesn't

Filing a ticket
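
For concreteness, a minimal sketch of that kind of reference-file check, assuming a kerchunk-style JSON reference on S3; the path and remote options are placeholders:

import fsspec
import xarray as xr
import zarr

# Hypothetical validator for the reference file that ConsolidateMetadata writes.
# ref_path and the remote options are placeholders.
def validate_reference(ref_path: str) -> None:
    fs = fsspec.filesystem(
        "reference",
        fo=ref_path,
        remote_protocol="s3",
        remote_options={"anon": True},
    )
    mapper = fs.get_mapper("")

    # 1. zarr can open the consolidated store fine
    group = zarr.open_consolidated(mapper)
    print("zarr arrays:", list(group.array_keys()))

    # 2. xarray should be able to open the same store, but currently raises
    ds = xr.open_dataset(mapper, engine="zarr", consolidated=True)
    print(ds)

validate_reference("s3://my-output-bucket/gpm_mergir/reference.json")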

ranchodeluxe commented 7 months ago

@abarciauskas-bgse: I was creating a PR for pangeo-forge/staged-recipes#264 ...

I guess the good news is this mostly works

ranchodeluxe commented 7 months ago

ticket: https://github.com/pangeo-forge/pangeo-forge-recipes/issues/675

ranchodeluxe commented 7 months ago

Calling this a success and changing the status to passing, since I've run multiple years on Flink and locally to prove it works. There still seem to be holes in some datasets that result in 404s, so we'll have to figure out where the holes are incrementally, work around them, and tell folks upstream.

ranchodeluxe commented 7 months ago

Calling this a success and changing the status to passing, since I've run multiple years on Flink and locally to prove it works. There still seem to be holes in some datasets that result in 404s, so we'll have to figure out where the holes are incrementally, work around them, and tell folks upstream.

Seems we have holes after ~15 years of data (which makes sense based on history of IMERG and TRMM).

This run can do all 14 years in ~9 minutes with parallelism:5: https://github.com/NASA-IMPACT/veda-pforge-job-runner/actions/runs/7679542574
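
One hedged way to map those holes ahead of time is to HEAD-check candidate URLs and record the 404s; the URL template and date range below are placeholders, not the real GPM endpoints:

import datetime as dt
import requests

# Hypothetical hole-finder: report which daily files return HTTP 404.
# URL_TEMPLATE is a placeholder, not the actual archive layout.
URL_TEMPLATE = "https://example.com/gpm_mergir/{date:%Y/%m}/merg_{date:%Y%m%d}.nc4"

def find_holes(start: dt.date, end: dt.date) -> list[dt.date]:
    missing = []
    day = start
    while day <= end:
        resp = requests.head(URL_TEMPLATE.format(date=day), allow_redirects=True, timeout=30)
        if resp.status_code == 404:
            missing.append(day)
        day += dt.timedelta(days=1)
    return missing

print(len(find_holes(dt.date(2000, 6, 1), dt.date(2001, 5, 31))), "missing days")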

abarciauskas-bgse commented 7 months ago

This run can do all 14 years in ~9 minutes with parallelism:5: https://github.com/NASA-IMPACT/veda-pforge-job-runner/actions/runs/7679542574

That's awesome!!! 🥳

ranchodeluxe commented 6 months ago

@abarciauskas-bgse: profile of LocalDirectRunner.num_workers=1 on JH with this GPM IMERG recipe. So it's the same memory pattern locally (just not as drastic as TRMM), and it possibly only runs on Flink because the current per-worker resourcing is so large that it just works. More to look into. About to run LEAP (which is only StoreToZarr) and it will definitely show the same pattern. I have ideas on how to isolate the issue if it is a memory leak and will create a ticket.

(screenshot: memory profile of the LocalDirectRunner run, 2024-02-04)
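
For context, one simple way to capture this kind of memory profile is to sample the process RSS with psutil while the bake runs locally; this is a sketch, not the profiling setup actually used here, and the interval/output path are arbitrary:

import csv
import threading
import time

import psutil

# Sketch: log this process's resident memory every few seconds to a CSV,
# e.g. started just before kicking off a LocalDirectRunner bake in-process.
def start_rss_sampler(path: str = "rss_profile.csv", interval: float = 5.0) -> threading.Thread:
    proc = psutil.Process()

    def _loop() -> None:
        start = time.time()
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["elapsed_s", "rss_mb"])
            while True:
                writer.writerow([round(time.time() - start, 1),
                                 round(proc.memory_info().rss / 1e6, 1)])
                f.flush()
                time.sleep(interval)

    t = threading.Thread(target=_loop, daemon=True)
    t.start()
    return t
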
ranchodeluxe commented 6 months ago

ticket: https://github.com/NASA-IMPACT/veda-pforge-job-runner/issues/32