RMI-PACTA / workflow.prepare.pacta.indices

This repository is used to run indices through PACTA, and prepare them for transition monitor.

Update github action to check if outputs already exist #103

Open AlexAxthelm opened 2 weeks ago

AlexAxthelm commented 2 weeks ago

Running index prep is a long process that's part of the current build process for workflow.transition.monitor, and will probably be part of the build for workflow.pacta.webapp as well.

Given that the indices don't actually change that much, it would make sense to check whether the process actually needs to run, or whether we can bypass it and return the previous results instead.

My general thinking is to construct a hash based on:

and use that as a versioning key, and then check whether the appropriate files already exist in the blob store / AFS.

So it would look something like:

# Download all the files from AZ
# Pull the base image

- id: hash
  run: |
    hash=some_magic_function(all those inputs)
    echo "hash=$hash" >> "$GITHUB_OUTPUT"

- name: Check if file exists on AZ
  id: check
  run: |
    response=$(az storage blob exists --blob-url "$CONTAINER_URL/${{ steps.hash.outputs.hash }}/foo.rds")
    parsed_response=$(echo "$response" | jq '.some_filter')
    echo "parsed_response=$parsed_response" >> "$GITHUB_OUTPUT"

- if: ${{ steps.check.outputs.parsed_response == 'true' }}
  # early return of the blob URL

- if: ${{ steps.check.outputs.parsed_response != 'true' }}
  # run the rest of the process normally
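For concreteness, one way "some_magic_function" could work is to checksum each input file and then hash the combined list; a minimal sketch, where config.yml and portfolios.csv are only stand-ins for whatever inputs end up in the list above:

- id: hash
  run: |
    # Checksum each placeholder input, then hash the concatenated checksums.
    # Keep the file list in a fixed order, since reordering changes the result.
    hash=$(sha256sum config.yml portfolios.csv | sha256sum | cut -d ' ' -f 1)
    echo "hash=$hash" >> "$GITHUB_OUTPUT"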

cc @jdhoffa @cjyetman: do I have the hash keys right, or am I missing something?

jdhoffa commented 1 day ago

@AlexAxthelm there are benchmark inputs that are scraped on the fly using the pacta.data.scraping package, e.g. https://github.com/RMI-PACTA/pacta.data.scraping/blob/main/R/get_ishares_index_data.R

I guess the hash would also need to depend on the result of that scraping to be complete?
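If so, one option (a sketch only; the step and file name are assumptions, not anything pacta.data.scraping actually does) would be to write the scraped benchmark data to disk and fold its checksum into the same key:

- name: Hash scraped benchmark data
  id: scrape_hash
  run: |
    # Assumes an earlier step saved the scraped iShares index data to scraped_index_data.csv.
    # This value would then be appended to the other inputs before computing the final hash.
    scrape_hash=$(sha256sum scraped_index_data.csv | cut -d ' ' -f 1)
    echo "scrape_hash=$scrape_hash" >> "$GITHUB_OUTPUT"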

cjyetman commented 1 day ago

Conceptually, makes sense to me, though...


Scratch all that... I forgot how this repo is being used. I'll have to think about that once my memory has improved.

AlexAxthelm commented 1 day ago
  • when / how often would the situation happen that the same TM Docker image is being used with the same data and the same benchmark portfolios? possibly on a re-run of actions on a PR when no changes have been made?

This happens a lot. We push a lot of changes to workflow.transition.monitor that don't change any of the processing code, or don't require a rebuild of the Docker image (or rather, the entire image can be rebuilt from cache).
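(Which is why including the base image's ID as one of the hash inputs should cover this case: a build that comes entirely from cache yields the same image ID, so those pushes would still hit the early-return path. A sketch, with $BASE_IMAGE as a placeholder for the actual image reference:)

- id: image_id
  run: |
    # Image IDs are content-addressed, so a fully cached rebuild produces the same
    # value and leaves the versioning key unchanged.
    image_id=$(docker inspect --format '{{.Id}}' "$BASE_IMAGE")
    echo "image_id=$image_id" >> "$GITHUB_OUTPUT"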