RMI-PACTA / workflow.prepare.pacta.indices

This repository is used to run indices through PACTA, and prepare them for transition monitor.

Update github action to check if outputs already exist #103

Open AlexAxthelm opened 2 weeks ago

AlexAxthelm commented 2 weeks ago

Running index prep is a long process that's part of the current build process for workflow.transition.monitor, and will probably be part of the build for workflow.pacta.webapp as well.

Given that the indices don't actually change that much, it would make sense to check whether the process actually needs to run, or whether we can bypass it and return the previous results instead.

My general thinking is to construct a hash based on:

and use that as a versioning key, and then check whether the appropriate files already exist in the blob store / AFS.

So it would look something like:

# Download all the files from AZ
# Pull the base image

- id: hash
  run: |
    hash=some_magic_function(all those inputs)
    echo "hash=$hash" >> "$GITHUB_OUTPUT"

- name: Check if file exists on AZ
  id: check
  run: |
    response=$(az storage blob exists --blob-url "$CONTAINER_URL/${{ steps.hash.outputs.hash }}/foo.rds")
    parsed_response=$(echo "$response" | jq '.some_filter')
    echo "parsed_response=$parsed_response" >> "$GITHUB_OUTPUT"

- if: ${{ steps.check.outputs.parsed_response == 'true' }}
  # early return of the blob URL

- if: ${{ steps.check.outputs.parsed_response != 'true' }}
  # run the rest of the process normally
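For concreteness, one way "some_magic_function" could work is to checksum each input file and then hash the combined list; a minimal sketch, where config.yml and portfolios.csv are only stand-ins for whatever inputs end up in the list above:

- id: hash
  run: |
    # Checksum each placeholder input, then hash the concatenated checksums.
    # Keep the file list in a fixed order, since reordering changes the result.
    hash=$(sha256sum config.yml portfolios.csv | sha256sum | cut -d ' ' -f 1)
    echo "hash=$hash" >> "$GITHUB_OUTPUT"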

cc @jdhoffa @cjyetman: do I have the hash keys right, or am I missing something?

jdhoffa commented 1 day ago

@AlexAxthelm there are benchmark inputs that are scraped on the fly using the pacta.data.scraping package, e.g. https://github.com/RMI-PACTA/pacta.data.scraping/blob/main/R/get_ishares_index_data.R

I guess the hash would also need to depend on the result of that scraping to be complete?
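If so, one option (a sketch only; the step and file name are assumptions, not anything pacta.data.scraping actually does) would be to write the scraped benchmark data to disk and fold its checksum into the same key:

- name: Hash scraped benchmark data
  id: scrape_hash
  run: |
    # Assumes an earlier step saved the scraped iShares index data to scraped_index_data.csv.
    # This value would then be appended to the other inputs before computing the final hash.
    scrape_hash=$(sha256sum scraped_index_data.csv | cut -d ' ' -f 1)
    echo "scrape_hash=$scrape_hash" >> "$GITHUB_OUTPUT"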

cjyetman commented 1 day ago

Conceptually, makes sense to me, though...


Scratch all that... I forgot how this repo is being used. I'll have to think about that once my memory has improved.

AlexAxthelm commented 1 day ago
  • when / how often would the situation happen that the same TM Docker image is being used with the same data and the same benchmark portfolios? possibly on a re-run of actions on a PR when no changes have been made?

This happens a lot. We push a lot of changes to workflow.transition.monitor that don't change any of the processing code, or don't require a rebuild of the Docker image (or rather, the entire image can be rebuilt from cache).
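(Which is why including the base image's ID as one of the hash inputs should cover this case: a build that comes entirely from cache yields the same image ID, so those pushes would still hit the early-return path. A sketch, with $BASE_IMAGE as a placeholder for the actual image reference:)

- id: image_id
  run: |
    # Image IDs are content-addressed, so a fully cached rebuild produces the same
    # value and leaves the versioning key unchanged.
    image_id=$(docker inspect --format '{{.Id}}' "$BASE_IMAGE")
    echo "image_id=$image_id" >> "$GITHUB_OUTPUT"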