RMI-PACTA / workflow.pacta.report

Consider if it's possible to run PACTA on both `input_portfolio` AND indices portfolios, using the same `workflow.pacta` based image #3

Open jdhoffa opened 7 months ago

jdhoffa commented 7 months ago

It would be useful to use the "same" build of workflow.pacta to run both the input portfolio and the index portfolios that are needed to generate the PACTA report.

A critical consideration, flagged by @cjyetman, is that re-running the same indices on every portfolio upload may be undesirable from a time-efficiency perspective. Caching seems like an appropriate solution there; however, the web app would then need some form of long-lived "cached" folder that multiple different users could access.

Relates to (there is fascinating discussion there!) https://github.com/RMI-PACTA/workflow.prepare.pacta.indices/issues/46

cjyetman commented 7 months ago

Another bit of weirdness... to process the indices, technically only workflow.pacta is needed (no report is needed)... though I think the suggestion is here because (at the moment) the only time we use index results as a comparison is in reports. That could hypothetically change, though, and then it might make sense for workflow.pacta to process a portfolio plus whatever other benchmark/comparison portfolios are relevant. Or maybe an in-between workflow, like workflow.pacta.with.comparisons, that would run workflow.pacta on multiple portfolios, put their results in an appropriate place, and then pass them on to workflow.pacta.report or any other workflow that takes "PACTA results" and formats them into some kind of report/output.

jdhoffa commented 7 months ago

Definitely.

In general, I think an appropriate process would be:

I think workflow.pacta has clear utility in multiple realms, so it makes sense to refactor it out. Indices processing is, at the moment, only useful for report generation, so it makes sense for it to happen/live there, but if we end up needing it elsewhere, it would be a good idea to refactor it :-)

AlexAxthelm commented 7 months ago

I haven't fully thought this through, but my inclination would be to do something along the lines of storing index benchmarks separate from the rest of the pacta-data, and then accessing them procedurally using a unique ID of some kind (maybe the SHA of the workflow.pacta docker image used)?

which would give us something along the lines of

├── pacta-data
│   ├── 2022Q4_dataset1
│   ├── 2022Q4_dataset2
│   └── 2023Q4_dataset1
└── benchmarks
    ├── index1_a428de44a9059f31a59237a5881c2d2cffa93757d99026156e4ea544577ab7f3
    ├── index1_220611111e8c9bbe242e9dc1367c0fa89eef83f26203ee3f7c3764046e02b248
    └── index2_220611111e8c9bbe242e9dc1367c0fa89eef83f26203ee3f7c3764046e02b248

and then we could grab the relevant benchmark files via something like:

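# keep only the benchmark directories whose names contain the unique ID
benchmark_results <- grep(unique_id, list.dirs("benchmarks", recursive = FALSE), fixed = TRUE, value = TRUE)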

and raise an error if we don't get what we expect to see (correct list of benchmarks)
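
A minimal shell sketch of that lookup-and-check step, assuming benchmark directories are named `<index>_<unique_id>` as in the tree above. The function name `find_benchmarks` and the index names passed to it are hypothetical and not part of workflow.pacta.report; the same logic could equally live in R.

```bash
#!/usr/bin/env bash
# Sketch only: collect the benchmark directories for one workflow.pacta build
# and fail if any expected index is missing. All names are illustrative.

# Usage: find_benchmarks BENCHMARKS_DIR UNIQUE_ID INDEX...
# Prints one matching directory per line on stdout.
# Returns 1 (and reports on stderr) if any expected index has no directory.
find_benchmarks() {
  local benchmarks_dir="$1" unique_id="$2"
  shift 2
  local index dir missing=0
  for index in "$@"; do
    dir="${benchmarks_dir}/${index}_${unique_id}"
    if [ -d "$dir" ]; then
      printf '%s\n' "$dir"
    else
      printf 'missing benchmark: %s\n' "$dir" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

It could be called as `find_benchmarks benchmarks "$unique_id" index1 index2`, with the caller aborting the report when the return status is non-zero.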

In any case, I'm still sorting out a good way to extract properties of a docker image from inside a container (not readily available, apparently). A possible alternative to using the SHA of the base image would be to write some string to a file in workflow.pacta via something along the lines of

RUN git rev-parse HEAD >> /usr/local/etc/workflow.pacta.txt && \
  echo "$RANDOM" >> /usr/local/etc/workflow.pacta.txt

and read the unique ID from there.
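
Reading the ID back could look roughly like the sketch below. It assumes the file written by the `RUN` line above holds one value per line, and it joins the non-empty lines with `-`; the function name `read_unique_id` and the joining scheme are assumptions for illustration only. Empty lines are skipped because `$RANDOM` is a bash feature and may expand to an empty string if the image's `/bin/sh` is not bash.

```bash
#!/usr/bin/env bash
# Sketch only: build a unique ID from the file written at image build time.
# The default path is the one used in the RUN line above; the "-" separator
# and the function name are assumptions.

# Usage: read_unique_id [ID_FILE]
# Prints the non-empty lines of ID_FILE joined with "-".
# Returns 1 if the file is unreadable or contains no usable lines.
read_unique_id() {
  local id_file="${1:-/usr/local/etc/workflow.pacta.txt}"
  if [ ! -r "$id_file" ]; then
    printf 'cannot read ID file: %s\n' "$id_file" >&2
    return 1
  fi
  local line id=""
  # "|| [ -n "$line" ]" also picks up a final line without a trailing newline
  while IFS= read -r line || [ -n "$line" ]; do
    [ -n "$line" ] || continue
    id="${id:+${id}-}${line}"
  done < "$id_file"
  if [ -z "$id" ]; then
    printf 'ID file is empty: %s\n' "$id_file" >&2
    return 1
  fi
  printf '%s\n' "$id"
}
```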

Definitely open to suggestions on this.

AlexAxthelm commented 7 months ago

Also, worth noting that preparing indices for workflow.transition.monitor should not be included in this process

jdhoffa commented 7 months ago

This sounds reasonable! I guess the main thing to note is that the benchmarks directory would only need to be volume-mounted into workflow.pacta.report (not workflow.pacta). So running workflow.pacta.report would check whether benchmarks contains what it needs: if it does, it would use them, and if it doesn't, it would generate them using the same workflow.pacta image that it uses for the rest of the process...? Does this make sense?
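
A rough shell sketch of that check-then-generate idea, assuming a writable benchmarks mount and directories named `<index>_<unique_id>`. The function `ensure_benchmarks` and the generator command it calls are hypothetical; the generator stands in for whatever actually runs workflow.pacta on an index portfolio. Results are written to a `.partial` directory first and renamed afterwards, so that an interrupted run is less likely to be mistaken for a finished cache entry.

```bash
#!/usr/bin/env bash
# Sketch only: reuse cached benchmark results when present, otherwise
# generate them. All names are illustrative.

# Usage: ensure_benchmarks BENCHMARKS_DIR UNIQUE_ID GENERATE_CMD INDEX...
# GENERATE_CMD is called as: GENERATE_CMD INDEX OUTPUT_DIR
ensure_benchmarks() {
  local benchmarks_dir="$1" unique_id="$2" generate_cmd="$3"
  shift 3
  local index dir partial_dir
  for index in "$@"; do
    dir="${benchmarks_dir}/${index}_${unique_id}"
    if [ -d "$dir" ]; then
      echo "using cached benchmark: $dir"
      continue
    fi
    echo "generating benchmark: $dir"
    partial_dir="${dir}.partial"
    rm -rf "$partial_dir"
    mkdir -p "$partial_dir"
    # Stop at the first failure; the final directory is never created
    "$generate_cmd" "$index" "$partial_dir" || return 1
    mv "$partial_dir" "$dir"
  done
}
```

With a read-only mount the generation branch would fail, so this only fits the variant where workflow.pacta.report is allowed to write to benchmarks.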

AlexAxthelm commented 7 months ago

> I guess running workflow.pacta.report would then check if benchmarks contains what it needs, if it does, it would use them, and if it doesn't it would generate them, using the same workflow.pacta image that it uses for the rest of the process

That would be a reasonable behavior in theory, but in practice, I'm more likely to want to have benchmarks with a read-only mount, so the process would be roughly

jdhoffa commented 7 months ago

I think that makes sense? Let's discuss in a call?