Open jdhoffa opened 7 months ago
Another bit of weirdness... in order to process the indices, technically only workflow.pacta would be needed (the report(s) are not needed)... though I think the suggestion is here because (at the moment) the only time we use index results as a comparison is in reports. That could hypothetically change though, and then it might make sense for workflow.pacta to process a portfolio and whatever other benchmark/comparison portfolios are relevant. Or maybe an in-between workflow, like workflow.pacta.with.comparisons, that would run workflow.pacta on multiple portfolios, put their results in an appropriate place, and could then pass on to workflow.pacta.report or any other workflow that takes "PACTA results" and formats them into some kind of report/output.
Definitely.
In general, I think an appropriate process would be:
reports)
I think workflow.pacta has clear utility in multiple realms, so it makes sense to refactor it out. Indices processing ATM is only useful for report generation, so it makes sense for it to happen/live there, but if we end up needing it elsewhere, it would be a good idea to refactor it :-)
I haven't fully thought this through, but my inclination would be to do something along the lines of storing index benchmarks separate from the rest of the pacta-data, and then accessing them procedurally using a unique ID of some kind (maybe the SHA of the workflow.pacta docker image used)?
which would give us something along the lines of
├── pacta-data
│   ├── 2022Q4_dataset1
│   ├── 2022Q4_dataset2
│   └── 2023Q4_dataset1
└── benchmarks
    ├── index1_a428de44a9059f31a59237a5881c2d2cffa93757d99026156e4ea544577ab7f3
    ├── index1_220611111e8c9bbe242e9dc1367c0fa89eef83f26203ee3f7c3764046e02b248
    └── index2_220611111e8c9bbe242e9dc1367c0fa89eef83f26203ee3f7c3764046e02b248
and then we could grab the relevant benchmark files via something like:
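# list.dirs() has no `pattern` argument, so filter its output instead
benchmark_results <- grep(unique_id, list.dirs("benchmarks", recursive = FALSE), value = TRUE, fixed = TRUE)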
and raise an error if we don't get what we expect to see (correct list of benchmarks)
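As a rough sketch of that lookup plus the error check: the function name `find_benchmarks()`, its arguments, and the `<index>_<unique_id>` directory naming are assumptions drawn from the example layout above, not an existing API.

```r
# Hypothetical helper: locate the benchmark directories for one unique ID
# and fail loudly if any expected index is missing.
find_benchmarks <- function(benchmarks_dir, unique_id, expected_indices) {
  # list.dirs() has no `pattern` argument, so list first and filter after
  dirs <- list.dirs(benchmarks_dir, full.names = TRUE, recursive = FALSE)

  # Assumption: directories are named "<index>_<unique_id>"
  wanted <- paste0(expected_indices, "_", unique_id)
  found <- dirs[basename(dirs) %in% wanted]

  missing <- setdiff(wanted, basename(found))
  if (length(missing) > 0) {
    stop("Missing benchmark results: ", paste(missing, collapse = ", "))
  }

  # Name each path by the index it belongs to
  names(found) <- expected_indices[match(basename(found), wanted)]
  found
}
```

Something like `find_benchmarks("benchmarks", unique_id, c("index1", "index2"))` would then either return one path per index or stop with the list of missing directories.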
In any case I'm still sorting out a good way to extract properties of a docker image from inside a container (not readily available, apparently). A possible alternative to using the SHA of the base image would be to encode some string in a file in workflow.pacta via something along the lines of
RUN git rev-parse HEAD >> /usr/local/etc/workflow.pacta.txt && \
echo "$RANDOM" >> /usr/local/etc/workflow.pacta.txt
and read the unique ID from there.
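Reading that file back from R could look roughly like this. `read_unique_id()` is a hypothetical name, and joining the lines with `-` is an arbitrary choice made for illustration.

```r
# Hypothetical helper: build a unique ID from the file written at image build time.
read_unique_id <- function(path = "/usr/local/etc/workflow.pacta.txt") {
  if (!file.exists(path)) {
    stop("Unique ID file not found: ", path)
  }
  lines <- trimws(readLines(path, warn = FALSE))
  # Drop blank lines, e.g. if one of the build-time commands produced no output
  lines <- lines[nzchar(lines)]
  if (length(lines) == 0) {
    stop("Unique ID file is empty: ", path)
  }
  # Assumption: the ID is all non-empty lines joined with "-"
  paste(lines, collapse = "-")
}
```

One thing worth checking: `$RANDOM` is a bash feature, and a `RUN` line executed by a plain `sh` such as dash may expand it to an empty string, which is why blank lines are filtered out here.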
Definitely open to suggestions on this.
Also, worth noting that preparing indices for workflow.transition.monitor should not be included in this process.
This sounds reasonable! I guess the main thing to note is that the directory benchmarks would only need to be volume mounted into workflow.pacta.report (not workflow.pacta). And so, I guess running workflow.pacta.report would then check if benchmarks contains what it needs: if it does, it would use them, and if it doesn't, it would generate them, using the same workflow.pacta image that it uses for the rest of the process...? Does this make sense?
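A minimal sketch of that check-then-generate behavior, under stated assumptions: `ensure_benchmarks()` and the `generate` callback are hypothetical, and `generate` stands in for however workflow.pacta would actually be run on an index.

```r
# Hypothetical helper: use cached benchmark results if present, otherwise
# generate the missing ones via a caller-supplied function.
ensure_benchmarks <- function(benchmarks_dir, unique_id, expected_indices,
                              generate = NULL) {
  # Assumption: directories are named "<index>_<unique_id>"
  wanted <- paste0(expected_indices, "_", unique_id)
  paths <- file.path(benchmarks_dir, wanted)
  missing <- !dir.exists(paths)

  if (any(missing)) {
    if (is.null(generate)) {
      # No generator supplied (e.g. a read-only mount): fail instead of writing
      stop("Missing benchmark results: ",
           paste(wanted[missing], collapse = ", "))
    }
    for (i in which(missing)) {
      # `generate` is expected to create and fill `output_dir`
      generate(index = expected_indices[i], output_dir = paths[i])
    }
  }

  # Verify that everything is in place before handing the paths back
  stopifnot(all(dir.exists(paths)))
  stats::setNames(paths, expected_indices)
}
```

Leaving `generate` as `NULL` turns this into a pure check, which would fit a setup where the benchmarks directory cannot be written to.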
That would be a reasonable behavior in theory, but in practice, I'm more likely to want to have benchmarks with a read-only mount, so the process would be roughly
workflow.pacta version
workflow.report image
I think that makes sense? Let's discuss in a call?
It would be useful to use the "same" build of workflow.pacta to run both the input portfolio and the input indices necessary for generating the PACTA report.
A critical consideration, flagged by @cjyetman, is that it may not be desirable, from a time-efficiency perspective, to re-run the same indices on every portfolio upload. Caching seems like an appropriate solution there; however, the webApp would then need some form of long-lived "cached" folder that multiple different users could access.
Relates to (there is fascinating discussion there!) https://github.com/RMI-PACTA/workflow.prepare.pacta.indices/issues/46