RMI-PACTA / workflow.prepare.pacta.indices

This repository is used to run indices through PACTA, and prepare them for transition monitor.
Other
2 stars 0 forks source link

consider mounting data folder as sub-directory of `pacta-data/` #63

Open cjyetman opened 4 months ago

cjyetman commented 4 months ago

The current docker-compose.yml expects that you point to a local pacta-data/ directory which should have a sub-directory named for the desired timestamp/quarter, e.g. pacta-data/2021Q4. Given our current strategy of building PACTA data directories, I think it might be more natural to point directly to that sub-directory, since one might have a directory with numerous sub-directories each containing different builds/versions of PACTA data, but you really only want/need one, and it could be confusing which one is being used.

So instead of this in the docker-compose.yml https://github.com/RMI-PACTA/workflow.prepare.pacta.indices/blob/d93b71c58659f4ebdd894bf358fb3c5b110d01dc/docker-compose.yml#L8-L9

and this in the .env

PACTA_DATA_PATH=PATH/TO/pacta-data
R_CONFIG_ACTIVE=2022Q4

we might have this is the docker-compose.yml

    volumes:
      - ${PACTA_DATA_PATH}:/pacta-data/2022Q4

and this in the .env

PACTA_DATA_PATH=PATH/TO/pacta-data/2022Q4_20240215
R_CONFIG_ACTIVE=2022Q4

The only problem is that the /pacta-data/2022Q4 in the docker-compose.yml would need to adapt to whatever timestamp/quarter you're targeting, and I'm not sure how best to resolve that.

AB#10133

cjyetman commented 4 months ago

Reopening because #87 reveals this is still not adequately fixed. If you have R_CONFIG_ACTIVE=PA2024CH in your .env, that will activate the PA2024CH config, but it will also set the path that the PACTA data is mounted to as /pacta-data/PA2024CH/, based on target: /pacta-data/${R_CONFIG_ACTIVE} in the docker-compose.yml

jdhoffa commented 4 months ago

Hmm not ideal. I will open a hotfix to get it working again, and open an issue to track progress on designing a more appropriate solution.

jdhoffa commented 4 months ago

Ah so it turns out that this issue is already the "follow-up issue" i was going to open haha. Totally agree with the strategy proposed here.

(I think @AlexAxthelm closed this issue before I had a chance to read it XD)

AlexAxthelm commented 4 months ago

Hmm. I remember thinking when I saw that R_CONFIG_ACTIVE in a path that it would probably give us grief at some point. wasn't expecting it within a week.

jdhoffa commented 4 months ago

😂 for real I don't think such a big deal to solve though tbh

cjyetman commented 4 months ago

I think the logic is:

jdhoffa commented 4 months ago

The issue there is that it conflicts with the assumption that config.yml should set that information somehow

(e.g. it would be possible in that situation that PACTA_DATA_TIMESTAMP point to 2023Q4, R_CONFIG_ACTIVE points to PA2024CH BUT in config.yml that PA2024CH points to 2024Q4 or something like that)

seems like an unnecessary maintenance burden, so I am going to think it through more cautiously.

cjyetman commented 4 months ago

yeah, I guess that's what I meant... the PACTA data timestamp is necessary in the docker-compose step, and since that takes input from env vars not the config, the PACTA data timestamp kinda needs to be defined in the env vars not the config

jdhoffa commented 4 months ago

I think one of the tricky parts is that workflow.transition.monitor expects a directory called 2023Q4 but pacta-data directories are now named e.g. 2023Q4_20240229T200630Z

So already, I need to mount in the contents of a directory into a newly named directory. This is why I can't just mount 2023Q4_20240229T200630Z into pacta-data which I think would be the most preferable approach.

Thoughts?

cjyetman commented 4 months ago

I think one of the tricky parts is that workflow.transition.monitor expects a directory called 2023Q4 but pacta-data directories are now named e.g. 2023Q4_20240229T200630Z

yes, agree 💯, that is the crux of it

Though the additional thing that plays a role is, you may need the mounted directory to look like pacta-data/2022Q4 or pacta-data/2021Q4 depending on what quarter you run. That's a holdover from the olden days when we allowed a TM Docker image to be able to run multiple timestamps/quarters. Generally speaking, we're not doing that anymore, but the capability is still there. I think in the future, e.g. with workflow.pacta, that might not be an issue anymore, because I designed it to have the data outside the Docker image, so it can be mounted where/however you want and the input config determines where the code should look for it, e.g there's no strong restriction like pacta-data/2023Q4, it could be /2023Q4_20240229T200630Z as long as the config has "data_dir": "/2023Q4_20240229T200630Z"

jdhoffa commented 4 months ago

Yup totally. I think all of this points to prioritizing superseding it with workflow.pacta (i know you both agree, just saying it aloud).

I am tempted to not work much on this issue until after that is done/ after CH COP is delivered tbh (I don't want to jump through a bunch of hoops with an odd architecture if we have the opportunity to just update the architecture instead)