NIEHS / beethoven

BEETHOVEN is: Building an Extensible, rEproducible, Test-driven, Harmonized, Open-source, Versioned, ENsemble model for air quality
https://niehs.github.io/beethoven/

`crew` and `apptainer` based refactoring #361

Open kyle-messier opened 1 month ago

kyle-messier commented 1 month ago

Refactoring Pipeline for crew and apptainer

Notes and Checklist for Updating, Started by @kyle-messier

Design targets for optimal parallelization and updating

Download

Result: raw data downloaded

Updating: Skip based on pattern

  • filename config
  • branching by file
  • pattern: dataset, variable, year, {location?}
  • set_args_download is a good start for a config file
  • [ ] set_args_download output should be same length/size
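
The skip-based-on-pattern idea above can be sketched as follows. This is only an illustration in Python (the pipeline itself is R). The `DownloadSpec` fields, the file-name layout, the `.raw` extension, and the `needs_download` helper are all hypothetical and are not part of `amadeus` or `set_args_download`.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional
import tempfile


@dataclass(frozen=True)
class DownloadSpec:
    """One download branch; every spec carries the same fields (same size)."""
    dataset: str
    variable: str
    year: int
    location: Optional[str] = None  # optional part of the pattern

    def filename(self) -> str:
        # Hypothetical layout following: dataset, variable, year, {location?}
        parts = [self.dataset, self.variable, str(self.year)]
        if self.location:
            parts.append(self.location)
        return "_".join(parts) + ".raw"


def needs_download(spec: DownloadSpec, raw_dir: Path) -> bool:
    """Skip the branch when a file matching the pattern already exists."""
    return not (raw_dir / spec.filename()).exists()


# Placeholder dataset and variable names, for illustration only.
specs = [
    DownloadSpec("datasetA", "var1", 2018),
    DownloadSpec("datasetA", "var1", 2019),
]
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp)
    (raw_dir / specs[0].filename()).touch()  # pretend 2018 is already on disk
    pending = [s.filename() for s in specs if needs_download(s, raw_dir)]
```

Here only the 2019 branch remains in `pending`, because the 2018 file already matches the pattern.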

Process

Result: sf and terra objects for aqs and covariates

  • Process branching can mirror the download branches
  • Merge branches by dataset
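
A rough sketch of the two bullets above, again in Python for illustration only. The branch names and the `process_branch` placeholder are hypothetical. The point is just that each download branch gets exactly one process branch, and the results are then grouped by dataset.

```python
from collections import defaultdict

# Hypothetical branch names following a dataset_variable_year pattern.
download_branches = [
    "datasetA_var1_2018",
    "datasetA_var1_2019",
    "datasetB_var2_2018",
]


def process_branch(name: str) -> dict:
    """Placeholder for a real processing step; returns a labeled result."""
    dataset = name.split("_")[0]
    return {"dataset": dataset, "branch": name}


# One process branch per download branch (the mirror).
processed = [process_branch(b) for b in download_branches]

# Merge: collect the processed branches under their dataset.
grouped = defaultdict(list)
for result in processed:
    grouped[result["dataset"]].append(result["branch"])
merged = dict(grouped)
```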

kyle-messier commented 1 month ago

@mitchellmanware @sigmafelix

kyle-messier commented 1 month ago

@sigmafelix Can you provide some context on how mod06_links_2018_2022.csv was generated?

sigmafelix commented 1 month ago

@kyle-messier

Thank you for the suggestion. I understand this direction will become necessary, since the software environment is getting too difficult to manage just to keep everything working. As far as I recall, this is the third time we are revamping a significant portion of the pipeline. Given that the primary objective at this stage of development is to present a proof of concept, I am unsure how long it will take to resolve everything that comes up during refactoring, and that work will add to the time needed to advance the project.

Concerns aside, I would like to comment on the checklist:

sigmafelix commented 1 month ago

@kyle-messier

According to the download_modis documentation: https://github.com/NIEHS/amadeus/blob/541bd6898f9e9aa8890a39b95ea1268e25977615/R/download.R#L2160-L2161

https://ladsweb.modaps.eosdis.nasa.gov/search/order/4/MOD06_L2--61/[date1]..[date2]/DNB/-130,52,-60,20

We ask users to query MOD06_L2 products on the page linked above, using a date range and a spatial extent, and then to download the resulting CSV file of file links.
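
For illustration, the query URL above can be filled in programmatically. This is a small Python sketch, not something `download_modis` does itself. The `build_query_url` helper is hypothetical, and the ISO `YYYY-MM-DD` date format is an assumption about what the page accepts. The CSV itself is still downloaded manually from the resulting page.

```python
from datetime import date

# Template copied from the URL above; [date1] and [date2] are its placeholders.
TEMPLATE = (
    "https://ladsweb.modaps.eosdis.nasa.gov/search/order/4/"
    "MOD06_L2--61/[date1]..[date2]/DNB/-130,52,-60,20"
)


def build_query_url(start: date, end: date) -> str:
    """Substitute a date range into the template (ISO dates are an assumption)."""
    if start > end:
        raise ValueError("start must not be after end")
    return (
        TEMPLATE.replace("[date1]", start.isoformat())
        .replace("[date2]", end.isoformat())
    )


# Example range chosen only for illustration.
url = build_query_url(date(2018, 1, 1), date(2022, 12, 31))
```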

sigmafelix commented 1 month ago

Refactoring (recoding, actually) idea: crew-based

| Download | Calculate | Model |
| --- | --- | --- |
| Feature1-Period1 | Feature1-Period1 | |
| Feature1-Period2 | Feature1-Period2 | |
| Feature2-Period1 | Feature2-Period1 | |
| Feature2-Period2 | Feature2-Period2 | |
| ... | ... | |
| FeatureP-Period2 | FeatureP-Period2 | |
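
The feature-by-period layout above could be dispatched roughly as follows. This is a Python sketch for illustration only: the thread pool is a stand-in for a pool of crew workers, not the crew API, and the feature names, period labels, and the `download`, `calculate`, and `run_branch` functions are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

features = ["Feature1", "Feature2"]  # hypothetical feature names
periods = ["Period1", "Period2"]     # hypothetical period labels


def download(feature: str, period: str) -> str:
    """Placeholder download step; returns a branch label."""
    return f"{feature}-{period}"


def calculate(downloaded: str) -> str:
    """Placeholder calculate step, fed by the matching download branch."""
    return f"calc({downloaded})"


def run_branch(branch: tuple) -> str:
    """Run download then calculate for one (feature, period) branch."""
    feature, period = branch
    return calculate(download(feature, period))


# One branch per feature-period combination.
branches = list(product(features, periods))

# The thread pool stands in for crew workers; map preserves input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_branch, branches))
```

Each branch is independent of the others, so adding a new period only adds new branches and leaves the existing ones untouched.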

Question and future work

kyle-messier commented 1 month ago

graph TD

  %% Subgraph for initializing
  subgraph initialize
    A1[Define temporal range list]
    A2[Parse temporal range into date format]
    A1 --> A2
  end

  %% Subgraph for Range k=1, Variable p=1
  subgraph branch_1_1
    A2 --> A3A[Download data using amadeus]
    A3A --> D1A[Calculate covariates - process, calc, impute]
    D1A --> E1[Output: S/T covariates]
  end

  %% Subgraph for Range k=1, Variable p=P
  subgraph branch_1_p
    A2 --> A3B[Download data using amadeus]
    A3B --> D2A[Calculate covariates - process, calc, impute]
    D2A --> E2[Output: S/T covariates]
  end

  %% Subgraph for Range k=K, Variable p=1
  subgraph branch_k_1
    A2 --> A3C[Download data using amadeus]
    A3C --> DK_A[Calculate covariates - process, calc, impute]
    DK_A --> EK[Output: S/T covariates]
  end

  %% Subgraph for Range k=K, Variable p=P
  subgraph branch_k_p
    A2 --> A3D[Download data using amadeus]
    A3D --> DP_A[Calculate covariates - process, calc, impute]
    DP_A --> EP[Output: S/T covariates]
  end

  %% Merging all outputs
  F[Merge Dataset]
  E1 --> F
  E2 --> F
  EK --> F
  EP --> F
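
The flow in the graph above (define ranges, parse them into dates, run one branch per range and variable, then merge) can be sketched end to end. This is an illustrative Python sketch, not the pipeline code. The ranges, variable names, and the `parse_range` and `run_branch` helpers are hypothetical, and `run_branch` only counts days in place of a real download and covariate calculation.

```python
from datetime import date

# A1: temporal range list (hypothetical ranges given as ISO strings).
temporal_ranges = [("2018-01-01", "2018-12-31"), ("2019-01-01", "2019-12-31")]
variables = ["var1", "var2"]  # hypothetical variable names


def parse_range(r):
    """A2: parse a (start, end) string pair into date objects."""
    return date.fromisoformat(r[0]), date.fromisoformat(r[1])


def run_branch(k, p, start, end):
    """A3 -> D -> E: placeholder for download plus calculate in one branch."""
    return {"range": k, "variable": p, "days": (end - start).days + 1}


parsed = [parse_range(r) for r in temporal_ranges]

# One branch per (range k, variable p) combination.
outputs = [
    run_branch(k, p, start, end)
    for k, (start, end) in enumerate(parsed, start=1)
    for p in variables
]

# F: merge all branch outputs, here keyed by (range, variable).
merged = {(o["range"], o["variable"]): o["days"] for o in outputs}
```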