NIEHS / beethoven

BEETHOVEN is: Building an Extensible, rEproducible, Test-driven, Harmonized, Open-source, Versioned, ENsemble model for air quality
https://niehs.github.io/beethoven/

Ecoregion covariates #376

Open mitchellmanware opened 1 month ago

mitchellmanware commented 1 month ago

@sigmafelix @kyle-messier I am receiving errors when calculating ecoregion covariates from within our container.

When I call the amadeus::*_ecoregion functions directly in two separate targets, process_ecoregion performs as expected, but calc_ecoregion returns a pointer-related error:

    targets::tar_target(
      process_ecoregion,
      command = amadeus::process_ecoregion(
        paste0(
          arglist_common$char_input_dir,
          "/ecoregions/data_files",
          "/us_eco_l3_state_boundaries.shp"
        )
      )
    )
    ,
    targets::tar_target(
      calc_ecoregion,
      command = amadeus::calc_ecoregion(
        from = process_ecoregion,
        locs = sf_feat_proc_aqs_sites,
        locs_id = arglist_common$char_siteid
      )
    )
    ,
  ...
● completed target process_ecoregion [10.092 seconds, 491.329 kilobytes]
▶ dispatched target calc_ecoregion
▶ recorded workspace calc_ecoregion
✖ errored target calc_ecoregion
Error: external pointer is not valid

When I use the functions together, I am getting an error related to mismatched rows.

    targets::tar_target(
      dt_feat_calc_ecoregions,
      command = {
        download_ecoregions
        data.table::data.table(
          amadeus::calc_ecoregion(
            from = amadeus::process_ecoregion(
              path = paste0(
                arglist_common$char_input_dir,
                "/ecoregions/data_files",
                "/us_eco_l3_state_boundaries.shp"
              )
            ),
            locs = sf_feat_proc_aqs_sites,
            locs_id = arglist_common$char_siteid
          )
        )
      },
      resources = targets::tar_resources(
        crew = targets::tar_resources_crew(
          controller = "calc_controller"
        )
      ),
      description = "data.table of Ecoregions features (fit)"
    )
...
✖ errored target dt_feat_calc_ecoregions
Error: Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 927, 1, 886

function (from = NULL, locs, locs_id = "site_id", geom = FALSE, 
    ...) 
NULL
Please refer to the argument list and the error message above to rectify the error.

✖ errored pipeline [25.79 seconds]

Could this be related to the fixation on Tukey's bridge, which was previously discussed in https://github.com/NIEHS/beethoven/issues/211?

kyle-messier commented 1 month ago

@mitchellmanware I didn't realize you were getting this external pointer issue. This could be a targets issue with returning a terra object. geotargets was developed to deal with these C++ external pointer issues. If we are trying to return a terra SpatVector, then try replacing tar_target with tar_terra_vect. If the amadeus function must return an sf object, then it may not be related to this.

https://github.com/njtierney/geotargets

kyle-messier commented 1 month ago

@mitchellmanware @sigmafelix Looking at it more closely, the error "Error: external pointer is not valid" is the exact error that geotargets says it is designed to handle. Hopefully replacing tar_target with tar_terra_vect will resolve it.

You may have to update and rerun the .def file. I found that the suggested install for geotargets was failing on the container build, and that the install approach that worked was Rscript -e "devtools::install_github('njtierney/geotargets')"
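
A minimal sketch of the tar_terra_vect swap suggested above. It assumes that geotargets is installed in the container and that amadeus::process_ecoregion returns a terra SpatVector, neither of which is confirmed in this thread. The upstream objects arglist_common and sf_feat_proc_aqs_sites are taken from the targets already shown in the issue.

```r
# Sketch only: a fragment of the target list in _targets.R, not a complete pipeline.
list(
  # ASSUMPTION: process_ecoregion() returns a terra SpatVector, so the target
  # is declared with geotargets::tar_terra_vect() instead of tar_target().
  geotargets::tar_terra_vect(
    process_ecoregion,
    amadeus::process_ecoregion(
      path = file.path(
        arglist_common$char_input_dir,
        "ecoregions", "data_files",
        "us_eco_l3_state_boundaries.shp"
      )
    )
  ),
  # The downstream target is unchanged from the one in the original report.
  targets::tar_target(
    calc_ecoregion,
    command = amadeus::calc_ecoregion(
      from = process_ecoregion,
      locs = sf_feat_proc_aqs_sites,
      locs_id = arglist_common$char_siteid
    )
  )
)
```

If process_ecoregion turns out to return an sf object instead, this swap likely does not apply.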

sigmafelix commented 1 week ago

@kyle-messier @mitchellmanware The pointer errors come from terra objects being exported to parallel workers, which is not allowed in mirai daemons. geotargets' way of resolving this problem is feasible only when each dataset that must be marshaled/unmarshaled is of a manageable size (e.g., several hundred MB). Instead, I think we could try to emulate a "lazy" style of calculation, where we pass the file paths to each parallel worker (i.e., branched targets) and the functions there do the actual processing and calculation. That is, no terra objects are generated or transferred in the outer target; rather, each branch is self-contained, holding the internal terra processing steps and the file paths (character vector/list).
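
A rough sketch of the path-passing idea described above, under stated assumptions. The target name chr_ecoregion_path is hypothetical, and sf_feat_proc_aqs_sites is assumed to be an sf object (not a terra object), so it can safely cross target boundaries. Only a character path is shared between targets; the SpatVector is created and consumed inside a single target.

```r
# Sketch only: a fragment of the target list in _targets.R, not a complete pipeline.
list(
  # Track the shapefile by its path (a character string), not as a terra object.
  targets::tar_target(
    chr_ecoregion_path,  # hypothetical target name
    command = file.path(
      arglist_common$char_input_dir,
      "ecoregions", "data_files",
      "us_eco_l3_state_boundaries.shp"
    ),
    format = "file"
  ),
  targets::tar_target(
    dt_feat_calc_ecoregions,
    command = {
      # The SpatVector exists only inside this worker and is never
      # serialized or returned to the main process.
      sv_ecoregion <- amadeus::process_ecoregion(path = chr_ecoregion_path)
      data.table::data.table(
        amadeus::calc_ecoregion(
          from = sv_ecoregion,
          locs = sf_feat_proc_aqs_sites,
          locs_id = arglist_common$char_siteid
        )
      )
    }
  )
)
```

With several input files, the second target could be branched over the paths with pattern = map(chr_ecoregion_path). This sketch addresses only the pointer error; the "differing number of rows" error reported earlier in the thread is a separate matter that it does not resolve.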

mitchellmanware commented 6 days ago

@sigmafelix That makes sense. I'll give it a try.