
#+TITLE: Project for finding a Bioregionalisation around Australia

Copyright 2017-2024 Philip Dyer

SPDX-License-Identifier: CC-BY-4.0

The source code requires some datasets to be available locally. Other datasets are downloaded on demand, and cached.

The output consists of R objects, stored in an R ~targets~ cache, and plots, stored in an ~outputs~ folder.
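Cached objects can be inspected with the ~targets~ package once a run has finished. A minimal sketch; ~some_target~ is a placeholder, the real target names are defined in the pipeline:

#+begin_src R
## Run from the project's R directory, where the targets cache (_targets/) lives.
targets::tar_meta(fields = c("name", "seconds", "bytes"))  # overview of cached targets
result <- targets::tar_read(some_target)                   # load one cached object (placeholder name)
#+end_src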

The branch ~f_varied_res~ was used to generate results for the thesis.

Further development on the source code will take place at https://github.com/MathMarEcol/pdyer_aus_bio

This code is published as part of academic research. I do not intend to keep the source "closed"; I will release appropriate licensing information after consulting with my institution.

Once the license is released, you should be able to modify the code to fit your environment and extend the research.

  1. Get access to a Slurm workload manager on a Linux system, or modify the code to use another scheduler. The code currently assumes Slurm, which many HPC systems provide. You can set up Slurm on a local computer, but how to do that is beyond the scope of this document.
  2. Make sure the ~nix~ package manager is on the path for all compute workers.
  3. Set up folders
    1. Set up the long-term storage location. This will be ~ROOT_STORE_DIR~. Outputs, datasets, logs, and caches will be zipped and stored at this location.
    2. Prepare the computing scratch location. Often this will be dynamically generated by the workload manager.
    3. Make sure ~$TMPDIR~ on the workers has a lot of space available.
  4. Access datasets and put them in the appropriate folder.
    1. All datasets are stored in subfolders of ~$ROOT_STORE_DIR/data~
  5. Some modifications to the code will be needed
    1. Create a new HName entry in ~./shell/aus_bio_submit.sh~: follow the existing examples, make sure all env vars are defined, and end with a call to ~sbatch aus_bio_batch.sh~.
    2. Create a new Host entry in ~./R/functions/configure_parallel.R~: follow the existing examples and make sure every worker type is defined. A sketch of such an entry is shown after this list.
    3. These are also the places to add support for other workload managers.
  6. Run ~./shell/aus_bio_submit.sh f_varied_res slurmacctstring ROOT_STORE_DIR_subfoldername~
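For the ~configure_parallel.R~ step above, a new Host entry typically maps the machine's hostname to a set of crew controllers. The sketch below is hypothetical: the hostname, controller names, and worker counts are placeholders, not the project's actual settings.

#+begin_src R
## Hypothetical sketch of a Host entry; the real ./R/functions/configure_parallel.R
## differs, and the hostname and worker counts below are placeholders.
configure_parallel_example <- function() {
  host <- Sys.info()[["nodename"]]
  switch(host,
    "my-new-hpc-login" = list(   # placeholder hostname
      small  = crew.cluster::crew_controller_slurm(name = "small",  workers = 20),
      bigmem = crew.cluster::crew_controller_slurm(name = "bigmem", workers = 2)
    ),
    ## Default case: a local machine without Slurm
    list(
      small  = crew::crew_controller_local(name = "small",  workers = 4),
      bigmem = crew::crew_controller_local(name = "bigmem", workers = 1)
    )
  )
}
#+end_src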

The datasets will be pulled to the working directory, the analysis will be performed in the working directory, then logs, some datasets, plots, and the R targets cache will be packed up and copied back to ~ROOT_STORE_DIR/subfoldername~.

If the analysis does not complete, the partial results will be copied back. Subsequent runs will reuse the R targets cache to avoid re-running code that successfully completed and has not changed.
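Before resubmitting, the cache can be queried to see what would be re-run; a minimal check, assuming the cache from a previous run is present in the working directory:

#+begin_src R
## Run from the directory holding the targets cache (_targets/).
targets::tar_outdated()   # targets that are missing or invalidated and will re-run
targets::tar_progress()   # per-target status recorded during the last run
#+end_src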

** Getting nix

The blog post https://zameermanji.com/blog/2023/3/26/using-nix-without-root/ provides info about setting up Nix even if you do not have administrator rights on the machine.

In summary:

  1. ~curl -L https://hydra.nixos.org/job/nix/maintenance-2.20/buildStatic.x86_64-linux/latest/download-by-type/file/binary-dist > nix~
  2. Put the downloaded ~nix~ binary on the path
    1. Some HPC systems have a ~bin~ folder in each user's home directory that can be used to add binaries to the path.
  3. Add and edit ~~/.config/nix/nix.conf~. The important settings are:

    #+begin_src conf
    store = ~/mynixroot
    extra-experimental-features = flakes nix-command
    ssl-cert-file = /etc/pki/tls/cert.pem
    #+end_src

    where ~store~ is the location of the nix store, where all software will go.

*** Notes on Nix

The store path can easily end up at 50 GB or more, and uses a large number of inodes. ~nix store gc~ will remove excess files.

** Datasets and license notes

*** Australian Microbiome Initiative

Data downloaded from https://data.bioplatforms.com/bpa/otu on 2019-07-03.

License depends on the sample project, but the samples I looked at were CC-BY-4.0-AU.

Login is required.

Amplicon was set to ~XXXXXX_bacteria~ and, under the contextual filter, Environment was set to ~marine~.

Then download OTU and contextual data as CSV.

*** BioORACLE

BioORACLE data are downloaded at runtime and cached. However, make sure that an empty folder is present at ~$ROOT_STORE_DIR/data/bioORACLE~.

The R package ~sdmpredictors~ or ~biooracler~ is used to load the dataset.
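A minimal sketch of fetching a Bio-ORACLE layer with ~sdmpredictors~ into the expected cache folder; the layer code is illustrative and the pipeline's own wrapper may request different layers:

#+begin_src R
library(sdmpredictors)

## Point the sdmpredictors cache at the folder the pipeline expects.
options(sdmpredictors_datadir = file.path(Sys.getenv("ROOT_STORE_DIR"),
                                          "data", "bioORACLE"))

layers <- list_layers(datasets = "Bio-ORACLE")  # browse available layers
sst    <- load_layers("BO_sstmean")             # download and cache one layer (illustrative code)
#+end_src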

License is GPL (version not specified, see https://bio-oracle.org/downloads-to-email.php).

*** AusCPR

Data are available through IMOS and the R package ~planktonr~.

Some data are fetched on demand via ~planktonr~; no further action is needed.

Other data have been preprocessed for this project; please clone https://github.com/MathMarEcol/aus_cpr_for_bioregions into ~$ROOT_STORE_DIR/data/AusCPR/~.

AODN prefers CC-BY-4.0

AusCPR is CC-BY-4.0

*** World EEZ v8

Sourced from https://marineregions.org/downloads.php.

License is CC-BY-NC-SA

Place extracted shapefiles into ~$ROOT_STORE_DIR/data/ShapeFiles/World_EEZ_v8/~

The source code assumes the shapefiles are named ~World_EEZ_v8_2014_HR~.
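A quick way to confirm the shapefile is in place and readable; a sketch only, the pipeline's own loading code may differ:

#+begin_src R
library(sf)

eez_path <- file.path(Sys.getenv("ROOT_STORE_DIR"),
                      "data", "ShapeFiles", "World_EEZ_v8",
                      "World_EEZ_v8_2014_HR.shp")
eez <- st_read(eez_path)
plot(st_geometry(eez))
#+end_src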

*** MPA polygons

Sourced from the World Database of Protected Areas (WDPA https://www.protectedplanet.net/country/AUS).

Non-commercial use with attribution required.

Download the .SHP variant.

Note that WDPA splits the dataset into three separate downloads. The source code assumes each one will be extracted into its own folder; see ~./R/functions/get_mpa_polys.R~ for the expected paths.

Either follow that convention or modify ~./R/functions/get_mpa_polys.R~.
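A hypothetical sketch of combining the three WDPA downloads with ~sf~; the folder name and file pattern below are placeholders, not the paths ~get_mpa_polys.R~ expects:

#+begin_src R
library(sf)

wdpa_dir   <- file.path(Sys.getenv("ROOT_STORE_DIR"), "data", "WDPA")  # placeholder folder
wdpa_parts <- list.files(wdpa_dir, pattern = "polygons\\.shp$",        # assumed file pattern
                         recursive = TRUE, full.names = TRUE)
mpa <- do.call(rbind, lapply(wdpa_parts, st_read))                     # bind the three parts
#+end_src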

*** Watson Fisheries Data

Published by Watson and Tidd: https://doi.org/10.25959/5c522cadbea37

CC-BY-4.0 for data.

Version 4 is available publicly. V5 is behind a login, and the source code expects some preprocessing.

As I do not currently have permission to share V5 and the preprocessing scripts, functionality related to this dataset has been commented out.

** Directory structure
:PROPERTIES:
:ID: org:09e255e4-a92d-439c-b959-6b998e00880f
:END:

The whole project is assumed to be inside the MathMarEcol QRIScloud collection ~Q1216/pdyer~.

The ~code/~ folder contains ~drake_plan.R~ and other scripts and code for the project.

The data are all stored in a different QRIScloud collection, ~Q1215~. Different HPC systems have a different folder for the QRIScloud data, but Q1215 and Q1216 are always sibling folders, so relative paths will work and are more reliable than hard-coded paths.

Given that HPC code should not be run over the network, I copy the relevant parts of ~Q1215~ and ~Q1216~ into ~30days~ or something similar on Awoonga before running ~Rscript drake_plan.R~.

** Update for targets and crew

Crew provides a unified frontend for workers.

There is no longer any need to differentiate between local and cluster execution, or to call a different top-level function depending on whether future, clustermq, or sequential execution is needed. Always call ~tar_make()~ and ensure the ~controller~ tar_option is set appropriately.
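A minimal sketch of registering a crew controller with targets (not the project's actual ~configure_parallel.R~; names and worker counts are illustrative):

#+begin_src R
library(targets)
library(crew)

## Local workers; on a cluster a crew.cluster controller can be used instead,
## e.g. crew.cluster::crew_controller_slurm(name = "small", workers = 20).
tar_option_set(
  controller = crew_controller_local(name = "small", workers = 4)
)
#+end_src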

*** Balancing workloads

Each target has a distinct resource requirement.

Some are small and fast, some require lots of memory, and some use parallelisation internally and benefit from having many cores available.

Experience tells me that it is better to compute targets sequentially rather than in parallel if the total runtime is the same. Parallel computation should only be used if there are spare resources.

In practice, this means that branches that internally run in parallel should be given the whole node.

RAM requirements are set per job; 4 GB is enough for many small jobs. Bigger jobs need tuning according to the dataset and can use hundreds of GB.
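A sketch of routing targets to differently sized workers by naming controllers; the controller names, worker counts, and memory values are illustrative, and ~heavy_function~ is a hypothetical stand-in:

#+begin_src R
library(targets)
library(crew.cluster)

small  <- crew_controller_slurm(name = "small",  workers = 20,
                                slurm_memory_gigabytes_per_cpu = 4)
bigmem <- crew_controller_slurm(name = "bigmem", workers = 2,
                                slurm_memory_gigabytes_per_cpu = 100)

tar_option_set(controller = crew::crew_controller_group(small, bigmem))

## Inside the _targets.R target list, a memory-hungry target asks for "bigmem" by name:
tar_target(
  big_result,
  heavy_function(big_data),  # hypothetical function and data
  resources = tar_resources(crew = tar_resources_crew(controller = "bigmem"))
)
#+end_src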

*** Making sure the right controllers are used

One goal is to make the code run in different environments with minimal changes.

Crew helps, but different controllers are needed for different environments, e.g. local vs Slurm.

I may end up using the ~configure_parallel~ function to simply list controllers, and a flag to choose between them.
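A hypothetical sketch of that idea; ~AUS_BIO_SCHEDULER~ is an invented flag name, not something the current code reads:

#+begin_src R
choose_controllers <- function(flag = Sys.getenv("AUS_BIO_SCHEDULER", "local")) {
  switch(flag,
    slurm = crew::crew_controller_group(
      crew.cluster::crew_controller_slurm(name = "small", workers = 20)
    ),
    local = crew::crew_controller_group(
      crew::crew_controller_local(name = "small", workers = 4)
    )
  )
}

targets::tar_option_set(controller = choose_controllers())
#+end_src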

*** Future framework

Targets will use crew to assign branches to workers.

Some functions can run in parallel internally, but all of them use the future framework to decide whether that is possible.

crew might be able to set up future plans for workers that expect multicore operations, but it doesn't seem to. Instead, each target could set the plan just before calling the function; since the resources are specified in the same place, the relevant information stays together.

future.callr is probably the most flexible and reliable for running within a single node. future.mirai is under development, but locally it behaves largely like future.callr.
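A sketch of setting the plan inside a target's command, right before calling a function that parallelises internally through the future framework; ~run_internal_parallel~ and ~input_data~ are hypothetical placeholders:

#+begin_src R
library(targets)

## Inside the _targets.R target list:
tar_target(
  parallel_result,
  {
    future::plan(future.callr::callr, workers = 8)
    run_internal_parallel(input_data)  # hypothetical internally parallel function
  }
)
#+end_src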

If you really don't have access to slurm or a workload manager:

  1. ~git clone -b f_varied_res --single-branch https://github.com/MathMarEcol/pdyer_aus_bio.git ./code~
  2. Copy all datasets into subfolders of ~./code/R/data~; see ~./shell/aus_bio_control.sh~ for the appropriate folder names
  3. From ~./code/R~, call ~R --vanilla -e "targets::tar_make(reporter = 'verbose_positives')"~
    1. To avoid issues with R package mismatches, put nix on your path and call ~NIX_GL_PREFIX="nixglhost -- "; nix develop github:PhDyellow/nix_r_dev_shell/${R_SHELL_REV}#devShells."x86_64-linux".r-shell -c $NIX_GL_PREFIX R --vanilla -e "targets::tar_make(reporter = 'verbose_positives')"~
    2. Leave out ~NIX_GL_PREFIX~ if you are not using a GPU or are on NixOS. If not using a GPU, make sure ~TENSOR_DEVICE~ is not set to ~CUDA~ in ~./code/R/functions/configure_parallel.R~

This work © 2024 by Philip Dyer is licensed under CC BY 4.0