The source code requires some datasets to be available locally. Other datasets are downloaded on demand, and cached.
The output consists of R objects, stored in an R ~targets~ cache, and plots, stored in an ~outputs~ folder.
The branch ~f_varied_res~ was used to generate results for the thesis.
Further development on the source code will take place at https://github.com/MathMarEcol/pdyer_aus_bio
This code is published as part of academic research, and I do not intend to keep the source "closed". I will release appropriate licensing information after consulting with my institution.
Once the license is released, you should be able to modify the code to fit your environment and extend the research.
The datasets will be pulled to the working directory, the analysis will be performed there, and then the logs, some datasets, the plots, and the R targets cache will be packed up and copied back to ~ROOT_STORE_DIR/subfoldername~.
If the analysis does not complete, the partial results are still copied back. Subsequent runs will reuse the R targets cache to avoid re-running code that successfully completed and has not changed.
** Getting nix
The blog post https://zameermanji.com/blog/2023/3/26/using-nix-without-root/ provides info about setting up Nix even if you do not have administrator rights on the machine.
In summary:
#+begin_src conf
store = ~/mynixroot
extra-experimental-features = flakes nix-command
ssl-cert-file = /etc/pki/tls/cert.pem
#+end_src
where ~store~ is the location of the nix store, where all software will go.
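Assuming the default per-user config location ~/.config/nix/nix.conf (the store path is the example from above; substitute your own), the settings can be put in place with:

#+begin_src shell
# Write the per-user Nix configuration (store path is illustrative)
mkdir -p "$HOME/.config/nix"
cat > "$HOME/.config/nix/nix.conf" <<'EOF'
store = ~/mynixroot
extra-experimental-features = flakes nix-command
ssl-cert-file = /etc/pki/tls/cert.pem
EOF
#+end_src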
*** Notes on Nix
The store path can easily grow to around 50GB or more, and uses a large number of inodes. ~nix store gc~ will remove unreferenced store paths.
** Datasets and license notes
*** Australian Microbiome Initiative
Data downloaded from https://data.bioplatforms.com/bpa/otu on 2019-07-03.
License depends on sample project, but samples I looked at were CC-BY-4.0-AU.
Login is required.
Amplicon was set to ~XXXXXX_bacteria~ and, under the contextual filter, Environment was set to ~marine~.
Then download OTU and contextual data as CSV.
*** BioORACLE
BioORACLE data are downloaded at runtime and cached. However, make sure that an empty folder is present at ~$ROOT_STORE_DIR/data/bioORACLE~.
The R package ~sdmpredictors~ or ~biooracler~ is used to load the dataset.
License is GPL (version not specified, see https://bio-oracle.org/downloads-to-email.php).
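As a sketch of how the runtime download works with ~sdmpredictors~ (the layer code here is illustrative; the project's own targets choose the actual layers):

#+begin_src R
library(sdmpredictors)
## Cache downloads in the shared data folder (path matches this README's layout)
options(sdmpredictors_datadir = file.path(Sys.getenv("ROOT_STORE_DIR"), "data", "bioORACLE"))
## e.g. mean sea surface temperature; any Bio-ORACLE layer code works
layers <- load_layers("BO_sstmean")
#+end_src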
*** AusCPR
Data is available through IMOS and R package ~planktonr~.
Some data is fetched on demand from ~planktonr~, no further action is needed.
Other data has been preprocessed for this project; please clone https://github.com/MathMarEcol/aus_cpr_for_bioregions into ~$ROOT_STORE_DIR/data/AusCPR/~
AODN prefers CC-BY-4.0
AusCPR is CC-BY-4.0
*** World EEZ v8
Sourced from https://marineregions.org/downloads.php.
License is CC-BY-NC-SA
Place extracted shapefiles into ~$ROOT_STORE_DIR/data/ShapeFiles/World_EEZ_v8/~
Source code assumes shapefiles are named ~World_EEZ_v8_2014_HR~
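The expected layout under ~$ROOT_STORE_DIR/data~ for the manually placed datasets can be created up front, e.g.:

#+begin_src shell
# Create the dataset folders referenced in this README
# (assumption: ROOT_STORE_DIR defaults to $HOME/store if unset)
ROOT_STORE_DIR="${ROOT_STORE_DIR:-$HOME/store}"
mkdir -p "$ROOT_STORE_DIR/data/bioORACLE" \
         "$ROOT_STORE_DIR/data/AusCPR" \
         "$ROOT_STORE_DIR/data/ShapeFiles/World_EEZ_v8"
ls "$ROOT_STORE_DIR/data"
#+end_src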
*** MPA polygons
Sourced from the World Database of Protected Areas (WDPA https://www.protectedplanet.net/country/AUS).
Non-commercial use with attribution required.
Download the .SHP variant.
Note that WDPA splits the dataset up into three separate datasets. The source code assumes each dataset will be extracted and placed into:
Either follow this convention or modify ~./R/functions/get_mpa_polys.R~.
*** Watson Fisheries Data
Published Watson and Tidd https://doi.org/10.25959/5c522cadbea37
CC-BY-4.0 for data
Version 4 is available publicly. V5 is behind a login, and the source code expects some preprocessing.
As I do not currently have permission to share V5 and the preprocessing scripts, functionality related to this dataset has been commented out.
** Directory structure
:PROPERTIES:
:ID: org:09e255e4-a92d-439c-b959-6b998e00880f
:END:
The whole project is assumed to be inside the MathMarEcol QRIScloud collection ~Q1216/pdyer~.
The ~code/~ folder contains ~drake_plan.R~ and the other scripts and code for the project.
The data are all stored in a different QRIScloud collection, ~Q1215~. Different HPC systems have a different folder for the QRIScloud data, but Q1215 and Q1216 are always sibling folders, so relative paths will work, and will be more reliable than hard paths.
Given that HPC code should not be run over the network, I copy the relevant parts of ~Q1215~ and ~Q1216~ into ~30days~ or a similar folder on Awoonga before running ~Rscript drake_plan.R~.
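The staging step can be sketched like this (the scratch location and ~rsync~ sources are illustrative; substitute the real collection paths on your HPC system):

#+begin_src shell
# Sketch: stage a copy onto fast node-local storage before running
SCRATCH="${TMPDIR:-/tmp}/aus_bio_run"
mkdir -p "$SCRATCH"
# rsync -a Q1215/ "$SCRATCH/Q1215/"   # data collection (uncomment with real paths)
# rsync -a Q1216/ "$SCRATCH/Q1216/"   # code collection
echo "$SCRATCH"
#+end_src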
** Update for targets and crew
Crew provides a unified frontend for workers.
There is no longer any need to differentiate between local and cluster execution, or to call a different top-level function depending on whether future, clustermq, or sequential execution is needed. Always call ~tar_make()~ and ensure the ~controller~ tar_option is set appropriately.
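A minimal sketch of wiring crew into targets in ~_targets.R~ (the controller name and worker count are illustrative):

#+begin_src R
library(targets)
library(crew)
## A controller group lets tar_make() route targets to named controllers;
## here there is only one, a local controller with 4 workers
tar_option_set(
  controller = crew_controller_group(
    crew_controller_local(name = "small", workers = 4)
  )
)
## then run the pipeline as usual with tar_make()
#+end_src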
*** Balancing workloads
Each target has a distinct resource requirement.
Some are small and fast, some require lots of memory, and some internally use parallelisation and benefit from having many cores available.
Experience tells me that it is better to compute targets sequentially rather than in parallel if the total runtime is the same. Parallel computation should only be used if there are spare resources.
In practice, this means that branches that internally run in parallel should be given the whole node.
RAM requirements are set per job; 4GB is enough for many small jobs, but bigger jobs need tuning for the dataset and can use hundreds of GB.
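Per-target routing can be expressed with targets' crew resources; a sketch (the controller name ~highmem~ and the functions are illustrative, assuming a controller with that name exists in the controller group):

#+begin_src R
library(targets)
## Route a memory-hungry target to a dedicated controller
tar_target(
  big_model,
  fit_model(big_data),  # hypothetical long-running function
  resources = tar_resources(
    crew = tar_resources_crew(controller = "highmem")
  )
)
#+end_src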
*** Making sure the right controllers are used
One goal is to make the code run in different environments with minimal changes.
Crew helps, but different controllers are needed for different environments, eg. local vs slurm.
I may end up needing to use the configure_parallel function to just list controllers, and use some flag to choose between them.
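One possible shape for that flag-based selection (the environment variable name is an assumption, not something the code currently uses):

#+begin_src R
## Sketch: choose a controller based on an environment flag
controller <- if (Sys.getenv("AUS_BIO_SCHEDULER") == "slurm") {
  crew.cluster::crew_controller_slurm(name = "default")
} else {
  crew::crew_controller_local(name = "default")
}
targets::tar_option_set(controller = controller)
#+end_src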
*** Future framework
Targets will use crew to assign branches to workers.
Some functions can run in parallel, and all of them use the future framework to decide whether parallelism is available.
Ideally, crew would set up future plans for workers that expect multicore operation, but it does not seem to. Instead, each target can set the plan just before calling the function; since the resources are specified in the same place, the relevant information is kept together.
future.callr is probably the most flexible and reliable backend for running within a single node. future.mirai is under development, but locally it behaves much like future.callr.
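Setting the plan inside a target, just before the parallel work, might look like this (worker count and function are illustrative; the count should match the cores requested for the target):

#+begin_src R
library(future)
## Sketch: match the future plan to this target's requested cores
plan(future.callr::callr, workers = 8)
result <- some_parallel_function(data)  # hypothetical future-aware function
#+end_src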
If you really don't have access to slurm or a workload manager:
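then a local crew controller is enough (and with no controller set at all, ~tar_make()~ simply runs targets sequentially). A sketch:

#+begin_src R
library(targets)
## Fallback sketch: run workers as plain local processes, no scheduler needed
tar_option_set(controller = crew::crew_controller_local(workers = 2))
tar_make()
#+end_src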
This work © 2024 by Philip Dyer is licensed under CC BY 4.0