MAAP-Project / maap-documentation

Apache License 2.0
10 stars 13 forks source link

Elaborate on DPS output folder usage #455

Open wildintellect opened 4 weeks ago

wildintellect commented 4 weeks ago

Turns out passing the output directory location to scripts is trickier than outlined in https://docs.maap-project.org/en/latest/technical_tutorials/dps_tutorial/dps_tutorial_demo.html#Output-folder Q: why do users need to make the output folder, if it's always needed as the correct location for outputs, should it exist already? @sujen1412

Specifically, this does not show how to pass the correct path to scripts, e.g. python and R. Also is the case of R, R is very sensitive to working directories when importing other R files, so the script can't always be run from the default path DPS starts at, which means the output folder is actually up one level from where the script executes.

Here's an R example (We're going to need R examples on par with the python dps_tutorial)

#!/bin/bash
basedir=$( cd "$(dirname "$0")" ; pwd -P )

mkdir -p output
outdir=${PWD}/ouput

country=${1}

#Rscript global_process_part1_2024_MAAP_step123.R $country
cd ${basedir}
conda run --no-capture-output --name gedi_pa_env Rscript global_process_part3_2024_MAAP_step5.r $country outdir
chuckwondo commented 4 weeks ago

@wildintellect, are you sure the code snippet is correct? I think there are at least 2 problems: (1) mkdir created an output dir before the outdir var is defined, and (2) the last line uses outdir which simply translates as the literal string "outdir", so it should change to $outdir (or ${outdir}).

I think this is what you want:

#!/bin/bash
basedir=$( cd "$(dirname "$0")" ; pwd -P )

outdir=${PWD}/output
mkdir -p "${outdir}"

country=${1}

#Rscript global_process_part1_2024_MAAP_step123.R $country
cd ${basedir}
conda run --no-capture-output --name gedi_pa_env Rscript global_process_part3_2024_MAAP_step5.r $country "${outdir}"

I've also surrounded ${outdir} in double quotes in both spots, just to avoid potential problems with space in the path.

wildintellect commented 4 weeks ago

(1) works either way, you're right that setting the var 1st is probably best practices (2) I don't know how that typo crept in , I didn't give it to the user that way

So yes @chuckwondo 's example is technically cleaner

chuckwondo commented 4 weeks ago

(1) works either way, you're right that setting the var 1st is probably best practices

It doesn't work when PWD is not the directory in which the script is sitting.

In other words, these 2 lines are not equivalent when PWD is not the directory containing the script containing these lines:

mkdir -p output
mkdir -p "${PWD}/output"
chuckwondo commented 4 weeks ago

In the context of DPS jobs, we should always use "${PWD}/output" and "${PWD}/input" as the locations of the output and input directories that the DPS system expects us to use.

There is also a typo in the original code in the first comment of this issue (ouput should be output), so here's what should be a fully corrected example, with the addition of an input dir variable, for reference:

#!/bin/bash
basedir=$( cd "$(dirname "$0")" ; pwd -P )

# If your `algorithm_config.yaml` file has any inputs configured as type `file` (rather
# than `positional`), the DPS will automatically download such inputs (which must be
# URLs) and place the downloaded files into the input directory.
indir=${PWD}/input

# All outputs must be written to the output directory. We must create the full path to
# the output directory (including parent directories using the `-p` flag).
outdir=${PWD}/output
mkdir -p "${outdir}"

# Collect positional inputs
country=${1}

#Rscript global_process_part1_2024_MAAP_step123.R ${country}
cd "${basedir}"
conda run --no-capture-output --name gedi_pa_env Rscript global_process_part3_2024_MAAP_step5.r ${country} "${outdir}"