gentzkow / GentzkowLabTemplate

Add inputs.txt to handle internal file dependencies #24

Open lucamlouzada opened 2 days ago

lucamlouzada commented 2 days ago

This issue is part of an effort to implement substantive improvements to the lab template, as discussed in https://github.com/gentzkow/GentzkowLabTemplate/issues/16.

The goal of this issue is to improve how the template handles input files. The approach to be adopted, per the decision in the plans for next steps, is:

I am assigning myself to work on this. I will consider whether the creation of the symlinks should be done in make.sh itself or in an additional script stored in lib/shell. I will also investigate whether there is a way to scan over all scripts and automate the identification of paths.

lucamlouzada commented 2 days ago

I have created the inputs.txt files in 276cb0c. There should also be changes in make.sh. To prevent conflicts with the changes in other branches, I will list the changes here, but will wait to push them until #18 and #20 have been closed.

The changes in make.sh should look as follows:

- # Copy and/or symlink input files to local /input/ directory
- # (Make sure this section is updated to pull in all needed input files!)
+ # Add symlink input files to local /input/ directory
+ # (Make sure inputs.txt is updated to pull in all needed input files!)
  rm -rf "${MAKE_SCRIPT_DIR}/input"
  mkdir -p "${MAKE_SCRIPT_DIR}/input"
- # cp my_source_files "${MAKE_SCRIPT_DIR}/input/"
+ if [[ -f "inputs.txt" ]]; then # check if inputs.txt exists
+     links_created=false
+     while IFS= read -r file_path; do
+         if [[ -n "$file_path" && "$file_path" != \#* ]]; then # skip empty or commented out lines
+             if [[ -f "$file_path" ]]; then # check if the file_path is valid
+                 file_name=$(basename "$file_path")
+                 abs_path=$(realpath "$file_path") # get absolute path
+                 ln -sf "$abs_path" "${MAKE_SCRIPT_DIR}/input/$file_name" # create symlink in the input folder
+                 links_created=true
+             else
+                 echo "Error: $file_path does not exist or is not a valid file path." >&2
+                 false # trigger error handler
+             fi
+         fi
+     done < "inputs.txt"
+     if [[ "$links_created" == true ]]; then
+         echo -e "\nAll input links were created!"
+     else
+         echo -e "\n\033[0;34mNote:\033[0m There were no input links to create."
+     fi
+ else
+     echo -e "\nError: No inputs.txt file found in the module." | tee -a "${LOGFILE}"
+     false # trigger error handler
+ fi

Note that this change will also require updating the template documentation and the instructions on how to run the example scripts.
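
For reference, the loop above could also live in a standalone helper under lib/shell, as considered earlier in this issue. The sketch below is one possible version, not the committed implementation: the function name `link_inputs`, its two arguments, and the suggested file name are hypothetical. It departs from the snippet above in two small ways, both assumptions about what is wanted: it also reads a final line that has no trailing newline, and it builds the absolute path with `cd` and `pwd -P` rather than `realpath`, which is not installed on every system.

```shell
#!/usr/bin/env bash
# Hypothetical helper, e.g. lib/shell/link_inputs.sh (name and location are assumptions).
# Usage: link_inputs <path/to/inputs.txt> <path/to/input_dir>

link_inputs() {
    local inputs_file="$1"   # list of input paths, one per line
    local input_dir="$2"     # directory that receives the symlinks
    local links_created=0
    local file_path abs_path

    if [[ ! -f "$inputs_file" ]]; then
        echo "Error: $inputs_file not found." >&2
        return 1
    fi

    mkdir -p "$input_dir"

    # The '|| [[ -n ... ]]' part keeps a last line that lacks a trailing newline.
    while IFS= read -r file_path || [[ -n "$file_path" ]]; do
        # Skip blank lines and lines that start with '#'.
        [[ -z "$file_path" || "$file_path" == \#* ]] && continue

        if [[ ! -f "$file_path" ]]; then
            echo "Error: $file_path does not exist or is not a regular file." >&2
            return 1
        fi

        # Absolute path of the containing directory plus the file name.
        abs_path="$(cd "$(dirname "$file_path")" && pwd -P)/$(basename "$file_path")"
        ln -sf "$abs_path" "$input_dir/$(basename "$file_path")"
        links_created=$((links_created + 1))
    done < "$inputs_file"

    echo "Created $links_created input link(s)."
}
```

After sourcing the helper, make.sh would call it as `link_inputs "${MAKE_SCRIPT_DIR}/inputs.txt" "${MAKE_SCRIPT_DIR}/input"`. Relative paths listed in inputs.txt are resolved against the current working directory, as in the snippet above.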

lucamlouzada commented 2 days ago

I have also looked into whether file dependencies could be scanned automatically. One alternative is a DAG-based build tool such as SCons, Snakemake, or Nextflow, but adopting one seems like a more complex step. We could also write ad hoc scripts to do the scanning, but that would require some creative combination of regex and searching for functions such as "load" or "read_csv", which does not seem very practical and would increase the complexity of the template without large returns. My sense is that we can stick with manually adding input files for now, either with the inputs.txt approach introduced in this issue or with the previous approach, in which users manually added the files in make.sh with cp my_source_files "${MAKE_SCRIPT_DIR}/input/".
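
To make the ad hoc option concrete, here is a minimal sketch of the kind of regex scan described above. It is illustrative only: the function name `scan_input_paths` is hypothetical, and the pattern covers just the two reader functions mentioned as examples (`read_csv` and `load`) when they are called with a quoted string literal.

```shell
#!/usr/bin/env bash
# Hypothetical scanner: prints the unique quoted paths passed to read_csv(...) or load(...).
# Usage: scan_input_paths <source_dir>

scan_input_paths() {
    local source_dir="$1"

    # grep -o keeps only the matched text, e.g.  read_csv("data/a.csv"
    # '|| true' stops "no matches" (grep exit status 1) from aborting under 'set -e'.
    { grep -rhoE "(read_csv|load)\([\"'][^\"']+[\"']" "$source_dir" || true; } \
        | sed -E "s/^[^\"']*[\"']//; s/[\"']\$//" \
        | sort -u
    # The sed step strips everything up to the opening quote, then the closing quote.
}
```

The printed list could be reviewed by hand and pasted into inputs.txt. The pattern misses paths built from variables or string concatenation, and it also matches unrelated calls such as `download("...")`, which is a small illustration of why this route does not look very practical.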