NIEHS / amadeus

https://niehs.github.io/amadeus/

`download_data()` to download missing files only #58

Closed mitchellmanware closed 7 months ago

mitchellmanware commented 7 months ago

@sigmafelix In reference to https://github.com/NIEHS/beethoven/blob/45e21d473642c5638273ec68272777c3f3b981cc/inst/targets/pipeline_base_functions.R#L175C1-L175C55:

Edits so that the download functions fetch only missing files are incoming on amadeus branch mm-terraclimate-0325. Completed (https://github.com/NIEHS/amadeus/commit/54395807994c57ff8a5fd969abc3b31629b8e232)

To Do

mitchellmanware commented 7 months ago

Data download functions updated to only download missing/nonexistent files (https://github.com/NIEHS/amadeus/commit/54395807994c57ff8a5fd969abc3b31629b8e232 and https://github.com/NIEHS/amadeus/commit/357f9f66e88c9e607491a71acb2bb1ea13a0f9ae).

Note: In `download_modis()`, the `-m` flag has been removed from the wget commands. To identify nonexistent files, I added the object `download_name` to store the download destination files. It is used with the `-O` flag to specify the download file name, replacing the `-P` flag, which specified only the download folder.

Before

    # Main wget run
    download_command <- paste0(
      "wget -e robots=off -m -np -R .html,.tmp ",
      "-nH --cut-dirs=3 \"",
      download_url,
      "\" --header \"Authorization: Bearer ",
      nasa_earth_data_token,
      "\" -P ",
      directory_to_save,
      "\n"
    )
    #### 15. concatenate and print download commands to "..._wget_commands.txt"
    cat(download_command)
  }

Now

    # Main wget run
    download_command <- paste0(
      "wget -e robots=off -np -R .html,.tmp ",
      "-nH --cut-dirs=3 \"",
      download_url,
      "\" --header \"Authorization: Bearer ",
      nasa_earth_data_token,
      "\" -O ",
      directory_to_save,
      download_name,
      "\n"
    )

    #### filter commands to non-existing files
    download_command <- download_command[
      which(
        !file.exists(download_name)
      )
    ]

    #### 15. concatenate and print download commands to "..._wget_commands.txt"
    #### cat command only if file does not already exist
    cat(download_command)
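
For anyone who wants to try the filtering step in isolation, here is a minimal, self-contained sketch. It is not the code from `download_modis()`: the URLs are placeholders, the token header is omitted, and it assumes `download_name` holds the full destination path (directory plus file name). The point it illustrates is that `file.exists()` should be given the same path that is passed to `-O`.

```r
# Hypothetical inputs: the URLs and the temporary directory are placeholders.
directory_to_save <- paste0(tempdir(), "/")
download_url <- c(
  "https://example.com/data/file_a.hdf",
  "https://example.com/data/file_b.hdf"
)

# Assumption: download_name is the full destination path of each file.
download_name <- paste0(directory_to_save, basename(download_url))

# Simulate an earlier run that already downloaded the first file.
invisible(file.create(download_name[1]))

# One wget command per URL; -O names the destination file explicitly.
download_command <- paste0(
  "wget \"", download_url, "\" -O ", download_name, "\n"
)

# Keep only the commands whose destination file does not exist yet.
download_command <- download_command[!file.exists(download_name)]

# Only the command for file_b.hdf remains.
cat(download_command)
```

Because `paste0()` and `file.exists()` are both vectorized, the logical index lines up one-to-one with the commands, so no `which()` or loop is needed.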

Better explanation from ChatGPT:

"When you combine the -O option (which specifies the output file) with the -r or -p options (which enable recursive downloading), wget will download all content into a single file rather than saving each file separately.

In your command, you're using the -m option, which is equivalent to -r -l inf --no-remove-listing, enabling mirroring. So, when -m is combined with -O, all downloaded content will be placed into the single file specified by -O, which is likely not the behavior you want.

If you intend to download only the specific file specified by the URL, remove the -m option."

sigmafelix commented 7 months ago

@mitchellmanware Thanks! This will help a lot in streamlining the download part of the beethoven pipeline.