Closed nawatts closed 3 years ago
To run locally, set GOOGLE_APPLICATION_CREDENTIALS and GOOGLE_CLOUD_PROJECT environment variables.
Sorry missed this bit in the review but could this be put in the script? A comment or maybe in the docstrings?
Is there a reason you use print here instead of creating a logger?
As a user, I'd appreciate some sort of success message. Won't both 1 and 0 just end silently here?
These were both motivated by making the script work in combination with other scripts. Exiting 0 on success and non-zero on an error is conventional for UNIX programs (https://tldp.org/LDP/abs/html/exit-status.html). Adding a success message can be done with:
python diff_gcs_directories.py gs://bucket/path gs://bucket/other/path && echo "Directories are the same"
Printing to stdout makes it easy to process the output by piping it to other tools. For example, to filter the output to only the paths of changed files:
python diff_gcs_directories.py gs://bucket/path gs://bucket/other/path | grep '^\*' | cut -c 3-
Also, since the script is expected to run quickly, the timestamps provided by logging aren't as useful as they are with a long-running process.
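For illustration, the two conventions described above (one line per difference on stdout, and an exit status of 0 or non-zero) could be sketched in Python roughly like this. The function names and the tuple format are hypothetical, and only the "*" marker for changed files is implied by the grep example; the other markers are assumptions, not the script's confirmed output format.

```python
def format_differences(differences):
    """Format (marker, path) tuples as one plain line each.

    The "*" marker for changed files mirrors the grep example above;
    "+" and "-" are assumed markers for objects present on one side only.
    """
    return [f"{marker} {path}" for marker, path in differences]


def exit_status(differences):
    """Return a conventional UNIX exit status: 0 if nothing differs, else 1."""
    return 1 if differences else 0


def report(differences):
    """Print each difference to stdout and return the exit status.

    A script would typically finish with sys.exit(report(differences)).
    """
    for line in format_differences(differences):
        print(line)
    return exit_status(differences)
```

Because the lines carry no timestamps or log-level prefixes, tools such as grep and cut can match on the first character directly.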
Gotcha, so if we run on multiple buckets the parent process can process the 0 or non-zero? That also makes sense with logging. Thanks for clarifying!
Gotcha, so if we run on multiple buckets the parent process can process the 0 or non-zero?
Yes. A shell script could do something like:
if python diff_gcs_directories.py "$url1" "$url2"; then
  echo "Directories are the same"
else
  echo "Directories are different"
fi
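The same idea works from a Python parent process. The sketch below is only an illustration: the helper names and the `command` parameter are hypothetical, and it assumes the script is invoked as `python diff_gcs_directories.py url1 url2` as in the usage examples in this thread.

```python
import subprocess
import sys


def directories_match(url1, url2, command=(sys.executable, "diff_gcs_directories.py")):
    """Run the diff script and interpret its exit status.

    Returns True when the script exits 0. Any non-zero status is treated
    as "not the same", which also covers errors.
    """
    # stdout is discarded here because only the exit status is of interest.
    result = subprocess.run([*command, url1, url2], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def check_pairs(pairs, command=(sys.executable, "diff_gcs_directories.py")):
    """Return the (url1, url2) pairs whose diff did not exit with status 0."""
    return [pair for pair in pairs if not directories_match(*pair, command=command)]
```

Note that, as with the shell version, a non-zero status does not distinguish "directories differ" from "the script failed".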
To run locally, set GOOGLE_APPLICATION_CREDENTIALS and GOOGLE_CLOUD_PROJECT environment variables.
Sorry missed this bit in the review but could this be put in the script? A comment or maybe in the docstrings?
Turns out the GCS client can be created without a project, so GOOGLE_CLOUD_PROJECT is no longer needed. Added information about application default credentials to the argparse parser's description, so it shows up on python diff_gcs_directories.py --help.
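As a rough sketch of what putting the credentials note in the parser description might look like (the description wording and the argument names `url1` and `url2` are assumptions, not the script's actual text):

```python
import argparse


def build_parser():
    """Build a parser whose --help output mentions the credentials setup."""
    parser = argparse.ArgumentParser(
        prog="diff_gcs_directories.py",
        description=(
            "Compare two GCS directories. "
            "Requires application default credentials; to run locally, "
            "set the GOOGLE_APPLICATION_CREDENTIALS environment variable."
        ),
    )
    # Positional argument names are hypothetical.
    parser.add_argument("url1", help="first directory, e.g. gs://bucket/path")
    parser.add_argument("url2", help="second directory, e.g. gs://bucket/other/path")
    return parser
```

argparse prints the description near the top of the `--help` output, so the note is visible without opening the source.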
Similar to #15, this script compares two GCS "directories" to identify objects that are only present in one or that exist in both but contain different content.
This uses the GCS Python API to avoid the overhead of running gsutil for every API call. More importantly, this gets objects' MD5 checksums from the list objects API calls instead of making individual API calls for each object.
Testing this with a copy of the gnomAD v2 constraint data (a Hail Table with ~1,000 partitions) took ~50 minutes with the script in #15 vs ~2 seconds with this script.
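A minimal sketch of the approach described above, not the script's actual code: it assumes `client` is a `google.cloud.storage.Client`, that each prefix ends with "/", and that the helper names are hypothetical. The checksums are read from the objects returned by the listing call, so no additional request is made per object.

```python
def list_md5s(client, bucket_name, prefix):
    """Map each object's path relative to `prefix` to its MD5 checksum.

    `client` is assumed to be a google.cloud.storage.Client, and `prefix`
    is assumed to end with "/" so sibling directories are not matched.
    """
    return {
        blob.name[len(prefix):]: blob.md5_hash
        for blob in client.list_blobs(bucket_name, prefix=prefix)
    }


def diff_listings(left, right):
    """Compare two {relative_path: md5} mappings.

    Returns three sorted lists: paths only in `left`, paths only in
    `right`, and paths present in both with different checksums.
    """
    only_left = sorted(left.keys() - right.keys())
    only_right = sorted(right.keys() - left.keys())
    changed = sorted(p for p in left.keys() & right.keys() if left[p] != right[p])
    return only_left, only_right, changed
```

Keeping the comparison separate from the listing makes it possible to test the diff logic without GCS access.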
Usage:
python diff_gcs_directories.py gs://bucket/path/to/directory gs://bucket/path/to/another/directory
To run locally, set GOOGLE_APPLICATION_CREDENTIALS and GOOGLE_CLOUD_PROJECT environment variables.
Resolves broadinstitute/gnomad_production#171.