broadinstitute / tgg_methods

Repo for miscellaneous methods developed by the methods group that don't fit anywhere else

Add script to compare objects in GCS directories #28

Closed nawatts closed 3 years ago

nawatts commented 3 years ago

Similar to #15, this script compares two GCS "directories" to identify objects that are only present in one or that exist in both but contain different content.

This uses the GCS Python API to avoid the overhead of launching gsutil for every operation. More importantly, it gets objects' MD5 checksums from the list-objects responses instead of making an individual API call for each object.
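As a rough sketch of that approach using the google-cloud-storage package (the function names list_md5s and diff_prefixes are illustrative, not the script's actual names):

from google.cloud import storage


def list_md5s(client, bucket_name, prefix):
    # The MD5 checksum is included in each list_blobs result,
    # so no per-object API calls are needed.
    return {
        blob.name[len(prefix):]: blob.md5_hash
        for blob in client.list_blobs(bucket_name, prefix=prefix)
    }


def diff_prefixes(bucket_name, prefix_a, prefix_b):
    client = storage.Client()
    a = list_md5s(client, bucket_name, prefix_a)
    b = list_md5s(client, bucket_name, prefix_b)
    only_in_first = sorted(set(a) - set(b))
    only_in_second = sorted(set(b) - set(a))
    changed = sorted(path for path in set(a) & set(b) if a[path] != b[path])
    return only_in_first, only_in_second, changed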

Testing this with a copy of the gnomAD v2 constraint data (a Hail Table with ~1,000 partitions) took ~50 minutes with the script in #15 vs ~2 seconds with this script.

Usage: python diff_gcs_directories.py gs://bucket/path/to/directory gs://bucket/path/to/another/directory

To run locally, set GOOGLE_APPLICATION_CREDENTIALS and GOOGLE_CLOUD_PROJECT environment variables.

Resolves broadinstitute/gnomad_production#171.

mike-w-wilson commented 3 years ago

To run locally, set GOOGLE_APPLICATION_CREDENTIALS and GOOGLE_CLOUD_PROJECT environment variables.

Sorry, I missed this bit in the review, but could this be put in the script? A comment, or maybe in the docstrings?

nawatts commented 3 years ago

Is there a reason you use print here instead of creating a logger?

As a user, I'd appreciate some sort of success message. Won't both 1 and 0 just end silently here?

Both of these choices were motivated by making the script work well in combination with other tools. Exiting 0 on success and non-zero on an error is the convention for UNIX programs (https://tldp.org/LDP/abs/html/exit-status.html). A success message can be added with:

python diff_gcs_directories.py gs://bucket/path gs://bucket/other/path && echo "Directories are the same"

Printing to stdout makes it easy to process the output by piping it to other tools. For example, to filter the output to only the paths of changed files:

python diff_gcs_directories.py gs://bucket/path gs://bucket/other/path | grep '^\*' | cut -c 3-

Also, since the script is expected to run quickly, the timestamps provided by logging aren't as useful as they are with a long running process.
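To sketch how the reporting and exit status fit together (the "* " prefix is inferred from the grep example above; the other markers and the report function are illustrative assumptions, not necessarily the script's actual output format):

import sys


def report(only_in_first, only_in_second, changed):
    for path in only_in_first:
        print(f"- {path}")  # assumed marker: object only in the first directory
    for path in only_in_second:
        print(f"+ {path}")  # assumed marker: object only in the second directory
    for path in changed:
        print(f"* {path}")  # changed objects, matching the grep '^\*' example
    # Exit 0 only when the directories match, so the && and if examples work as shown.
    sys.exit(1 if (only_in_first or only_in_second or changed) else 0)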

mike-w-wilson commented 3 years ago

Gotcha, so if we run this on multiple buckets, the parent process can act on the zero or non-zero exit status? The point about logging also makes sense. Thanks for clarifying!

nawatts commented 3 years ago

Gotcha, so if we run this on multiple buckets, the parent process can act on the zero or non-zero exit status?

Yes. A shell script could do something like:

if python diff_gcs_directories.py "$url1" "$url2"; then
    echo "Directories are the same"
else
    echo "Directories are different"
fi

nawatts commented 3 years ago

To run locally, set GOOGLE_APPLICATION_CREDENTIALS and GOOGLE_CLOUD_PROJECT environment variables.

Sorry, I missed this bit in the review, but could this be put in the script? A comment, or maybe in the docstrings?

Turns out the GCS client can be created without a project, so GOOGLE_CLOUD_PROJECT is no longer needed. Added information about application default credentials to the argparse parser's description, so it shows up on python diff_gcs_directories.py --help.
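A rough sketch of that setup (the exact description wording and argument names in the real script may differ):

import argparse

from google.cloud import storage

parser = argparse.ArgumentParser(
    description=(
        "Compare objects in two GCS directories. "
        "Uses application default credentials; authenticate with "
        "`gcloud auth application-default login` or by setting "
        "GOOGLE_APPLICATION_CREDENTIALS."
    )
)
parser.add_argument("directory_url", help="gs:// URL of first directory")
parser.add_argument("other_directory_url", help="gs:// URL of second directory")
args = parser.parse_args()

# Listing objects does not require a project, so the client can be created
# without GOOGLE_CLOUD_PROJECT being set.
client = storage.Client(project=None)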