GenericMappingTools / gmt

The Generic Mapping Tools
https://www.generic-mapping-tools.org

Proposal for managing test baseline images using data version control (dvc) #5724

Closed maxrjones closed 11 months ago

maxrjones commented 3 years ago

Proposal for managing test baseline images using data version control (dvc)

This issue proposes a solution to #3470 and a partial solution to #2681 by using data version control to manage the baseline images for testing. @weiji14 led an effort to move PyGMT's tests from git version control to data version control with remotes stored on DAGsHub in https://github.com/GenericMappingTools/pygmt/pull/1036; most of the information here is from Wei Ji's posts for PyGMT (thanks! šŸ™ šŸŽ‰ ).

Motivation for migrating baseline images to dvc

Here's the current breakdown for the GMT repository:

The overall repository size increased by 50% over the past 1.5 years even though the individual directories in the working tree have stayed roughly the same size; the growth is in the git history, which keeps every past version of the rewritten PS files. This supports past developer comments that the repository growth rate due to rewriting PS files is unsustainable.

What is data version control

Data version control (dvc) is an open source tool for managing and versioning datasets and models. It is built on top of Git and uses very similar syntax. Rather than storing bulky images in the repository, dvc stores small .dvc files containing metadata, including the md5 hash of the data file, which allows versioning data files that are kept in remote storage. Options for remote storage include S3, Google Cloud, Azure, an SSH server, and DAGsHub (PyGMT uses DAGsHub).
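For example, a .dvc file for a single tracked image looks roughly like this (the path, hash, and size here are made up for illustration):

outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  size: 102400
  path: frame.ps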

Steps required

(Based on PyGMT, may need some updating)

Initial setup (only needs to be done once for the repository)

Installing DVC for developing GMT

Initialize dvc

dvc init  # creates the .dvcignore file and the .dvc/ folder
rm -r .dvc/plots  # remove the .dvc/plots folder as it won't be used
dvc config core.analytics false  # optionally stop sending anonymous usage data
# git add only the .dvcignore, .dvc/.gitignore and .dvc/config files
git add .dvcignore .dvc/.gitignore .dvc/config
git commit -m "Initialize data version control"

Set up the DVC remote

dvc remote add origin https://dagshub.com/GenericMappingTools/gmt.dvc # updates the .dvc/config file with the remote URL
dvc remote default origin  # set the default dvc remote to 'origin'
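The two commands above can also be combined as dvc remote add -d origin <url>; either way, the result can be verified with:

dvc remote list  # should print 'origin' followed by the DAGsHub URL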

Migrating tests

(based on PyGMT steps, may need updating)

# Sync with the git and dvc remotes
git pull
dvc pull
# Move the baseline image out of git's index and into the dvc-tracked directory
git rm --cached 'test/<test-folder>/<test-image>.ps'
mkdir -p test/baseline/<test-folder>
mv test/<test-folder>/<test-image>.ps test/baseline/<test-folder>/<test-image>.ps
# Generate a hash for the baseline image and stage the *.dvc file in git
dvc add test/baseline/<test-folder>
git add test/baseline/<test-folder>.dvc test/baseline/.gitignore
# Commit changes and push to both the git and dvc remotes
git commit -m "Migrate test to DVC"
git push
dvc push
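As a worked example with made-up names (the real test folders and image names will differ), migrating a single baseline would be:

git rm --cached 'test/psbasemap/frame.ps'
mkdir -p test/baseline/psbasemap
mv test/psbasemap/frame.ps test/baseline/psbasemap/frame.ps
dvc add test/baseline/psbasemap
git add test/baseline/psbasemap.dvc test/baseline/.gitignore
git commit -m "Migrate psbasemap baseline to DVC"
git push && dvc push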

Pull images from DVC remote (for GitHub Actions CI and local testing)

dvc status # should report any files 'not_in_cache'
dvc pull # pull down files from DVC remote cache (fetch + checkout)
cd <build-dir>
ctest

What about the images for documentation?

The test directory is currently much larger than the documentation directory, so migrating the tests will be a large first step that does not require an established solution for the documentation images. Regardless, my opinion is that we should host the examples/tutorials/animations in a separate repository (https://github.com/GenericMappingTools/gmt/pull/5364#issuecomment-906679727).


Are you willing to help implement and maintain this feature? Yes

maxrjones commented 3 years ago

Based on the discussion at the last community meeting, I will start the migration of the baseline images to dvc using DAGsHub for storage.

weiji14 commented 3 years ago

Cool, let me know if you need any help :grinning:

Just a note on storage limits. According to https://dagshub.com/plans, DAGsHub provides up to 10 GB of free space. So I think <200MB from GMT is ok for now (PyGMT probably has <15MB on DAGsHub), but just something to keep in mind when uploading those large PS and video files.

[screenshot of the DAGsHub pricing page]

joa-quim commented 3 years ago

What does it mean (for us)? [screenshot of the storage limit from the DAGsHub pricing page]

maxrjones commented 3 years ago

What does it mean (for us)? [screenshot of the storage limit from the DAGsHub pricing page]

Good question. I just checked on their community forum and that limit only applies to private projects, so it does not mean anything for us.

joa-quim commented 3 years ago

Good, thanks.

maxrjones commented 3 years ago

As an update, I have learned that dvc works best by tracking directories rather than individual files when large numbers of files need to be added (we currently have 779 .ps files). I am going to try out restructuring the tests so that, rather than having .ps files paired with the .sh files in test/**/*.ps, there is a single test/baseline/ directory that can be added with dvc add. I'll test this out in my fork of the gmt repository.

Since the DAGsHub interface supports viewing png files, I am also going to research whether using .png files rather than .ps files will impact performance.
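For reference, once everything lives under a single test/baseline/ directory, directory-level tracking is just one command (a sketch, assuming that layout):

dvc add test/baseline  # writes a single test/baseline.dvc covering the whole tree
git add test/baseline.dvc test/.gitignore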

weiji14 commented 3 years ago

As an update, I have learned that dvc works best by tracking directories rather than individual files when large numbers of files need to be added (we currently have 779 .ps files).

Does tracking a directory mean computing a hash for the whole directory? A bit concerned with what this means if different people are trying to modify different PS files on multiple branches.

Edit: Looking at https://dvc.org/doc/command-reference/add#example-directory, it seems that running dvc add on a directory will indeed produce a single test/baseline/directory.dvc file with a single md5 hash. This directory.dvc file might be a source of multiple merge conflicts.

maxrjones commented 2 years ago

@PaulWessel, for PyGMT we bundle up the test images at release time and include them as an asset for the GitHub and Zenodo releases. Do you think this is desirable for GMT as well?

Benefits:

Downsides:

joa-quim commented 2 years ago

There are 104 MB of files in test\baseline, is this what you are referring to?

maxrjones commented 2 years ago

There are 104 MB of files in test\baseline, is this what you are referring to?

Yes. They would not go in the source tarballs, bundles, or Windows installers; we would just zip up that directory when we do a release and archive it somewhere.

joa-quim commented 2 years ago

Right, backups are never a bad idea. The PS files should compress significantly.

PaulWessel commented 2 years ago

I agree, a good thing to do for self-preservation.

seisman commented 12 months ago

As an update, I have learned that dvc works best by tracking directories rather than individual files when large numbers of files need to be added (we currently have 779 .ps files).

Does tracking a directory mean computing a hash for the whole directory? A bit concerned with what this means if different people are trying to modify different PS files on multiple branches.

Edit: Looking at https://dvc.org/doc/command-reference/add#example-directory, it seems that running dvc add on a directory will indeed produce a single test/baseline/directory.dvc file with a single md5 hash. This directory.dvc file might be a source of multiple merge conflicts.

Tracking directories has caused a lot of trouble for us recently. For example, all the PS files of the 52 examples are DVC-tracked in a single DVC file (i.e., doc/examples/images.dvc). Currently, its content is:

outs:
- md5: 4dd0ad31844cb0b0b451648cda314e2a.dir
  size: 37295153
  nfiles: 53
  path: images
  hash: md5

The troubles are:

So, tracking directories is not a good choice for us. As I understand from @maxrjones's comment https://github.com/GenericMappingTools/gmt/issues/5724#issuecomment-917093670, tracking a large number of files is not efficient, but are 1000 files too many? After reading some upstream DVC issues, it seems they are talking about inefficiency for 10k to millions of files. The number of GMT's PS files will definitely increase, but I don't think we will have more than 2000 files in the next few years. So maybe we should try tracking individual files instead?
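Since dvc add accepts multiple targets, the switch could be as simple as this sketch (untested; dvc remove only stops tracking, it does not delete the images):

dvc remove doc/examples/images.dvc  # stop tracking the directory as a single unit
dvc add doc/examples/images/*.ps    # one small <name>.ps.dvc per image
git add doc/examples/images/*.ps.dvc doc/examples/images/.gitignore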

maxrjones commented 12 months ago

As I understand from @maxrjones's comment https://github.com/GenericMappingTools/gmt/issues/5724#issuecomment-917093670, tracking a large number of files is not efficient, but are 1000 files too many? After reading some upstream DVC issues, it seems they are talking about inefficiency for 10k to millions of files. The number of GMT's PS files will definitely increase, but I don't think we will have more than 2000 files in the next few years. So maybe we should try tracking individual files instead?

As the .dvc files are so small and we can always purge images from the DAGsHub repo, the only real risk here seems to be the amount of time it would take to try this and to go back if necessary. I would guess it would take a couple of hours of work to go from the current structure to tracking individual files, and likely about the same to go back if it turns out to be more of a headache. Seems worth trying IMO, given the recent frustrations.
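For completeness, the rollback direction is equally short (a sketch with the same illustrative paths as above):

dvc remove doc/examples/images/*.ps.dvc  # drop the per-file tracking
dvc add doc/examples/images              # re-track the directory as a single unit
git add doc/examples/images.dvc doc/examples/.gitignore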