chanzuckerberg / single-cell-data-portal

The data portal supporting the submission, exploration, and management of projects and datasets to cellxgene.
MIT License

Clean up the data portal repo (git history + main branch) #5564

Open ebezzi opened 1 year ago

ebezzi commented 1 year ago

Our data portal git repo is much bigger than it needs to be, to the point where cloning it takes several minutes. This is a problem for GHA and for developers' productivity. To solve this issue, we should:

  1. Make sure that no data is checked into the repo. This is a bad practice and needs to be discontinued. A good place to store data files is a private S3 bucket. If GHA needs access to this bucket, add it to single-cell-infra/terraform/modules/hosted-cellxgene/iam.tf
  2. Once those huge files have been removed from the main branch, delete all other existing branches, since they will still contain the large files.
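As a starting point for step 1, here is a minimal sketch of how one might list oversized files in a working copy. The function name `find_large_files` and the 5 MiB default threshold are assumptions for illustration, not something the repo defines. It only inspects the current checkout, so it will not find large blobs that exist solely in git history or on other branches.

```python
import os
from pathlib import Path


def find_large_files(root, min_bytes=5 * 1024 * 1024):
    """Return (size, path) pairs for files under root that are at least
    min_bytes in size, largest first.

    Hypothetical helper: the threshold is an arbitrary choice.
    """
    results = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into git's own object store.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = Path(dirpath) / name
            if path.is_symlink():
                # Skip symlinks so a broken link does not raise on stat().
                continue
            size = path.stat().st_size
            if size >= min_bytes:
                results.append((size, str(path)))
    return sorted(results, reverse=True)


if __name__ == "__main__":
    # Scan the current directory and print candidates for removal.
    for size, path in find_large_files("."):
        print(f"{size / (1024 * 1024):8.1f} MiB  {path}")
```

Run from the repo root, this prints candidate files to move to the S3 bucket.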

Note that, as a mitigation strategy, it is possible to specify --single-branch when cloning the data portal, but this should only be used as an emergency measure.
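For reference, a small sketch of what that mitigation could look like when scripted. The helper name `build_clone_command`, the repo URL, and the destination directory are assumptions for illustration. `--single-branch`, `--branch`, and `--depth` are standard `git clone` options; `--depth` is optional here and additionally truncates history.

```python
import subprocess


def build_clone_command(repo_url, dest, branch="main", depth=None):
    """Build a git clone command that fetches a single branch.

    Hypothetical helper: pass depth (an int) to also limit history.
    """
    cmd = ["git", "clone", "--single-branch", "--branch", branch]
    if depth is not None:
        # Shallow clone: only the most recent `depth` commits are fetched.
        cmd += ["--depth", str(depth)]
    cmd += [repo_url, dest]
    return cmd


if __name__ == "__main__":
    # Assumed URL, derived from the repo name; adjust as needed.
    url = "https://github.com/chanzuckerberg/single-cell-data-portal.git"
    cmd = build_clone_command(url, "single-cell-data-portal")
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)
```

This only avoids downloading the other branches; it does not shrink the history of the main branch itself.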

metakuni commented 8 months ago

@ebezzi / @atolopko-czi: Close this as a duplicate of #5682?

atolopko-czi commented 8 months ago

This was a particular problem, performance-wise, for CI/CD GHAs that were fetching full git histories. I believe they're now only fetching 1 or 2 levels of history, so it shouldn't be a problem for CI now. If it still is, that is a stronger reason to fix this.

Developers' local git repo clones are not checked out "fresh" very frequently. However, I have run into large updates even on a plain "git pull", though maybe that was only after large data files had been updated.