cwbi-apps / access-request

Repository for tracking access requests for CWBI Apps

File size limit of 100MB at GitHub #31

Closed: willbreitkreutz closed this 6 months ago

willbreitkreutz commented 6 months ago
@willbreitkreutz - It looks like there is a file size limit of 100MB at GitHub. We have a repository "orm_data_change" that won't migrate because there are several files over that size limit. Do you have any guidance on handling repositories with large objects?

Originally posted by @mark-english in https://github.com/cwbi-apps/access-request/issues/22#issuecomment-2088463329
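
For reference, one way to find which blobs in a repository's history exceed the limit is to walk the object database. A minimal sketch, assuming a POSIX shell with git and awk available:

```bash
# List every blob in history larger than GitHub's 100 MB push limit,
# smallest first, with its path.
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob" && $3 > 100*1024*1024 {print $3, $4}' |
  sort -n
```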

willbreitkreutz commented 6 months ago

@mark-english I'm curious what kind of files those would be. Moving this to a separate issue for discussion.

mark-english commented 6 months ago

@willbreitkreutz BLUF: I was able to handle the files during the migration, usually by splitting them into multiple smaller files and then removing the original large file from history using `git filter-branch`.
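
A minimal sketch of that approach, with a hypothetical file name; note that `git filter-branch` rewrites every commit, so the force-push needs to be coordinated with collaborators (newer tooling such as `git filter-repo` does the same job):

```bash
# Split a large CSV by line count so rows stay intact
# (file name and line count are hypothetical).
split -l 500000 survey_points.csv survey_points.part_
git add survey_points.part_* && git commit -m "Split oversized CSV"

# Purge the original file from all branches and tags.
git filter-branch --force --index-filter \
  'git rm --cached --ignore-unmatch survey_points.csv' \
  --prune-empty --tag-name-filter cat -- --all
git push origin --force --all
```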

The large files are not application code. They are files related to user data.

The ORM database is used to manage a lot of spatial data. One function the ORM team provides users is integrating spatial data gathered outside the ORM application into the database. Users provide data as spreadsheets, CSVs, zip files, etc., and as a matter of routine we keep track of what was provided, the database process used to enter/modify the data, and the end result of that process. In some cases the spatial data provided, or the log files generated by the database process, can be rather large. During the migration I came across 5 of these cases.

rthadr commented 6 months ago

The git-lfs plugin takes care of this (and GitHub supports it). There is a `git lfs migrate` command that can patch up history as well.
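
For anyone else hitting this, the usual invocation looks something like the following (the include patterns are illustrative):

```bash
# Rewrite existing history so matching files are stored in LFS,
# then force-push the rewritten branches.
git lfs migrate import --include="*.zip,*.csv" --everything
git push origin --force --all
```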

willbreitkreutz commented 6 months ago

@rthadr so, git-lfs does work, but we should talk; I didn't realize you guys were already using it. The org has a fairly small cap for large file storage, over which it gets billed on top of the included costs. Right now we're already at 3x the cap...

I just want to make sure we're not using GitHub for archiving or backing up data, and that we're focusing on using it for code. Data should go to S3 / Glacier for long-term archival.
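
For example, the AWS CLI can write straight to an archival storage class; the bucket and key names here are placeholders:

```bash
# Archive a data drop to S3 using the Glacier storage class.
aws s3 cp orm_user_data_2024.zip \
  s3://example-archive-bucket/orm/orm_user_data_2024.zip \
  --storage-class GLACIER
```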

If we have larger binary files that do make sense to include in repositories, we just need to identify those needs and track them, since there are cost implications.
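
One way to make those needs explicit is a `.gitattributes` entry via `git lfs track`, which then shows up in code review; the patterns below are just examples:

```bash
# Route new files matching these patterns through LFS; the generated
# .gitattributes records the decision in the repository itself.
git lfs track "*.zip" "*.gdb"
git add .gitattributes
git commit -m "Track large spatial archives with Git LFS"
```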

willbreitkreutz commented 6 months ago

@mark-english I meant to reply sooner; thanks for the update. Glad you were able to get around the issue.