delph-in / erg

English Resource Grammar
MIT License

migrating from SVN to GitHub #40

Open goodmami opened 1 year ago

goodmami commented 1 year ago

This issue concerns difficulties in importing the ERG from SVN to GitHub. See comments on this gist for some context. One question from that thread:

> [...] is GitHub's SVN importer not sufficient?

I can now answer that:

This repository is too large.

This might also be an issue with a manually converted repository. If so, we might need to consider storing big things like profiles and compiled .grm or .dat files in a separate repo.

arademaker commented 1 year ago

Yes, see my comments in the gist.

oepen commented 1 year ago

hiya!

importing the ERG from SVN into GitHub is no small project, i imagine. ERG history goes back to around 1994, and there has been a long tradition of storing large binary files interspersed with the source files (owing in part to its centralized design, SVN works fairly well with binary files).

i imagine some repository surgery and retroactive refactoring may be called for. if it helps, i could probably make available an SVN dump file (filtered to just include everything below the ERG directory). but for that to make sense, i think we would first have to declare the ERG in SVN read-only, i.e. establish agreement with dan that there are no pending commits and that all future development will be against GitHub.
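for concreteness, that filtering step would presumably look something like the following (the repository path here is a guess on my part):

```sh
# dump the complete SVN repository (run on the host where it lives)
svnadmin dump /path/to/svn/repository > full.dump

# keep only the history below the ERG directory, dropping now-empty
# revisions and renumbering the rest
svndumpfilter include erg --drop-empty-revs --renumber-revs \
    < full.dump > erg.dump
```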

arademaker commented 1 year ago

I have finished the first part of the migration.

  1. Using git svn, I cloned the SVN repo.
  2. I updated the tags and branches; the branches were later deleted (after @danflick confirmed they are not needed).
  3. This repo was updated with git push --all --force.

I am attaching the script I used, the logs with the steps, commands, and outputs, the old README file, and the references I followed.

transfer.zip

In the README.org I enumerated the next steps.

Keep in mind that we ignored the files matching:

`\.mem$|\.grm$|edge$|result$|\.gz$|\.dat$`

Next we need:

1. The profiles will be moved to a separate repository. Dan agreed
   that it would be better to manually go over the SVN commits, take
   the specific versions of each release, and construct a git
   repository that recreates the important snapshots over the history.

2. We need to revise the tags and make releases in the repo to reflect
   the ERG's history. Tags are currently pointing to commits disconnected
   from the branches. See
   https://github.com/delph-in/erg/releases/tag/2018 for example
   (click on the commit hash).

3. We need to attach to each release the big files that we didn't want
   to keep under version control (the .mem files, i.e. the maxent
   models); a sketch of how that could look follows below.
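
With the GitHub CLI, attaching those artifacts could look roughly like this (tag and file names are only illustrative):

```sh
# Create a release for an existing tag, then attach the big artifacts
gh release create 2018 --title "ERG 2018" --notes "ERG 2018 release"
gh release upload 2018 redwoods.mem erg-2018.dat.gz
```
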
goodmami commented 1 year ago

Thanks, @arademaker. Rather than pruning out just the larger edge and result files from the profiles, which leaves unusable profile artifacts in the repo, I had assumed you would prune out the entire tsdb/ subdirectory and make a separate repo for it. If those files were pruned from the beginning of the history, this repo would be significantly smaller. I understand that this may be easier said than done, however.
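If that route is taken, git-filter-repo is one tool that can do the surgery; a sketch, with each command run on its own fresh clone:

```sh
# Fresh clone #1: drop everything under tsdb/ from the entire history
git filter-repo --invert-paths --path tsdb/

# Fresh clone #2: keep only tsdb/, to seed the separate profiles repo
git filter-repo --path tsdb/
```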

goodmami commented 9 months ago

> Next we need:
>
> 1. The profiles will be moved to a separate repository. Dan agreed that it would be better to manually go over the SVN commits, take the specific versions of each release, and construct a git repository that recreates the important snapshots over the history.

@arademaker, do we have the [incr tsdb()] profiles available anywhere?

arademaker commented 9 months ago

At the beginning of the year, @danflick and I discussed the issue with the profiles. I do not remember now what his final decision was, but one approach I suggested was to have a separate repo for them.

arademaker commented 9 months ago

Sorry, what I wrote above is precisely what I reported in the previous comment. I don't know the current status; @danflick left Brazil with a complete step-by-step plan to finish the migration, but he needs time to revise the data before the final migration.

danflick commented 9 months ago

I have the [incr tsdb()] profiles for each of the releases from the past 15 years, and would appreciate guidance on how best to organize those files on GitHub to enable convenient packaging of the releases, including the most recent 2023 version.

goodmami commented 9 months ago

@danflick, some suggestions:

For this erg repository, I would also just remove the whole subtree under tsdb/gold/. Currently the profiles are all there except for the large files, which is confusing. And, similar to the last two points above, you can make ERG releases on GitHub and use CI scripts to build and attach the .grm, .dat, and any other large files as assets.

danflick commented 1 month ago

Thanks, @goodmami, for the offer to help in using CI scripts to attach assets to releases. For the 2023 release that I just put together, I have stayed with the gold profiles in compressed form and still in erg/tsdb/gold, but those profiles are once again complete. There are so many changes in those profiles from release to release that using commit deltas on the uncompressed files doesn't seem useful, and it's much more convenient for my workflow to keep them where they are, and compressed. I hope they won't be too annoying.

I have stored the large redwoods.mem file using LFS, and it is now included in the 2023 release. I don't yet see how to package the .dat or .grm files, since they are larger than 100M, but once someone has obtained the source, it's just a one-line command to produce those two files. I also think I would rather present the .dat and .grm files (and eventually a combined LKB-FOS+ERG binary) as downloads that can be obtained separately, so people who just want to use the compiled grammar can get it in one of those three forms without also getting the source, if that's possible. I'd be glad for advice on how to do that.
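
For anyone following along, those one-line commands are along these lines (the file names here are assumptions, not necessarily the exact ones in this repository):

```sh
# Compile the grammar image for ACE (produces erg.dat)
ace -g ace/config.tdl -G erg.dat

# Compile the grammar image for PET with flop (produces english.grm)
flop english.tdl
```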

goodmami commented 1 month ago

@danflick That's ok for the compressed gold profiles. As long as you only update them in GitHub for releases and not regularly in between releases, it probably won't bloat the repository size too much.

> I have stored the large redwoods.mem file using LFS, and it is now included in the 2023 release.

OK, this is good. It might be worth providing some documentation (maybe in a revised README?) on how to retrieve this file.
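Such documentation might amount to something like this (a sketch, assuming a standard Git LFS setup):

```sh
# One-time setup: install the Git LFS hooks
git lfs install

# A fresh clone then fetches LFS-tracked files (e.g. redwoods.mem)
# automatically; an existing clone can pull them explicitly
git clone https://github.com/delph-in/erg.git
cd erg
git lfs pull
```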

> I also think I would rather present the .dat and .grm files (and eventually a combined LKB-FOS+ERG binary) as downloads that can be obtained separately, so people who just want to use the compiled grammar can get it in one of those three forms without also getting the source, if that's possible. I'd be glad for advice on how to do that.

I've submitted #51 to add a GitHub Actions CI script that will compile the grammar from a tagged release and upload the .dat file to that release. Is this what you are looking for?

arademaker commented 1 month ago

> As long as you only update them in GitHub for releases and not regularly in between releases, it probably won't bloat the repository size too much.

IMHO, the way we work with the profiles may need more thought. Of course, @danflick is the ultimate person to decide what workflow works best for him, but I would encourage a setup with frequent commits. That also helps preserve data if @danflick's laptop ever has issues.

Regarding the compiled files, adding them to the releases definitely makes much more sense, and @goodmami's solution was certainly excellent.

goodmami commented 1 month ago

> But I would encourage a setup with frequent commits.

I don't think there was any misunderstanding, but to be clear, I would also encourage frequent, atomic commits for changes to TDL files, configs, etc. I was only suggesting that changes to large binary files, such as the gold profiles, be saved for release commits.

arademaker commented 1 month ago

> I was only suggesting that changes to large binary files, such as the gold profiles, be saved for release commits.

I got it, and I agree. The question is how easy and safe it is for @danflick to use this workflow. He will have to be careful not to include changes to the big files in every commit, but between commits, if he changes these big files and something happens to his machine, he will lose data, right? I am unsure of the best solution; I am just trying to make us think through possible problems.

goodmami commented 1 month ago

> The question is how easy and safe it is for @danflick to use this workflow.

That's up to Dan, but I'm not sure it's very different from how he usually works. And it's not the case that all profile changes have to go in the same commit; each profile could be updated in its own commit. It's just that any change to a binary file causes a copy of the whole file to be stored in the history, so frequent, small changes to the same file could become a problem.

Anyway, I don't want to overcomplicate things. I'm just trying to avoid hitting a repository size limit.
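
For keeping an eye on that, standard Git plumbing gives a rough picture; a sketch:

```sh
# Total size of the local object store
git count-objects -vH

# The ten largest blobs anywhere in history, to spot offending binaries
git rev-list --objects --all \
    | git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' \
    | awk '$1 == "blob"' | sort -k2 -nr | head -n 10
```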

goodmami commented 1 month ago

@danflick Sure enough, there was a bug in the script in #51 which wasn't caught until I merged it into main. I committed a fix and pushed it directly to main (aside: we can turn on branch protection rules if you want to force all commits to go through a PR before merging into the main branch).

The script ran and uploaded the compressed .dat file to the release. See here: https://github.com/delph-in/erg/releases/tag/2023

The screenshot below also shows how to run the script with the order of clicks numbered in purple.

[Screenshot: the gh-release-action workflow page, with the clicks to run it numbered in purple]

danflick commented 1 month ago

The result looks good, and the workflow screenshot will be helpful for doing the next (2024) release.