Open nomennescio opened 5 years ago
I noticed for .92 and .93 that the commit ids were present in factorcode.org but not in github. Something is pretty messed up already. I also had github do a git-gc or something a few years ago because we basically had duplicate commits everywhere. At this point, I don't really have a good understanding of the repo.
One thing going on is that factorcode.org does a force push to github every 5 minutes. I guess I don't see the point of doing it this way anymore, and switching to github as the source of truth would let us use the github ui to merge patches.
cat force-push-github
#!/bin/sh
export PATH=$PATH:/usr/local/bin/
cd /git/factor.git
git push -f git@github.com:factor/factor.git 'refs/tags/*' 'refs/heads/*'
Duplicate commits are a sure sign of a partial rewrite of history, for instance caused by people not updating their local archive after a rewrite and then pushing their old branch, which pulls in all the old commits again.
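One way to spot such duplicates is to group commits by patch ID, since a re-pushed copy of a commit keeps its diff but gets a new SHA. A minimal sketch, run on a throwaway repository so it is self-contained; the branch names `original` and `rewritten` and the demo identity are made up for illustration and do not come from the Factor repo:

```shell
#!/bin/sh
set -eu
# Throwaway repository; every name below is illustrative.
cd "$(mktemp -d)"
git init -q demo && cd demo
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

echo base > file && git add file
GIT_COMMITTER_DATE='2013-01-01 00:00:00 +0000' git commit -q -m base
base=$(git rev-parse HEAD)

# The same change committed twice with different committer dates,
# mimicking a commit and its rewritten twin.
git checkout -q -b original
echo change >> file
GIT_COMMITTER_DATE='2013-01-02 00:00:00 +0000' git commit -q -am change
git checkout -q -b rewritten "$base"
echo change >> file
GIT_COMMITTER_DATE='2013-01-03 00:00:00 +0000' git commit -q -am change

# Print the commits of every patch ID that is shared by more than one commit.
dupes=$(git log -p --all | git patch-id --stable | sort |
    awk '{ n[$1]++; ids[$1] = ids[$1] " " $2 }
         END { for (p in n) if (n[p] > 1) print ids[p] }')
echo "duplicated commits:$dupes"
```

On a real archive only the last pipeline would be needed; it lists groups of commits that carry the same patch under different SHAs.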
All is not lost. Somewhere after tag 0.93 the branch needs to be 'cut open', and the '0.96' tag needs to be reinserted into the main branch. The bad news is that it requires a rewrite of history, or maybe that can be avoided by using git replace objects. This can also be used to solve the zeroPaddedFilemode issue, for which the currently prescribed 'fix' is to do a fast-export / fast-import combo.
This would need to be coordinated: everyone finishes their commits on branches and pushes them to the remote, and after the rewrite of the archive everyone makes a fresh clone.
If you're considering this route, I would be interested in rewriting the first git commit to point to the older archive history, which I have already prepared.
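Pointing a first commit at older history can be tried non-destructively with a graft. The following is only a sketch on a throwaway repository; the branch name `current`, the file name and the commit messages are hypothetical stand-ins for the prepared archive history and the current first git commit:

```shell
#!/bin/sh
set -eu
cd "$(mktemp -d)"
git init -q demo && cd demo
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Older history: two commits standing in for the recovered pre-git archive.
echo v1 > f && git add f && git commit -q -m "old 1"
echo v2 > f && git commit -q -am "old 2"
old_tip=$(git rev-parse HEAD)

# A parentless commit standing in for the current first git commit.
git checkout -q --orphan current
git commit -q -m "first git commit"
first=$(git rev-parse HEAD)
before=$(git rev-list --count HEAD)

# Non-destructive: this only adds a ref under refs/replace/;
# no existing object is modified and $first keeps its SHA.
git replace --graft "$first" "$old_tip"
after=$(git rev-list --count HEAD)
echo "commits reachable before graft: $before, after: $after"
```

The graft stays local unless the `refs/replace/*` refs are fetched explicitly, which is why a later full rewrite would make it permanent for everyone.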
This can be repaired, but some pain will be felt.
As a bonus, during reconstruction of the Git history, I've also correlated ALL published source archives with the git history. I already discovered that some releases do not appear exactly in the master branch. However, I did find the commits that differ the least from these releases. Now I understand why it happened: code was committed on a dangling branch and used for release. On the bright side, these similar commits are probably very near where the real merge points should be.
Looking at some of these commits it looks like there's real duplication
Compare e.g. e4cc936c55 with afb2a6ea
d2962585 with db359d6
905ec06 with 2a8af32
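A quick way to confirm that two commits are real duplicates is to compare the trees they record. A minimal sketch; `same_tree` is a hypothetical helper, and the demo runs on a throwaway repository rather than on the Factor archive:

```shell
#!/bin/sh
set -eu
# Succeeds when both commits record exactly the same tree (same snapshot).
same_tree() {
    [ "$(git rev-parse "$1^{tree}")" = "$(git rev-parse "$2^{tree}")" ]
}

# Self-contained demo: two distinct commits that point at one tree.
cd "$(mktemp -d)"
git init -q demo && cd demo
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
echo release > file && git add file && git commit -q -m base
tree=$(git rev-parse 'HEAD^{tree}')
a=$(GIT_COMMITTER_DATE='2013-01-01 00:00:00 +0000' git commit-tree -m release "$tree")
b=$(GIT_COMMITTER_DATE='2013-01-02 00:00:00 +0000' git commit-tree -m release "$tree")

if same_tree "$a" "$b"; then verdict=identical; else verdict=different; fi
echo "$a vs $b: $verdict"
```

In a clone of the Factor repository the call would presumably be `same_tree e4cc936c55 afb2a6ea`, and likewise for the other two pairs.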
I think having one repository with the full history, without any duplicate commit ids, and without the boot images checked in (they are at downloads.factorcode.org) would outweigh the negatives. I'm also in favor of using github or gitlab as the official repo and mirroring it at factorcode.org, instead of the push script we have now (above).
As for the boot images, these are included in the source code snapshots, which I extracted, even back to 2004, and only for the official releases. I think Git compresses these pretty well: for releases 0.66 up to 0.98, for all platforms, the total Git database size is only 280 MB. It has the benefit that users can bootstrap a release if they've cloned the repo. And for users who are not interested, these releases are on the release branch and need not be checked out.
If you think it is really important, I think I can try to run a filter-branch and remove them and see if it's worth the effort. History rewriting for files is dead-slow.
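For reference, such a filter-branch run could look roughly like the sketch below. It works on a throwaway repository, and the file names `boot.image` and `code.factor` are placeholders, not the real paths in the Factor tree:

```shell
#!/bin/sh
set -eu
cd "$(mktemp -d)"
git init -q demo && cd demo
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Two commits that both carry a (placeholder) boot image.
echo src > code.factor && echo img > boot.image
git add . && git commit -q -m "with image"
echo src2 > code.factor && git commit -q -am "more code"

# Rewrite every commit, dropping the image from the index of each one.
# --index-filter avoids checking out files, which is the faster variant.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --index-filter \
    'git rm -q --cached --ignore-unmatch boot.image' -- --all >/dev/null 2>&1

images_left=$(git rev-list --count HEAD -- boot.image)
commits=$(git rev-list --count HEAD)
echo "commits: $commits, commits touching boot.image: $images_left"
```

filter-branch keeps the pre-rewrite refs under `refs/original/`, so the old objects stay in the repository until those refs are removed and the repository is repacked. git filter-repo offers the same kind of path removal through `--invert-paths` together with `--path`.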
If we go forward with a full history rewrite, I'm not sure what the best approach would be. Suppose I gave it a go from a fork: I'm not sure a pull request would be the best way to deal with such a rewrite. Do you have any suggestions/preferences?
@erg I propose I create a pull request for the Git history restoration work I did until now. It might be improved if the dangling branch gets rewritten, in which case the history can be rewritten in one go.
See my comment on #2190 for Github's failure to create pull requests for detached commits.
My proposal is now to first fetch my paleo-history for #2190 into the main archive and then do a reconstruction/rewrite for this issue. A rewrite is probably required even just to fix the zeroPaddedFilemode errors.
Proposal to do the fix:
The fix should solve the git fsck errors, as well as the duplicate nodes and any other encountered issues. A full rewrite of the repo is inevitable. That requires some coordination. I suggest the following substeps:
git replace objects (non-destructive)
fast export | fast import to rewrite all objects locally
Steps 5-7 are to prevent intermediate changes, and to force people to clone (I think that would be required) the repo. It is in line with the recommendation of git filter-repo to rewrite into a new url to prevent major pains. It still preserves the old archive temporarily. These steps can be done with some time in between them to introduce some safety, at the cost of some inconvenience. The inconvenience is mostly for contributors who have pending work.
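The local fast export | fast import rewrite mentioned in the list can be sketched as follows. This is an illustration on throwaway repositories named `old` and `new` (both names are made up), not the exact procedure proposed for the Factor archive:

```shell
#!/bin/sh
set -eu
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Source repository with two commits and a tag.
git init -q old
echo a > old/file
git -C old add file
git -C old commit -q -m one
git -C old tag v1
echo b >> old/file
git -C old commit -q -am two

# Stream every ref into a fresh repository; all objects are written anew
# there, and the old repository is left untouched.
git init -q new
git -C old fast-export --all | git -C new fast-import --quiet

old_count=$(git -C old rev-list --count --all)
new_count=$(git -C new rev-list --count --all)
echo "old: $old_count commits, new: $new_count commits"
```

Writing into a separate repository matches the idea of publishing the result under a new url while the old archive is kept around for a while.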
The only thing important to coordinate then is that developers are aware of 5-7, and this is clearly communicated. Furthermore, no pull requests should be pending, and new pull requests should only be made on the new archive, hence everyone should clone the new repo. They need to transport their local work onto the new archive.
Furthermore, I should ideally be able to work on fixing the archive with no pending pull requests, although with the replace objects for historic commits, it might not have a big impact.
If we do it like this, I think we can also prevent future issues due to merges using out-of-date clones.
The alternative would be to use a git push --force, with the very real risk of errors from people's existing archives getting merged back into the new archive, making things potentially even worse. This is explicitly discouraged by git filter-repo.
The benefit of a full rewrite is that the replace objects are no longer needed, and users automatically get the full history.
As a reminder: if there is still any chance of recovering parts of Factor's history currently missing, it should preferably be done before this whole rewrite.
And on a different note: you wrote you wanted to make Github the leading Git repo, not the one on Factorcode, from which you force push to Github every 5 minutes. Just to be sure: the above approach requires that Factorcode no longer force pushes to Github.
What do you think? Do you see any issues with the above approach?
When preparing a pull request for the (partially) restored source code history of Factor before the current first git commit, I encountered an inconsistency in the current git archive: tags 0.96, 0.95 and 0.94 point into a dangling branch, and are potentially dangling commits! This is caused by commits that were never merged back into the main branch. The divergence starts somewhere after 0.93. If you point at tag 0.93, you'll see in Gitk: Follows: 0.92 Precedes: 0.97
When delving further into this, I used gitk to find 'Follows' and 'Precedes' tags, which gives the following ordering of commit tags on two different lines of history (commits):
similar-0.91 .. 0.92 .. 0.93 .. similar-0.94 .. similar-0.95 .. similar-0.96 .. 0.97 ..
and
similar-0.91 .. 0.94 .. 0.95 .. 0.96
where 0.96 is the last commit on a (dead) "branch".
So somewhere after similar-0.91, the commit history splits, with 0.94..0.96 on a dead branch and NOT present on the "main" branch, which continues beyond 0.97. I added the similar-* tags; these were originally not in the repo, but mark the commits most similar to exported and published snapshots of the repo.
That is because of the force push after .96, which assigned new SHA refs to the history. We need to assign those tags to the clean history.
After digging some more, I found the commit which precedes both 0.92 and 0.94, after which the "branch" splits into a "live" and "dead" branch:
028235b9ffc8972bbf74d41eee1ef970ac01d007
Interestingly enough the first commits (children) after this commit are identical, but have different SHA1s. This is the smoking gun of an incorrect push --force, the dread of all git repos.
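A split point like this can be located with git merge-base, and its children listed with git rev-list --children. A small sketch on a throwaway repository; the branch names `live` and `dead` are illustrative only:

```shell
#!/bin/sh
set -eu
cd "$(mktemp -d)"
git init -q demo && cd demo
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

echo 0 > f && git add f && git commit -q -m "fork point"
fork=$(git rev-parse HEAD)

# Two children with identical content but different committer dates,
# so they end up with different SHAs.
git checkout -q -b dead
echo 1 >> f
GIT_COMMITTER_DATE='2013-01-01 00:00:00 +0000' git commit -q -am child
git checkout -q -b live "$fork"
echo 1 >> f
GIT_COMMITTER_DATE='2013-01-02 00:00:00 +0000' git commit -q -am child

# Best common ancestor of the two lines of history.
found=$(git merge-base live dead)
# rev-list --children prints "<commit> <child>..."; count the children.
children=$(git rev-list --all --children |
    awk -v p="$found" '$1 == p { print NF - 1 }')
echo "split at $found with $children children"
```

In the Factor archive the two arguments to merge-base would presumably be tags from either side of the split, for instance 0.93 and the old 0.96.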
That is because of the force push after .96, which assigned new SHA refs to the history. We need to assign those tags to the clean history.
I'll be further investigating this to see where the "real" branch happens, i.e. at which point a rebase happened. There I can try to make a fixup.
I think you already did it with the similar-#.## tags, but:
0.94 tagged db359d69dfe0f24613f9a8ec4f6ac3a0b3d87980 should be d29625850f8197d5898cbcc4ddd8d505ccdc32f2.
0.95 tagged e4cc936c55d9946698abd266f673ba8c06b5e19e should be afb2a6eabb63ae50678e65564f89aa54de1d5130.
0.96 tagged 2a8af325347d5e90ce874f706f5746cd0ddaac9b should be 905ec06d864537fb6be9c46ad98f1b6d101dfbf0.
Why are there other similar tags before 0.94?
$ git diff similar-0.96 0.96
is empty, meaning I indeed already tagged the "same" commits. It also implies about 3-4 years of commits were duplicated by a push --force.
As the branch ending at 0.96 is completely identical, and dead, it can be removed, and the tags moved to the other branch. After the rewrite that will lead to a clean history without this error. I will add this to the 'to-do' list of the rewrite. I think that concludes this investigation. After the rewrite, this issue can be closed.
Why are there other similar tags before 0.94?
The other similar tags I added mark the source code commits that differ the least from the exported snapshots. Unfortunately these differ from the official tags.
I updated Github Releases to use similar-0.9x tags for 0.9x releases (0.94, 0.95, 0.96).
Unfortunately Github reorders the releases on the page, but there is no further impact.
I deleted tags 0.94, 0.95 and 0.96.
For reference, here are the original SHAs (will be removed in rewrite):
$ git show 0.94 | head -2
commit db359d69dfe0f24613f9a8ec4f6ac3a0b3d87980
Author: Doug Coleman <doug.coleman@gmail.com>
$ git show 0.95 | head -2
commit e4cc936c55d9946698abd266f673ba8c06b5e19e
Author: Doug Coleman <doug.coleman@gmail.com>
$ git show 0.96 | head -2
commit 2a8af325347d5e90ce874f706f5746cd0ddaac9b
Author: Doug Coleman <doug.coleman@gmail.com>
After the rewrite I will make new 0.94, 0.95 and 0.96 tags. I will also rewrite all current 'normal' tags into annotated tags. This has no impact other than that these tags will have a proper identity and live as objects, not as mere labels.
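Converting lightweight tags into annotated ones can be done with a small loop. A sketch under assumptions: the tag message `Release <name>` is invented here, and the demo uses a throwaway repository with a single tag named 0.97:

```shell
#!/bin/sh
set -eu
cd "$(mktemp -d)"
git init -q demo && cd demo
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
echo x > f && git add f && git commit -q -m release
git tag 0.97                      # lightweight: just a ref to the commit
target=$(git rev-parse 'refs/tags/0.97^{commit}')
kind_before=$(git cat-file -t refs/tags/0.97)

git for-each-ref --format='%(refname:short) %(objecttype)' refs/tags |
while read -r name type; do
    # A lightweight tag points straight at a commit; an annotated tag
    # points at a tag object, so only the former needs converting.
    if [ "$type" = commit ]; then
        git tag -a -f -m "Release $name" "$name" \
            "refs/tags/$name^{commit}" >/dev/null
    fi
done

kind_after=$(git cat-file -t refs/tags/0.97)
echo "0.97 was a $kind_before, is now a $kind_after"
```

The tagged commit stays the same; only the ref now goes through a tag object carrying a tagger, date and message.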
To clarify: use similar-0.9x for 0.9x until the rewrite (x in [4..6]).
When preparing a pull request for the (partially) restored source code history of Factor before the current first git commit, I encountered an inconsistency in the current git archive: tags 0.96, 0.95 and 0.94 point into a dangling branch, and are potentially dangling commits! This is caused by commits that were never merged back into the main branch. The divergence starts somewhere after 0.93. If you point at tag 0.93, you'll see in Gitk: Follows: 0.92 Precedes: 0.97
When running git fsck on these commits, I also get lots of zeroPaddedFilemode: contains zero-padded file modes errors, which is also bad.
This is a serious issue, as it is quite nasty to try to reconstruct a tree with a corrected merge point; it will need a rebase or filter-branch (the latter is cleaner).
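To get a feel for how widespread a particular fsck complaint is, its occurrences can be counted. A minimal sketch; `count_fsck` is a hypothetical helper, and the demo runs on a freshly created repository, where no such complaint is expected:

```shell
#!/bin/sh
set -eu
# Count fsck report lines that mention the given message id.
# grep -c exits non-zero when nothing matches, hence the "|| true".
count_fsck() {
    git fsck --full 2>&1 | grep -c "$1" || true
}

cd "$(mktemp -d)"
git init -q demo && cd demo
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
echo x > f && git add f && git commit -q -m clean

zero_padded=$(count_fsck zeroPaddedFilemode)
echo "zeroPaddedFilemode reports: $zero_padded"
```

Run inside an affected clone, the same function would report how many objects still carry zero-padded modes, which makes it easy to check whether a rewrite removed them.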
On top of that, I have to redo a lot of my work on reconstructing Git paleo-history, and will not be able to create a pull-request until this is properly solved.
I'm willing to help to fix this.