dalehenrich / filetree

Monticello repository for directory-based Monticello packages enabling the use of git, svn, etc. for managing Smalltalk source code.
https://github.com/CampSmalltalk/Cypress
MIT License

Add means to split monticello version files #110

Closed krono closed 9 years ago

krono commented 11 years ago

The version files for Monticello (located in monticello.meta for Cypress) are rather huge and, being one long string, not really suited for a VCS/SCM like git, svn, or hg.

With this feature, the version file is split into a 'version.d' directory containing one version file per ancestor/stepChild. The original version file simply contains the UUID of the current version, to be retrieved from the directory.

The big advantage is that upon (git) merges, not the whole version information is conflicted, but only the 'head' UUID stored in the version file, which makes merging easier regardless of the VCS/SCM used. (We could even automate such merges on the Smalltalk side.)

A new monticello.meta dir might look like this:

.
├── categories.st
├── initializers.st
├── package
├── version
└── version.d
    ├── Cypress-Mocks-dkh.10_ad637421-2d36-448b-a56d-34b9fdf3f7a8
    ├── Cypress-Mocks-dkh.11_c444a6b8-868d-4796-9b9f-0892f416366c
    ├── Cypress-Mocks-dkh.12_69f2b830-b40e-4307-9589-ab2ec410cc09
    ├── Cypress-Mocks-dkh.13_a84c5a36-0447-472b-b980-b6f048d9c295
    ├── Cypress-Mocks-dkh.14_b6a04043-a012-4ec6-ac55-b02217efd4cd
    ├── Cypress-Mocks-dkh.15_f842c337-9b2a-4e62-a356-b76bc8542185
    ├── Cypress-Mocks-dkh.16_32dcfd04-aa93-48c9-a19e-c9eaea018e32
    ├── Cypress-Mocks-dkh.1_a5097e9e-1888-4e9d-bba6-c142add676db
    ├── Cypress-Mocks-dkh.2_cf98f4fb-cc54-465f-a770-22008f435b9a
    ├── Cypress-Mocks-dkh.3_fabc5fcb-ef59-4095-8e64-0172fda79cab
    ├── Cypress-Mocks-dkh.4_532f2be9-e166-404a-8ee7-782c1d5d14e0
    ├── Cypress-Mocks-dkh.5_195bdd19-051b-4ba6-afd4-3db4561d57fa
    ├── Cypress-Mocks-dkh.6_45050c08-be9e-4fba-93c6-fcb5539a6635
    ├── Cypress-Mocks-dkh.7_0cd353a1-d9bd-48f5-93c8-3f69a2684c27
    └── Cypress-Mocks-dkh.9_14c92a80-61f9-47ef-8cf8-39302dd087b1
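
To make the lookup concrete, here is a minimal Python sketch of how a reader might resolve the current version under this layout. It is an illustration only, not the code of the pull request: the function name `head_version_file` is hypothetical, and it assumes that `version` contains nothing but the head UUID and that each file in `version.d` is named `<version-name>_<uuid>` as in the listing above. When `version.d` is absent it falls back to the traditional single `version` file.

```python
from pathlib import Path
from typing import Optional


def head_version_file(meta_dir: Path) -> Optional[Path]:
    """Return the file that holds the current version's metadata.

    Assumption (hypothetical format): `version` contains only the head UUID
    and `version.d/` holds files named `<version-name>_<uuid>`.
    """
    version = meta_dir / "version"
    split_dir = meta_dir / "version.d"
    if not split_dir.is_dir():
        # Traditional layout: the single file holds the whole history.
        return version if version.is_file() else None
    head_uuid = version.read_text(encoding="utf-8").strip()
    for candidate in sorted(split_dir.iterdir()):
        # rpartition splits on the last underscore, so version names that
        # contain underscores still yield the trailing UUID.
        _, _, uuid = candidate.name.rpartition("_")
        if uuid == head_uuid:
            return candidate
    return None  # head UUID not found in version.d
```

The content of the per-version files is treated as opaque here; only the file names are used to find the head.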

ThierryGoubier commented 11 years ago

Could we just remove the version file instead? gitfiletree:// is not reading the version file anymore and knows how to rebuild the version history out of git. I suspect anybody interested in cvs, svn or hg could do the same.

krono commented 11 years ago

@ThierryGoubier When you are specific to a certain SCM, like you with git, this is certainly true.

But currently, FileTree by itself is SCM agnostic, with the monticello.meta extension, it is a file system mirror of Monticello; FileTree itself knows nothing about file-based SCMs, and I tend to think this is currently intentional (@dalehenrich, am I right here?). This is just a feature to further help any SCM to better handle the status quo.

krono commented 11 years ago

@ThierryGoubier GitFileTree currently only works in Pharo, right?

ThierryGoubier commented 11 years ago

@krono , yes, gitfiletree is developed and tested only in Pharo (2.0 and I hope 3.0 soon). What it relies on is not Pharo specific (OSProcess) but I'm not able to develop on other implementations.

When @dalehenrich and I discussed that, we were looking into removing monticello.meta/version, that's all :) In the case of gitfiletree, the removal is already effective (also for method properties). It makes the code simpler (sort of), may clean the logs and diffs (version is redundant with your SCM logs) and ensures that if you are exploring the history, we really have all versions around :)

krono commented 11 years ago

On 15.09.2013, at 19:25, Thierry Goubier notifications@github.com wrote:

@krono , yes, gitfiletree is developed and tested only in Pharo (2.0 and I hope 3.0 soon). What it relies on is not Pharo specific (OSProcess) but I'm not able to develop on other implementations.

That is understandable. I just noticed that it won't load in, e.g., Squeak because FileSystem is not yet available and String>>#matchRegex: is missing.

When @dalehenrich and I discussed that, we were looking into removing monticello.meta/version, that's all :) In the case of gitfiletree, the removal is already effective (also for method properties). It makes the code simpler (sort of), may clean the logs and diffs (version is redundant with your SCM logs) and ensures that if you are exploring the history, we really have all versions around :)

Yes, I understand. But if you have no means in place to extract that information from the file directory (e.g., no SCM at all, or with dropbox), you miss that entire information. I think here it is better to have it than to need it…

If monticello.meta/version is to go, I'm sad, but I'd say yes but not yet :)

Best -Tobias

frankshearar commented 11 years ago

Is String >> #matchRegex: in the Regex package? That package, last time I tried (within the last year) works/worked just fine in Squeak. (So it might be just a missing dependency.)

ThierryGoubier commented 11 years ago

@krono , I'm sorry for the use of FileSystem; I'm starting to notice I should have used fileUtilityClass? instead of FileSystem and I'll try to make a pass to clean that.

krono commented 11 years ago

On 15.09.2013, at 22:23, Frank Shearar notifications@github.com wrote:

Is String >> #matchRegex: in the Regex package? That package, last time I tried (within the last year) works/worked just fine in Squeak. (So it might be just a missing dependency.)

Cool. (OT: @frankshearar, probably that should be in trunk but unloadable?)

krono commented 11 years ago

On 15.09.2013, at 23:12, Thierry Goubier notifications@github.com wrote:

@krono , I'm sorry for the use of FileSystem; I'm starting to notice I should have used fileUtilityClass? instead of FileSystem and I'll try to make a pass to clean that.

No need to be sorry. FileSystem is just not yet fully integrated with Squeak. That's why Dale has this fileUtils abstraction.

krono commented 11 years ago

Again, regarding monticello.meta/version:

I think that

I mean, it is trivially possible for MCFileTreeGitStReader to ignore the directory if it exists, isn't it? And likewise, you could add a MCFileTreeGitStWriter inheriting from MCFileTreeStCypressWriter that does not even write such files/directories.

just my 2¢

ThierryGoubier commented 11 years ago

@krono this is exactly what MCFileTreeGitStReader is doing, i.e. not using the metadata files where it can infer the metadata from git. Whether it would be a single version file or a version.d directory won't matter.

But I do like using MCFileTreeStCypressWriter so that, if needed, you can still read the repository with a bare FileTree. It also ensures that the format understood by MCFileTreeGitStReader does not diverge from the FileTree format; as you probably have noticed, I didn't manage to write it as a direct specialisation of MCFileTreeStCypressReader and had to rework the internal API, which means that changes done in the MCFileTreeStCypressReader implementation are not directly reused by MCFileTreeGitStReader and I have to port them :( In that case, a solution would be a per-repository setting in FileTree, turning on or off the metadata writing (and robust FileTree reading code to handle missing metadata; at the moment MCFileTreeStCypressReader only handles the lack of all metadata; if the monticello.meta/ directory exists without version in it, it fails :().

Another issue with the version metadata is that it pollutes the SCM logs; a small change in a large package / class will create diffs with mostly metadata updates, and I know some users are complaining of that more than I do. Your solution should reduce that noise, I believe.

krono commented 11 years ago

On 16.09.2013, at 09:24, Thierry Goubier notifications@github.com wrote:

@krono this is exactly what MCFileTreeGitStReader is doing, i.e. not using the metadata files where it can infer the metadata from git. Whether it would be a single version file or a version.d directory won't matter.

Yes, I saw that :)

But I do like using MCFileTreeStCypressWriter so that, if needed, you can still read the repository with a bare FileTree. It also ensures that the format understood by MCFileTreeGitStReader does not diverge from the FileTree format; as you probably have noticed, I didn't manage to write it as a direct specialisation of MCFileTreeStCypressReader and had to rework the internal API, which means that changes done in the MCFileTreeStCypressReader implementation are not directly reused by MCFileTreeGitStReader and I have to port them :( In that case, a solution would be a per-repository setting in FileTree, turning on or off the metadata writing (and robust FileTree reading code to handle missing metadata;

True. probably we can work such things into .filetree?

at the moment MCFileTreeStCypressReader only handles the lack of all metadata; if the monticello.meta/ directory exists without version in it, it fails :().

I see. Mine is backward-compatible in the sense that when 'version.d' does not exist, the traditional 'version' file is used. Is that OK?

Another issue with the version metadata is that it pollutes the SCM logs; a small change in a large package / class will create diffs with mostly metadata updates, and I know some users are complaining of that more than I do. Your solution should reduce that noise, I believe.

That is the initial intention, yes.

ThierryGoubier commented 11 years ago

@krono , have you tried your format on a package with a large number of versions (say, above 100) on a slow machine? Just to check if there is not too much impact in opening and scanning that many files.

For the idea of checking the version.d/ presence in the meta directory, it looks nice. It would also allow for partial versioning history, am I right? (i.e. I could restrict writing to just the top version and its direct ancestors, and reloading would work fine?). There is some pressure from the Pharo side to reduce the history size to make released images smaller, pressure which results in a 2.0 release where the history of most packages is reduced to nothing.

krono commented 11 years ago

On 16.09.2013, at 10:45, Thierry Goubier notifications@github.com wrote:

@krono , have you tried your format on a package with a large number of versions (say, above 100) on a slow machine? Just to check if there is not too much impact in opening and scanning that many files.

To be frank, I have not.

For the idea of checking the version.d/ presence in the meta directory, it looks nice. It would also allow for partial versioning history, am I right? (i.e. I could restrict writing to just the top version and its direct ancestors, and reloading would work fine?).

I think so, but can you give a precise example?

There is some pressure from the Pharo side to reduce the history size to make released images smaller, pressure which results in a 2.0 release where the history of most packages is reduced to nothing.

I understand.

ThierryGoubier commented 11 years ago

@krono , maybe your current version of FileTree-Core could be a good test case: there are, what, more than 160 versions in there :)

For history reduction, I would restrict to the current version and just the direct ancestor(s). The latter would then have no ancestors. The directory of your example would be :

.
├── categories.st
├── initializers.st
├── package
├── version
└── version.d
    ├── Cypress-Mocks-dkh.15_f842c337-9b2a-4e62-a356-b76bc8542185
    └── Cypress-Mocks-dkh.16_32dcfd04-aa93-48c9-a19e-c9eaea018e32

In the case of a merge, you would have two ancestors, and hence three files in version.d.
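
As a rough illustration of this reduction, the following Python sketch works on an in-memory ancestry map rather than on real version files. The function name `prune_history`, the map structure and the `head`/`left`/`right`/`base` names are all hypothetical; the actual Monticello metadata format is not modelled here.

```python
from typing import Dict, List


def prune_history(ancestry: Dict[str, List[str]], head: str) -> Dict[str, List[str]]:
    """Keep only `head` and its direct ancestors; the ancestors become roots."""
    pruned = {head: list(ancestry[head])}
    for parent in ancestry[head]:
        pruned[parent] = []  # ancestry is deliberately cut off here
    return pruned


# Linear history: version 16 has the single ancestor 15.
linear = {
    "Cypress-Mocks-dkh.16": ["Cypress-Mocks-dkh.15"],
    "Cypress-Mocks-dkh.15": ["Cypress-Mocks-dkh.14"],
    "Cypress-Mocks-dkh.14": [],
}
print(len(prune_history(linear, "Cypress-Mocks-dkh.16")))  # 2

# Merge: the head has two ancestors (hypothetical names).
merge = {
    "head": ["left", "right"],
    "left": ["base"],
    "right": ["base"],
    "base": [],
}
print(len(prune_history(merge, "head")))  # 3
```

Each key of the result would correspond to one file in version.d: two in the linear case, three after a merge.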

You could then rebuild the history, if needed, by going through the chain: if you're on pure FileTree, there is nothing you can do anyway about browsing the history; you can only hope that you can check out the previous version of your package, and that your commit is a true package. I learned through gitfiletree that it often isn't because you will have done SCM-stuff like merging behind MC's back :). gitfiletree used to have code to handle unreadable version files such as version files committed with merge conflicts inside.

If you're on a gitfiletree, hgfiletree, svnfiletree, xxxfiletree, then you have the full history. If you're on smalltalkhub or squeaksource or ss3, there you have the full history (unless you're in Pharo release packages :)).

krono commented 11 years ago

On 16.09.2013, at 16:51, Thierry Goubier notifications@github.com wrote:

@krono , maybe your current version of FileTree-Core could be a good test case: there are, what, more than 160 versions in there :)

Right

For history reduction, I would restrict to the current version and just the direct ancestor(s). The latter would then have no ancestors. The directory of your example would be :

.
├── categories.st
├── initializers.st
├── package
├── version
└── version.d
    ├── Cypress-Mocks-dkh.15_f842c337-9b2a-4e62-a356-b76bc8542185
    └── Cypress-Mocks-dkh.16_32dcfd04-aa93-48c9-a19e-c9eaea018e32

In the case of a merge, you would have two ancestors, and hence three files in version.d.

You could then rebuild the history, if needed, by going through the chain: if you're on pure FileTree, there is nothing you can do anyway about browsing the history; you can only hope that you can check out the previous version of your package, and that your commit is a true package (I learned through gitfiletree that it often isn't because you will have done SCM-stuff like merging behind MC's back :): gitfiletree used to have code to handle unreadable version files such as version files committed with merge conflicts inside).

If you're on a gitfiletree, hgfiletree, svnfiletree, xxxfiletree, then you have the full history. If you're on smalltalkhub or squeaksource or ss3, there you have the full history (unless you're in Pharo release packages :)).

Ok, as I thought. Yes, clearly, if the two version files have no ancestor information this is simple :)

dalehenrich commented 11 years ago

@krono my inclination is to eventually drop the Monticello meta data. You are right that the monticello meta data is not compatible with a VCS, but the reason I prefer to drop the meta data altogether is that the meta data is redundant ... git does a better job of tracking the ancestry on a method by method basis than monticello does ...

With that said, I would be willing to integrate these changes for Squeak only.

Without having read the code in detail I am still concerned about what happens if one tries to read a FileTree repository that has been split using an implementation of FileTree that is not aware of the split convention ... I assume that at best we have completely broken the ancestry for the package and at worst we are unable to read the package ...

For the long term my preference is to replace the monticello meta data with git meta data ...

krono commented 11 years ago

On 17.09.2013, at 22:46, Dale Henrichs notifications@github.com wrote:

@krono my inclination is to eventually drop the Monticello meta data. You are right that the monticello meta data is not compatible with a VCS, but the reason I prefer to drop the meta data altogether is that the meta data is redundant ... git does a better job of tracking the ancestry on a method by method basis than monticello does ...

Clearly, git is better than such a directory. However, I can easily imagine scenarios (said dropbox scenario) where omitting this would inadvertently lose the information.

With that said, I would be willing to integrate these changes for Squeak only.

Well, I certainly don't want to create a Squeak island for FileTree. Better no such feature than one just for Squeak.

Without having read the code in detail I am still concerned about what happens if one tries to read a FileTree repository that has been split using an implementation of FileTree that is not aware of the split convention ... I assume that at best we have completely broken the ancestry for the package and at worst we are unable to read the package ...

I have thought about that, and I will augment the PR so that this breakage won't happen.

For the long term my preference is to replace the monticello meta data with git meta data ...

Wouldn't this render FileTree completely git-dependent? I would favor an SCM-agnostic "superclass implementation" with SCM-specific specializations (think GitFileTree, HgFileTree).

That being said, if SCM agnosticism is to go, I don't want to stand in the way. Feel free to close :)

dalehenrich commented 11 years ago

@krono, I've always considered the monticello meta data being present in FileTree as a compatibility feature. The monticello meta data is there for the case where you need to move the package back into a monticello repository.

If one is using git or svn for managing one's source, then I expect one to be using the merge capabilities of the underlying SCM, and in these cases the monticello meta data is a source of meaningless commit conflicts ... and when one gets a commit conflict in the version file, it is not possible to do the proper merge ... for monticello the version file has to be rewritten by a monticello program.

So in the end monticello and git (svn, etc.) do not mix well.

The Cypress format is scm agnostic, but in practice if one is using git and doing git merges, then the monticello meta data is a real problem and I'm not sure that your approach addresses that problem ...

I want to keep this pull request open and continue the discussion....

dalehenrich commented 11 years ago

Thinking about this further ... the FileTree repository by itself does support monticello meta data and will for the foreseeable future continue to support monticello meta data ... Other repository formats based on Cypress (like @ThierryGoubier GitFileTree) will exist and are free to tweak the disk formats ...

As I've said before I want and expect the disk format to evolve over time, so I shouldn't be standing in the way of such evolution.

On the other hand, it is important to maintain consistency within a particular format ... I don't want to break FileTree by changing the underlying format ... so if we can preserve compatibility across "versions" of FileTree then I don't want to stand in the way ... perhaps this new format will make it possible to correctly merge the version history when doing a git merge?

krono commented 10 years ago

(sorry for this very late reply)

perhaps this new format will make it possible to correctly merge the version history when doing a git merge?

This is the very intention of this PR :)

krono commented 10 years ago

And it now also works on Windows.

ThierryGoubier commented 10 years ago

@krono, I'm looking at the git merge conflict issue and I think there is another solution. If you're Ok testing it, it's at https://github.com/ThierryGoubier/GitFileTree-MergeDriver.

krono commented 10 years ago

On 30.04.2014, at 10:53, Thierry Goubier notifications@github.com wrote:

@krono, I'm looking at the git merge conflict issue and I think there is another solution. If you're Ok testing it, it's at https://github.com/ThierryGoubier/GitFileTree-MergeDriver.

I saw that you pushed this. I am curious, but I have to support Squeak on Linux, OS X, and Windows, so I cannot use it yet, right?

The logic of your MergeDriver looks good to me (although the methods are a bit long for my taste).

ThierryGoubier commented 10 years ago

Well, there is nothing linked with OSProcess, so it should work on Windows as well (i.e. it does not use gitfiletree://). However, I need the current directory, so I tried to guess that the CD environment variable would give that to me in the case of Windows.

I'll refactor it when I'm sure it's the right solution... No need if the merge driver has to be rewritten in C or Python.

krono commented 10 years ago

On 30.04.2014, at 11:27, Thierry Goubier notifications@github.com wrote:

Well, there is nothing linked with OSProcess, so it should work on Windows as well (i.e. it does not use gitfiletree://). However, I need the current directory, so I tried to guess that the CD environment variable would give that to me in the case of Windows.

Nice :)

I'll refactor it when I'm sure it's the right solution... No need if the merge driver has to be rewritten in C or Python.

Yes, whatever way, I would like that :)

ThierryGoubier commented 10 years ago

2014-04-30 12:40 GMT+02:00 Tobias Pape notifications@github.com:

On 30.04.2014, at 11:27, Thierry Goubier notifications@github.com wrote:

Well, there is nothing linked with OSProcess, so it should work on Windows as well (i.e. it does not use gitfiletree://). However, I need the current directory, so I tried to guess that the CD environment variable would give that to me in the case of Windows.

Nice :)

There are a few dependencies on Pharo FileReference stuff, but I could write them on FileTree higher-level API instead. And GitFileTree is not working on Windows just because nobody has taken the offer of Eliot Miranda for the OSProcess plugin for Windows... which says something about the need ;)

I'll refactor it when I'm sure it's the right solution... No need if the merge driver has to be rewritten in C or Python.

Yes, whatever way, I would like that :)

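I see it as two pieces:

  • a merge driver to reduce the number of conflicts (but I'm not jumping at the idea of writing it in C, to be honest. Python, maybe).
  • a merge tool (this one in Squeak or Pharo) to have a git merge tool which is aware of the structure of packages (and just do the usual for other files, just like meld).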

From the point of view of GitFileTree, the merge-driver isn't even necessary. Switching the git merge policy for those files to 'binary' is enough, and using meld to cope with the .st conflicts is probably easy enough. I'll be trying that on some of my projects.

I'll write that up somewhere soon.

Thierry

krono commented 10 years ago

On 30.04.2014, at 12:55, Thierry Goubier notifications@github.com wrote:

2014-04-30 12:40 GMT+02:00 Tobias Pape notifications@github.com:

On 30.04.2014, at 11:27, Thierry Goubier notifications@github.com wrote:

Well, there is nothing linked with OSProcess, so it should work on Windows as well (i.e. it does not use gitfiletree://). However, I need the current directory, so I tried to guess that the CD environment variable would give that to me in the case of Windows.

Nice :)

There are a few dependencies on Pharo FileReference stuff, but I could write them on FileTree higher-level API instead. And GitFileTree is not working on Windows just because nobody has taken the offer of Eliot Miranda for the OSProcess plugin for Windows... which says something about the need ;)

Or capacity, for that matter…

I'll refactor it when I'm sure it's the right solution... No need if the merge driver has to be rewritten in C or Python.

Yes, whatever way, I would like that :)

I see it as two pieces:

  • a merge driver to reduce the number of conflicts (but I'm not jumping at the idea of writing it in C, to be honest. Python, maybe).
  • a merge tool (this one in Squeak or Pharo) to have a git merge tool which is aware of the structure of packages (and just do the usual for other files, just like meld).

Ok, being not well informed, what is the difference between a merge driver and a merge tool?

From the point of view of GitFileTree, the merge-driver isn't even necessary. Switching the git merge policy for those files to 'binary' is enough, and using meld to cope with the .st conflicts is probably easy enough. I'll be trying that on some of my projects.

I'll write that up somewhere soon.

Looking forward to that.

ThierryGoubier commented 10 years ago

2014-04-30 13:38 GMT+02:00 Tobias Pape notifications@github.com:

Ok, being not well informed, what is the difference between a merge driver and a merge tool?

A merge driver is used by git to do the merge between versions of a file. It is an operation between the common ancestor of a file, the current version of that file and the branch version (the branch we are merging). From what I know, git may also use the merge driver to create the ancestor by merging between ancestors. git has a few merge drivers available, which are text, binary and I think another one.

A merge tool is something which is called by git (git mergetool) to have an interactive resolution to conflicts. Hence meld and similar tools, with the three panes showing the ancestor, current and other (branch). It will be called for each file in conflict (the one with the '<<<<<<<<<<<<<<' and '>>>>>>>>>>>>' markers) with git mergetool.

A failure in the merge driver results in a conflict for git, which can then be resolved by git mergetool (or by hand).
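
To illustrate the driver side, here is a minimal Python sketch of a custom merge driver. It is not the GitFileTree-MergeDriver code, and it does not parse Monticello version files: it assumes the file to merge holds a single value, such as the head UUID from the split proposal earlier in this thread. The names `merge_head` and `main` are hypothetical. What comes from git's gitattributes documentation is the protocol: a driver configured through `merge.<name>.driver` can receive the ancestor, current and other versions as the `%O`, `%A` and `%B` temporary files, must leave its result in the `%A` file, and reports a clean merge with exit status zero or a conflict with a non-zero status.

```python
import sys
from pathlib import Path


def merge_head(base: str, ours: str, theirs: str):
    """Three-way merge of a single value; returns None on a real conflict."""
    if ours == theirs or theirs == base:
        return ours  # both sides agree, or only our side changed
    if ours == base:
        return theirs  # only their side changed
    return None  # both sides changed, to different values


def main(argv):
    # Assumed invocation: <script> %O %A %B
    base_path, ours_path, theirs_path = (Path(p) for p in argv[1:4])
    merged = merge_head(
        base_path.read_text().strip(),
        ours_path.read_text().strip(),
        theirs_path.read_text().strip(),
    )
    if merged is None:
        return 1  # non-zero exit status: git treats the path as conflicted
    ours_path.write_text(merged + "\n")  # git expects the result in %A
    return 0


if __name__ == "__main__" and len(sys.argv) == 4:
    sys.exit(main(sys.argv))
```

The non-zero return is how such a driver tells git that it could not merge a file; the conflicted path can then be handled by a merge tool or by hand.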

krono commented 10 years ago

On 30.04.2014, at 14:55, Thierry Goubier notifications@github.com wrote:

2014-04-30 13:38 GMT+02:00 Tobias Pape notifications@github.com:

Ok, being not well informed, what is the difference between a merge driver and a merge tool?

A merge driver is used by git to do the merge between versions of a file. It is an operation between the common ancestor of a file, the current version of that file and the branch version (the branch we are merging). From what I know, git may also use the merge driver to create the ancestor by merging between ancestors. git has a few merge drivers available, which are text, binary and I think another one.

A merge tool is something which is called by git (git mergetool) to have an interactive resolution to conflicts. Hence meld and similar tools, with the three panes showing the ancestor, current and other (branch). It will be called for each file in conflict (the one with the '<<<<<<<<<<<<<<' and '>>>>>>>>>>>>' markers) with git mergetool.

A failure in the merge driver results in a conflict for git, which can then be resolved by git mergetool (or by hand).

Nice. (you should put that in some wiki)

ThierryGoubier commented 10 years ago

2014-04-30 15:38 GMT+02:00 Tobias Pape notifications@github.com:

Nice. (you should put that in some wiki)

It tells me that there are things I don't know: the text merge driver shows conflicts in the file; now how does my merge driver signal an inability to merge a version file?

Thierry

krono commented 9 years ago

I close this. The merge driver is the better variant.