aspiers / git-deps

git commit dependency analysis tool
GNU General Public License v2.0
298 stars 47 forks source link

Dependency output order bad ... #62

Open dwm1945 opened 7 years ago

dwm1945 commented 7 years ago

So my dep list is about 180 commits ... so I extracted the column 2 list, reversed the order and used the result to form a git cherry-pick command. The first commit in the resulting command failed because of conflicts. Digging and digging I figured out that the dependency order doesn't work. The dependency is written to stdout when first encountered. If a subsequent commit also depends on it, it needs to be applied before that 2nd commit.

I'm not sure yet how to achieve that, partly because my git expertise is minimal. Essentially, the deepest dependency would seem like the right output order.

aspiers commented 7 years ago

Yes, I think you are right that this is a valid issue. Currently recursive discovery and outputting is done in a single phase, and indeed that was a choice I made deliberately in order to avoid a large wait in the UI and be able to start issuing output incrementally as dependencies are discovered.

However, ordering can be required, and the cherry-picking use case you state sounds like a very valid one. So one solution would be to add an option to support this which would split recursive discovery and outputting into separate phases, with ordering performed in between the two phases. I'm not likely to be able to implement this any time soon, but I'd gladly review a PR addressing it.

However another quicker (and more elegant, UNIX-like "one tool per job") solution might be to pipe the non-verbose output through tsort(1). I doubt it's a coincidence that I originally chose this output format which would be compatible with tsort ;-)

dwm1945 commented 7 years ago

Ordering wise, I found what seems to be a simple solution ... I created and maintain a dependency list and each time a dependency is found (new or old), I logically move it to the end of the list. At the end of the run, I output that list in reverse order (so the deepest dependency is first). I let the other output continue because there is value to having incremental output. Probably better to have an option and write to a file but right now this is about my backport not nice code.

Unfortunately, there remain issues, somehow the deepest dependency isn't found. Now my manual blame checking doesn't identify the missing dependency (or as previously, out of order). I'm guessing something wrong with my logic building the exclusion OR perhaps the excluded commit should actually be the last processed. In any case, I've concluded that our development style creates too broad a set of changes in the cherry pick by dragging in unrelated stuff.

aspiers commented 7 years ago

Hmm, your algorithm doesn't sound reliable to me - I think you may have just got lucky in this case. The order in which dependencies are discovered is not necessarily perfectly correlated to the depth in the dependency tree. Please can you try the tsort approach I suggested and let me know how it goes?

But also, remember that this dependency detection is a heuristic which is sensitive to factors such as the amount of context lines used. Nor does it detect logical (non-textual) dependencies. So it will never be a perfectly reliable fully automated solution. The best next step would be to use the web UI to recurse, and at each stage manually verify you're getting sensible results.

dwm1945 commented 7 years ago

In terms of luck, you're probably right ... my simple hack did not allow for dependencies which move having dependencies which must still be applied before them.

tsort doesn't work because it doesn't know about dependent relationships found after the dependency is output in the original output. If might work if the additional dependencies were recorded.

aspiers commented 7 years ago

tsort doesn't work because it doesn't know about dependent relationships found after the dependency is output in the original output.

I'm not sure but it sounds like you misunderstood my suggestion. I was suggesting to pipe the output of git-deps to tsort. tsort will not do any sorting until it receives the entire set of discovered dependencies, so there would not be any "dependent relationships found after the dependency is output in the original output". Or am I misunderstanding something?

dwm1945 commented 7 years ago

I did the equivalent of piping into tsort by cut paste of the full set of dependency pairs.

Let me try to explain ... Assume that the dependencies look like:

A <- B <-C <- F <- D <- G
    <- E <- D <- G
    <- H <- K <- L <- M <- D <- G
                             <- N <- P

The dependency on D will be detected on all 3 sub trees where the sub-dependency on G will also be detected. From my observation, the dependency on D will only be recorded in the output the first time it is found. For tsort(1) to properly order the dependencies, it must see all 3 uses of D.

The flaw in my logic was that G was not moved to be installed before D when D was moved to the deepest (earliest apply) position.

aspiers commented 7 years ago

On 17 Oct 2016 7:48 p.m., "David Morris" notifications@github.com wrote:

I did the equivalent of piping into tsort by cut paste of the full set of dependency pairs.

Let me try to explain ... Assume that the dependencies look like:

A <- B <-C <- F <- D <- G <- E <- D <- G <- H <- K <- L <- M <- D <- G <-N <- P

The dependency on D will be detected on all 3 sub trees where the sub-dependency on G will also be detected. From my observation, the dependency on D will only be recorded in the output the first time it is found.

Ah, no, that should not be the case, and if it really is then I believe you have found a bug.

For tsort(1) to properly order the dependencies, it must see all 3 uses of D.

Correct, and I would expect it to.

The flaw in my logic was that G was not moved to be installed before D when D was moved to the deepest (earliest apply) position.

Right.

dwm1945 commented 7 years ago

Well, if something wasn't pruning the search, my logic would have worked fine because G would have kept moving after D was moved in the ordered output list.

I think I saw logic that pruned the search whenever an already known commit was seen again.

aspiers commented 7 years ago

I think I saw logic that pruned the search whenever an already known commit was seen again.

Please can you provide a test case showing this? Like I said, I believe that should never happen, so if it does I would like to fix the bug.

wetneb commented 11 months ago

I am interested in improving this. My use case is that I use git-deps to clean up commits which fix other commits, by squashing the fix commit with the commit it should fix. This commit is generally the latest one in the dependency list output by git-deps. But the lack of ordering means that I often need to browse through a pretty long list to find the right one.

The topological sort mentioned above would be ideal, but it does have the limitation that one needs to wait for the entire search before outputting anything.

One other approach could be to make sure that the queue of commits to inspect is always sorted by decreasing date (from the commit's metadata). It's only a heuristic since dates can be manipulated in arbitrary ways - it is perfectly possible for a commit to depend on a commit with a later date, but I think in most real-world use cases it would do a decent job at ordering commits in a sensible way.

I think it could be worth offering both sorting options (sorted by date and topologically sorted) on the command line. Maybe the sorted by date could be the default?