GitoxideLabs / gitoxide

An idiomatic, lean, fast & safe pure Rust implementation of Git
Apache License 2.0
8.91k stars 303 forks source link

copy detection differs from git weirdly #1524

Closed fowles closed 1 month ago

fowles commented 2 months ago

Current behavior 😯

cli/src/commands/{ => file}/chmod.rs
cli/{tests/test_cat_command.rs => src/commands/file/mod.rs}
cli/{tests/test_unsquash_command.rs => src/commands/file/print.rs}
cli/tests/{test_chmod_command.rs => test_file_chmod_command.rs}
cli/{src/commands/cat.rs => tests/test_file_print_command.rs}

Expected behavior πŸ€”

cli/src/commands/{ => file}/chmod.rs
create mode 100644 cli/src/commands/file/mod.rs
cli/src/commands/{cat.rs => file/print.rs
cli/tests/{test_chmod_command.rs => test_file_chmod_command.rs}
cli/tests/{test_cat_command.rs => test_file_print_command.rs}

Git behavior

Git does what I have listed under "expected behavior"

Steps to reproduce πŸ•Ή

I am sorry that I don't have a super simple repro. I only know how to cause this in a larger context.

  1. Install jj via cargo install jj-cli
  2. jj git clone --colocate https://github.com/martinvonz/jj
  3. cd jj
  4. cargo build
  5. target/debug/jj diff -r 47bd6f4aa4a7eeef8b01ce168c6c771bdfffcbd3 --summary
  6. git diff 2de73f57fc9599602e001fc6331034749b2eacb0 47bd6f4aa4a7eeef8b01ce168c6c771bdfffcbd3 --summary

The relevant code for how we use gix is https://github.com/martinvonz/jj/blob/95e8dd51ebfd61712d9fc5f6e19dd95d0026a004/lib/src/git_backend.rs#L1301

Byron commented 2 months ago

Thanks for reporting! I'd think that it is possible to extract a test-case from the trees that show the issue.

Also, I am not surprised as Git implements a more complex algorithm to find the best possible matches. For that it keeps a list of candidates, which gitoxide does not currently do.

One cold hope that once gitoxide implements what Git does more closely, that it will get similar results.

For now I have no plans to improve this, but I should get there at some point. Contributions for this are very welcome, too.

fowles commented 2 months ago

if you help with test reduction and point me to the code involved, I could take a crack at this, but I don't know the code base well, so something to give me a jump start would be appreciated

Byron commented 1 month ago

That's a great idea and I'd appreciate you taking a stab at it. Getting a test ready should be possible to me, feel free to ping me though if it doesn't happen within the next 'days' (whatever that means πŸ˜…).

fowles commented 1 month ago

We have a few users on the jj discord mentioning that they have also seen some oddities with it selecting files for copy detection unexpectedly, so I will definitely take a crack at this after you help set up a little test that I can expand with more examples as we find them.

Byron commented 1 month ago

Sorry for the wait, but once #1529 is merged there is a test for the exact situation described here. Tooling was also added to allow reproducing any real-world scenario easily. With that you should be able to do whatever is needed to get similar results as Git.

Something notably absent is benchmarks, but if you are interested, you could probably set one up based on the new fixture for realistic performance in this scenario.

All the best - please feel free to let me know if you need anything else.

fowles commented 1 month ago

awesome! I will take a look at this later in the week

fowles commented 1 month ago

having started debugging a bit, I am pretty sure the similarity computation is what is going wrong here.

https://github.com/Byron/gitoxide/blob/242fedc973c56b6c1b6f150af99dda972a67f547/gix-diff/src/rewrites/tracker.rs#L524

In the provided example "cli/src/commands/file/mod.rs" and "cli/src/commands/cat.rs" are reported to have a 0.9977956 similarity which seems very wrong if you look at the file contents.

Byron commented 1 month ago

Thanks for sharing, that could be the spot! I still remember that I 'tuned' it to provide the same results as Git in some specific examples, so it might have been over-fitted there. With that said, ideally the new version of it will still produce the same results as before for these examples, as these are (at least supposed to be) modelled after a Git-given baseline.

I also wonder if even with that fixed one can obtain the same results without also adding the additional heuristics and candidates to find the best possible match.

fowles commented 1 month ago

https://github.com/Byron/gitoxide/blob/242fedc973c56b6c1b6f150af99dda972a67f547/gix-diff/src/rewrites/tracker.rs#L564

Is that supposed to be a += not an =?

fowles commented 1 month ago

that was it. Sending a PR shortly

fowles commented 1 month ago

https://github.com/Byron/gitoxide/pull/1535

fowles commented 1 month ago

resolving this until we find any more issues!