dagolden / Path-Tiny

File path utility
41 stars 59 forks source link

add interface to core's File::Compare #255

Closed grr closed 2 years ago

grr commented 2 years ago

I often have to compare the contents pointed to by the paths. I had been using the digest method, which works fine for the purpose, but I just benchmarked against File::Compare and the latter is 15-30x faster, so I'm definitely switching to using compare.

According to corelist: File::Compare was first released with perl 5.004, so it would not add an additional dependency.

xdg commented 2 years ago

Thanks for the suggestion. This does fit the mission of Path::Tiny -- making it easy to do things the "right way".

path($one)->compare($two)

Path:Tiny can do a raw open on the two files and provide the handles to File::Compare. And it can default to a 64k buffer size, which appears recommended as a good default for I/O.

I'm not inclined to support the compare_text method because dealing with file encodings leads to even more API bloat and is already a specialized case. Users can open the two file handles directly and iterate them easily enough.

Note to self: compare isn't a great verb for this. Many (but not all) boolean methods start with is. Maybe compares_equal?

xdg commented 2 years ago

@ap -- you've got good perspective on APIs -- what do you think about naming?

ap commented 2 years ago

My first thought was “what’s wrong with compare?”

I don’t feel that compares_equal improves on it.

And I couldn’t immediately think of anything else that does.

OTOH, taking a step back, I don’t actually disagree that compare is a bit meh.

Indeed, on second thought, it occurs to me that “compares” and “equal” both say the same thing, and that’s the problem, no? It’s just meh twice. You want to say what is equal. Which would be the contents, which means you are looking for something like contents_equal.

My only misgiving with that particular suggestion is that it completely loses any connection to File::Compare as the underlying module.

(FWIW I agree with not supporting compare_text.)

grr commented 2 years ago

which means you are looking for something like contents_equal.

My only misgiving with that particular suggestion is that it completely loses any connection to File::Compare as the underlying module.

It sounds like we're already assuming the return value will be reversed (unless we name it contentsdiffer). Looking through Path::Tiny's docs, there are a number of tests that are named `is(is_absolute,is_relative,is_rootdir), beside the file test operators, so why not something like$a->is_same_content($b)`?

The docs should also probably mention that File::Compare does not check if the underlying files are the same (it will happily take the same file as both inputs), so the user might still want to compare with realpath(). And there doesn't appear to be a portable way to check if a file is a hard-link of another: https://www.perlmonks.org/?node_id=859658

ap commented 2 years ago

All the existing is_foo methods are simple predicates about the invocant itself. This new method would be an odd one out among the set, so I don’t know that that’s a good place to locate it.

Just linguistically is_same_content is nails on a chalkboard though. It would have have to be called has_same_content anyway, or else something like is_identical_in_content.

This all gets rather cumbersome IMO.

It sounds like we're already assuming the return value will be reversed

That’s a good point. I don’t know that I’m happy with that.

xdg commented 2 years ago

Annoyingly, File::Compare has 0 for unequal, 1 for equal and -1 for error which means it doesn't actually work like cmp, which would make a name like cmp_content descriptive and intuitive. (Could reimplement to do those semantics without File::Compare...)

grr commented 2 years ago

The docs should also probably mention that File::Compare does not check if the underlying files are the same (it will happily take the same file as both inputs), so the user might still want to compare with realpath(). And there doesn't appear to be a portable way to check if a file is a hard-link of another: https://www.perlmonks.org/?node_id=859658

The unix cmp command (which is part of the POSIX spec, so is available on most systems) does both of these checks: http://git.savannah.gnu.org/cgit/diffutils.git/tree/src/cmp.c#n284 And it's exit value is similar to to File::Compare::compare's return values: Exit status is 0 if inputs are the same, 1 if different, 2 if trouble.

xdg commented 2 years ago

In the spirit of "write the docs first", here's what I propose. I think it's hard to misunderstand or misuse.

=method has_same_bytes

    if ( path("foo.txt")->has_same_bytes("bar.txt") ) {
       # ...
    }

This method returns true if both the invocant and the argument can be opened as
file handles and the handles contain the same bytes.  It returns false if their
contents differ.  If either can't be opened as a file (e.g. a directory or
non-existent file), the method throws an exception.  If both can be opened and
both have the same C<realpath>, the method returns true without scanning any
data.

=cut

Note: there is a subtlety with realpath where a realpath to a non-existent leaf is possible, which is why both must be opened before the realpath check.

ap commented 2 years ago

LGTM

grr commented 2 years ago

Just linguistically is_same_content is nails on a chalkboard though. It would have have to be called has_same_content anyway, or else something like is_identical_in_content.

I Just realized why i was having the opposite feeling. has implies a membership test, whereas is implies an identity test. And both of those words are frequently used in Perl with those contexts: has with Moo/Moose (though it's used to install a member into a set, instead of testing for membership), and is with the isa op.