apache / datasketches-cpp

Core C++ Sketch Library
https://datasketches.apache.org
Apache License 2.0
225 stars 71 forks source link

Reorganization proposal #419

Open tjstum opened 10 months ago

tjstum commented 10 months ago

Hi there,

We are consider using datasketches in our C++ environment. The current organization, where there a bunch of folders, each with an include/ directory, is a bit awkward for integrating into a foreign build system. Projects like pybind11 put all of the headers in a single include/ directory (perhaps with subdirectories underneath). Then there is just a single path to add as an include path.

Would you be open to a reorganization that moved around files to look a bit like this:

include/datasketches
  count/
    count_min.hpp
    count_min_impl.hpp
  common/
    MurmurHash3.h
    ...

Thanks for your consideration.

jmalkin commented 10 months ago

Interesting. I believe we install everything into a single datasketches directory, so we'd assumed the library would typically be installed. You're probable not doing that?

Certainly not trying to tell you how to develop your project, just trying to understand what people do in practice. We don't get much feedback about that.

I don't think I have an inherent objection to this. Will try to look into it in the near term.

tjstum commented 10 months ago

Happy to send a PR for this as well.

Yeah, given that the project is (currently) header-only, there's not a huge benefit for us to install it instead of vendoring the files (which is an approach we've taken with other libraries that we plan to integrate with tightly). Currently, I've just flattened all the files into a single directory (basically what the build/install does).

jmalkin commented 10 months ago

If I try to add any new matrix sketches I think I'd try using Eigen (we have Frequent Directions in java for approximate SVD, for instance, but there may be other approaches that make that less interesting to port this many years later) which is also header-only. That's a combination of header-only considerations as well as playing nicely with our python wrapper (nanobind, which we adopted recently instead of pybind11).

@AlexanderSaydakov noted on Slack that this would be a major version change, though, and we did just do one of those. We do try not to do major version bumps toooooo frequently. So that'd probably be the main consideration.

jmalkin commented 7 months ago

@AlexanderSaydakov Several months later, I don't know if we want to revisit this yet?

AlexanderSaydakov commented 6 months ago

This seems reasonable. I can see how fragmented "include" directory structure can complicate someone's life. However, if this is not urgent, I would just keep it in mind for the next major version.

jmalkin commented 1 month ago

@tjstum I know it's been a while but we haven't forgotten this. We discovered a change we merged recently does break the API slightly in some edge cases so we'll likely pick this up for the next release.