Actually, the #include "file.h" form was a deliberate design choice: we don't know the folder structure of the user's project, so files could be regrouped differently. For example, sometimes all files are regrouped into a single directory. By "delegating" the path setup to the build system, it makes it possible to re-arrange files any way the user wants. But indeed, it requires the user to set up include paths, which is a task.
It seems difficult to find an ideal solution here which would please both situations: making the #include statements relative to the current file would remove the need to set up paths, but it would also oblige all user projects to preserve the directory structure of the zstd repository.
Not sure which one is better...
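Concretely, the two styles being weighed, using the same error_private.h example as in the issue description quoted at the bottom of this thread:

// Current style: the build system must pass something like -I lib/common
// for this to resolve.
#include "error_private.h"

// Alternative: resolved relative to the including file, no include-path
// setup needed, but the zstd directory layout must be preserved.
#include "../common/error_private.h"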
Yeah that makes sense. Perhaps point that out in projects/readme.md?
If you're on the fence about which way is better: I don't think I've ever seen a 3rd-party library integration where files were rearranged (though I might have very limited data). Keeping the file structure makes it easier to upgrade to later versions, since it's just "copy over" instead of flattening the files again.
Re-arranging the files is an investment of future time, multiplied across every platform.
We edit in the relative paths, dedupe the repeated functions, and then:
namespace zstd1 {
extern "C" {
#include "1/zstd-0.8.1/lib/common/xxhash.c"
#undef FORCE_INLINE
#include "1/zstd-0.8.1/lib/common/zstd_common.c"
//
#include "1/zstd-0.8.1/lib/compress/huf_compress.c"
// #include "1/zstd-0.8.1/lib/compress/zbuff_compress.c"
//
#include "1/zstd-0.8.1/lib/common/fse_decompress.c"
#include "1/zstd-0.8.1/lib/decompress/huf_decompress.c"
// #include "1/zstd-0.8.1/lib/decompress/zbuff_decompress.c"
// #include "1/zstd-0.8.1/common/zstd_static.h"
} // extern "C"
} // namespace zstd1
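For illustration, with the sources wrapped this way the embedded functions are reached through the namespace. A hypothetical call site, using XXH32 from the included xxhash.c:

// Note: extern "C" keeps the linker-level symbols unmangled, but in C++
// source the functions are still addressed through the wrapping namespace.
const char data[] = "example";
unsigned hash = zstd1::XXH32(data, sizeof data - 1, 0 /* seed */);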
I want to note that I've hit this particular issue as well. In general being able to use a third party library without rearranging or changing headers is nice; I'd like to be able to just add zstd as a folder or as a git submodule and compile it without adding internal zstd folders to the include path. I can do this with many third-party libraries, including lz4 ;) but not with zstd.
If relative include paths are inconvenient, one potential solution is to just put all zstd source files and internal headers in a single folder... (that is, to change the zstd repository structure so that there's just one flat source folder)
Actually, I was going to say that I planned to drop this ticket, because experience showed us it's great to be able to move files around without the need to modify them (all #include statements remain the same).
Of course, it depends on having all relevant directories listed in the include path. That's something the provided Makefile does, and so do the cmake script and the Visual Studio solution.
But indeed, it's always possible to have another build environment for which no script/config is provided, in which case it's necessary to set these directories manually, which is an annoyance.
What I could propose, then, is another directory, lib-flat for example, where all relevant files would be present. This is similar to @zeux's suggestion, though it would be an additional directory.
The simplest way to do it would be to add the target files as symlinks. Unfortunately, this does not work on Windows. So it could be replaced by a script, something like lib-flat/createLibFlat.sh for Unix and lib-flat\createLibFlat.bat for Windows. They would simply copy (or symlink) the relevant files into the target directory.
Unfortunately, there is also the question of what "relevant files" means. This is the reason why source files are separated into multiple directories to begin with. I suspect decompression is a must, and compression is likely wanted too. But it's less clear whether the dictionary builder is wanted, and even less clear for legacy format support.
Should we add options to these scripts, then? As can be seen, what started as a trivial workaround now sounds like a mini project...
Better to have the specs right before starting.
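For a rough idea of how little such a script has to do, here is a minimal, hypothetical C++17 sketch of the copying step (the real createLibFlat would presumably be shell/batch as proposed above, with the option handling just discussed; the compression+decompression scope is assumed):

// flatten.cpp -- hypothetical sketch: copy zstd sources into one flat folder.
// Run from the repository root.
#include <filesystem>
namespace fs = std::filesystem;

int main() {
    const fs::path dst = "lib-flat";
    fs::create_directories(dst);
    // Public headers sit directly in lib/; internal sources and headers
    // live in the sub-directories. dictBuilder/legacy are left out here.
    for (const char* dir : {"lib", "lib/common", "lib/compress", "lib/decompress"}) {
        for (const auto& entry : fs::directory_iterator(dir)) {
            const auto ext = entry.path().extension();
            if (entry.is_regular_file() && (ext == ".c" || ext == ".h"))
                fs::copy_file(entry.path(), dst / entry.path().filename(),
                              fs::copy_options::overwrite_existing);
        }
    }
}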
There's a way for this to not be a large effort or require a spec - just flatten the hierarchy inside lib/. zstd uses few files so I don't think I see the value in hierarchy inside that folder. You can use name prefixes if there's really a need to split files in multiple groups.
This solution is easy, pragmatic and maintains the option of moving files into different folders in case somebody wants to do that for whatever reason (is there really a use case for shuffling files from zstd/lib around?)
The primary reason for the sub-directory structure in lib/ is to allow users to include or disable specific features. It's probably something a newcomer will not think about initially, but experience tells us that when integrating the library, many users will choose not to include the lib/deprecated directory. Nor the lib/legacy one. And, while at it, not even the lib/dictBuilder one.
More rarely, some only need the decoder, or just the encoder.
Having a directory structure makes these choices relatively easy to apply. Moreover, code lives: new files get added, existing ones are split, old ones are deprecated. Such situations are also relatively simple to manage: compulsory files go into lib/common, compression-related ones go into lib/compress, etc. Existing build scripts automatically pick up such changes, and continue generating numerous variants with more or fewer features enabled.
I'm not against another solution, especially if it can help other use cases and build systems, but such an alternative shall preserve the current capability to selectively drop or enable library features.
Right. The way I see it, there are several conflicting requirements here:
1. Keeping the hierarchical file structure inside the zstd repository (which is what enables selecting features by directory).
2. Being able to compile zstd without adding its internal folders to the include path.
3. Letting users rearrange the files into a different folder structure.
4. Using the source files as-is, without generating or duplicating a separate set of sources.
Current solution solves 1, 3, 4 at the expense of 2. Changing includes to be relative solves 1, 2, 4 at the expense of 3. Introducing symlinks or separate build scripts that create a different set of sources solves 1, 2, 3 at the expense of 4. Switching to a flat file structure solves 2, 3, 4 at the expense of 1 (technically it makes 3 moot instead of solving it, but whatever).
One other solution that hasn't really been discussed here is amalgamation (making a script that merges all internal headers/sources into one .c file, similar to sqlite and some other libraries). It sort of violates 4, although not as strongly as symlinks, and makes 1 impossible, but it solves 2 and 3, and makes zstd completely trivial to integrate: adding just one file to an arbitrary build system of choice. Libraries that are structured like this are fantastic to use; lz4 basically worked that way if you didn't use HC mode.
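For a concrete picture, the "unity build" flavour of this idea looks roughly like the sketch below (a hypothetical zstd.c; the file names match lib/'s layout, but the exact list would be produced by the generation script, and a true amalgamation would paste the files' contents instead of including them):

/* zstd.c -- hypothetical amalgamation sketch.
 * Inclusion order matters: common support code first,
 * then compression, then decompression. */
#include "common/entropy_common.c"
#include "common/error_private.c"
#include "common/fse_decompress.c"
#include "common/xxhash.c"
#include "common/zstd_common.c"
#include "compress/fse_compress.c"
#include "compress/huf_compress.c"
#include "compress/zstd_compress.c"
#include "decompress/huf_decompress.c"
#include "decompress/zstd_decompress.c"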
I guess from my perspective, 3 is way less valuable than 2. Not solving 2 breaks basic user expectations (I can't add zstd to my Makefiles/Visual Studio projects/etc. without adding internal zstd folders to my include settings, or changing zstd files, or moving them around), while not solving 3 just means some niche use case that I don't understand isn't satisfied. Obviously I'm not the project developer, so I don't necessarily have the same view of these problems; all I'm saying is that 2 was (and is) a problem for me, and I've never encountered the need to do 3 for any libraries.
These are valid talking points, @zeux. Amalgamation is a topic I've been interested in lately. As I see it, it's interesting purely for ease of integration, as we already know that compilers do not like such giant source files (performance suffers, as optimisation heuristics get drowned in the huge number of opportunities to analyse). But even if there is a little speed hit, it may still be worth it.
The way I imagine it could work is to produce it with a dedicated script; we would still work on clearly separated files during development. We could, for example, create another directory, like lib-amalgamated, where the source code would be simplified to a zstd.h and a zstd.c. We could also make sure it's always updated before each release.
There are however 2 remaining issues: cat *.c is not enough (files have to be edited, especially #include statements; order of inclusion is important; many symbols that were isolated now become duplicated; etc.), and the "right" scope has to be selected.
At the end of the day, I feel the issue is not so different from the previously described idea of providing a lib-flat directory, where all "relevant" files would be present directly at root: there is the same issue of selecting the "right" scope (which can default to the same compression+decompression scope), it can likewise be prepared in advance for each release, and the script could also be run by users, with the possibility to select different options.
But the big advantage is that it only copies files, so it's much simpler than amalgamation.
As a bonus, it's a bit better for runtime performance, and it remains compatible with partial and multi-threaded compilation.
The downside, compared to an amalgamated source, is that it requires the host project to set up one include directory. That's better than several, but indeed more than zero.
As I see it, it's interesting purely for ease of integration, as we already know that compilers do not like such giant source files (performance suffers, as optimisation heuristics get drowned in the huge number of opportunities to analyse). But even if there is a little speed hit, it may still be worth it.
Has this been the case for zstd? My experience does not match that at all - there absolutely are issues with large functions with complex CFGs, but I haven't seen pessimisation from large sources where individual functions are reasonable - if anything, in some cases amalgamation improves performance due to better inlining (which you could also get by using whole program optimization, but that significantly slows down linking and isn't available on all platforms).
With respect to the scope of the amalgamated library, what I was thinking of is just the default compression+decompression function set (that's likely the 90% use case). And yeah, there definitely can be edits required for amalgamation to work (for zstd specifically, I actually tried compiling it as an amalgamated source; the issues are pretty minor: the CHECK_F macro is redefined in multiple source files, and one function, ZSTD_limitCopy, is defined twice).
@zeux zstd compression saw a small speed boost when breaking a large source file up into multiple segments, see PR #831.
@terrelln I'm not quite sure I understand whether this is a speed boost or just noise; I see compression and decompression speeds going both up and down in the first gist linked in that issue.
@zeux some of it is just noise, I didn't present the data very well since I didn't write it with presentation in mind. The difference shows itself on my devserver with all the gcc compilers on level 1 and 3. I'm certain those aren't just noise, since I repeated the benchmarks many times with consistent results. The difference doesn't show up on entropy, where those levels are about the same speed before and after.
We expect the number of files to continue to grow for each category in the future. For the time being, the current directory design has proven to work pretty well, and will continue to serve us well as the code base expands.
I understand that setting up #include directories in 3rd-party build systems can be a pain.
A proposed solution is to provide scripts which would copy all relevant files into a dedicated "single-level" (flat) directory, so that all files end up in the same directory and #include statements work out of the box.
We can certainly develop such a script if there is a need.
As this topic has not been active for quite some time now, I'll close it. We can certainly re-open it if needed.
compilers do not like such giant source files (performance suffers [...])
SQLite is faster when built from its amalgamation, compared to building from the individual source files in the source repo. Or were you instead referring to the performance of the build/compilation itself, perhaps?
Use of the amalgamation is the recommended way to consume SQLite, and it is how Fossil (SQLite's SCM) uses it.
I would love an amalgamated zstd release.
Regarding feature selection, SQLite integrates almost everything into the file: its core, of course, but also a bunch of extensions. And it has a bunch of OMIT #defines to let the client integrator enable/disable features, to tailor the build from lean-and-mean all the way up to the kitchen sink.
BTW, as already mentioned above, lz4 is available as a single file; it's a pity zstd isn't as well, given their relation. See also https://github.com/nothings/single_file_libs#compression
Zstd can be amalgamated with the scripts in contrib/single_file_libs.
Thanks for letting me know. But then, why isn't this part of the release artifacts at https://github.com/facebook/zstd/releases/tag/v1.4.9 ?
The goal of a single file is a lower barrier to entry: just download it, and add it to your project. If instead one must set up what appears to be a Linux-only dev environment just to generate the amalgamated sources, that's definitely not a lower barrier than downloading a couple of files from an official release. It should be a release product.
One of the issues is that the content of the single-file library can be customized. To begin with, would you want both compression and decompression, or just decompression? Do you want support for dictionary compression? For legacy formats? Should the code be optimized for binary size, or for speed? Should it target AVX2? BMI? NEON? WASM? etc.
The amalgamation script makes it possible to navigate between these various possibilities.
There might be an open question regarding the immediate availability, at release time, of some "common" subsets. It should probably be paired with more detailed documentation of the build script.
@Cyan4973 First, thanks for responding. I appreciate the time you are taking.
Second, regarding customization, I already expressed my wishes in https://github.com/facebook/zstd/issues/305#issuecomment-828226389
I.e., customization should not happen at amalgamation-generation time, but at compile time of the amalgamation by the client build, via #ifdef in the amalgamation itself.
So the amalgamation always contains everything: compression, decompression, dictionary support (legacy stuff is more debatable), but those various subsystems are enabled/disabled via #defines.
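A hypothetical sketch of what that could look like (the ZSTD_OMIT_* names below are made up by analogy with SQLite's SQLITE_OMIT_* defines; nothing like them exists in zstd today):

/* Inside the hypothetical all-inclusive zstd.c: */
#ifndef ZSTD_OMIT_COMPRESS
/* ...compression sources pasted here... */
#endif
#ifndef ZSTD_OMIT_DICTBUILDER
/* ...dictionary builder sources pasted here... */
#endif

/* Client build, decompression-only, with no edits to zstd.c:
 *   cc -c -DZSTD_OMIT_COMPRESS -DZSTD_OMIT_DICTBUILDER zstd.c
 */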
This is again how SQLite does it, and it works rather well (it's used by thousands of organizations, running on billions of devices). See https://www.sqlite.org/amalgamation.html and https://www.sqlite.org/compile.html#omitfeatures for all the #defines controlling which parts of the amalgamation are active.
Sure, the amalgamation is big, but that's not an issue. It's simpler, because there's a single amalgamation to build, one that contains (almost) everything, while the client remains in control of which subsystems are actually part of his/her build.
When you talk about optimizing for size or speed, targeting particular instruction sets, etc., that's none of the concern of the amalgamation-generation script; that belongs to the consumer of the amalgamation.
Does that make sense? Is that clearer?
Thanks/merci, --DD
Sure, that seems doable.
But right now, that's not how it works, and getting there would be a non-trivial task.
Some options can already be selected this way (and maybe we should document them better), but others are entirely driven by the presence or absence of the related source code. Typically, "decompression only" requires that the source code from lib/compress be absent from the amalgamation.
I don't expect such an objective to be achievable for the next release. Furthermore, having an existing, different way of achieving essentially the same thing lessens the attractiveness of a large refactoring with potential maintenance pitfalls. Even once it's done, we'll have to carefully monitor for potential side effects we can't guess yet.
All in all, that could be an item for future improvement, but in the short term, the build script will remain our main way of achieving these specialization objectives.
OK. I'm a bit disappointed, to be honest, but of course I understand.
That being said, could the current single_file_libs script be used to generate some amalgamation, as part of each release's artifacts, which includes compression, decompression, and dictionary support? To at least have an amalgamation available with each release?
I.e., I don't get to choose/customize what I build, but at least I can use zstd on Windows easily, just by copying a couple of files from the release ZIP into my own VS project, the same way I use SQLite.
We'll look into it.
One of the issues with this approach is the risk of a "stale" amalgamation that no longer corresponds to the release's content. I think that's what happened in the past, and why we decided to remove the pre-generated amalgamation.
I'll also use this opportunity to remind you that Windows offers WSL and WSL2 (Windows Subsystem for Linux) these days, under which these scripts work perfectly. msys2 (which I use when I'm on Windows) also offers the same capabilities.
Thanks Yann.
risk for "stalled" amalgamation
I don't quite understand what you mean. SQLite's amalgamation is built using a make target. It is part of the build. There cannot be a "stale" amalgamation, when it's generation is part of the build process, and the release automation.
Regarding your last point, I'll grant you that. I'm not using WSL yet, but have used msys2 in the past.
But I'd reply that portable scripting is also readily available, and SQLite's amalgamation is built using portable TCL, which is already required for a from-souce (repo) build; While using amalgamation avoids even that TCL dependency of course, that's kinda it's whole point of the amalgamation.
PS: Those days, sticking to BASH + Unix utilities (grep, sed, awk, etc...) for scripting, when options like Python, Ruby, Tcl, and the myriad others exist, is a bit like sticking-it to the primarily Windows devs, a popular sentiment among Unix'ians, I know :)
Zstd can be amalgamated with the scripts in contrib/single_file_libs.
Note that the path has moved to build/single_file_libs
Integrating zstd into other "non-standard" build systems is slightly harder than it should be, primarily because it requires setting up custom #include paths for the source files. E.g. zbuff_compress.c does an
#include "error_private.h"
which is a file in a different folder. Changing all the includes to be relative to the source file location would make this problem go away, e.g. if the same include were
#include "../common/error_private.h"
and so on.