Add streaming support for .gz and .bz2 format input / output files

joelduerksen commented 2 years ago

I'm finding very large turtle/triple datasets may be best kept in compressed form, if only to avoid being throttled by disk I/O max speeds while reading/writing them with a blazing fast library. Two common compression formats I'm running into with triple store data are gz and bzip2. Would you consider adding the ability to stream out of and back into compressed forms? There are libraries for Python that make this easy for both formats, so I'm hoping the same might be true here?

drobilla commented 2 years ago

Are you using the command-line utility, or serd as a library?

For code, it should be relatively easy to hook up serd to whatever compression library using custom read and write functions. Since serd is a lightweight library with no external dependencies, I don't think it's appropriate to add dependencies for this (and don't want to step on the feature creep treadmill of whatever archive format somebody wants this week). If the API makes this too difficult (there's a ton of archive libraries I have never tried), the reasons why should be addressed. I have already revised that heavily in the upcoming major version (serd1 branch) and imagine it should hook up nicely to more or less anything, but it'd be good to double-check various popular libraries before committing to the API.

That said, I'd be more open to adding it to the command-line utilities since that should be easy to make optional and doesn't add a dependency to the library itself. That code would also serve as an example to steal for other programs/libraries that want to do it. On the other hand, there you could just set up a pipeline...

joelduerksen commented 2 years ago

Just using serdi at the command line to convert to N-Triples right now, so I can examine the output of serd. Agreed, I was thinking a compile-time option to add support to serdi might be reasonable. I understand the goal of lightweight with no external dependencies (I like that too).

My next step is to try using the library (and I could add decompress support, understood). Agree examples are very helpful, it looks like the code for serdi itself might be the best example to start with? I hope to read any one of the supported formats (but compressed), do some minimal processing/filtering of the triples as they fly by, and then store subsets in a few separate files. Simple use, but it needs to be performant, and streaming, hence my interest in this library.
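For illustration, here is a minimal Python sketch of the workflow described above: read compressed input as a stream, filter triples as they go by, and write subsets to separate files. It does not use serd. It assumes the input is already gzip-compressed N-Triples (for example, output from serdi that was then compressed), and it splits each line naively on whitespace instead of parsing it properly. The function name `split_by_predicate` and the `routes` mapping are hypothetical names made up for this example.

```python
import gzip


def split_by_predicate(src_path, routes):
    """Stream a gzip-compressed N-Triples file, routing lines by predicate.

    routes maps a predicate IRI (including angle brackets) to an output path.
    Returns the number of lines written for each predicate.
    Assumption: one triple per line, as in N-Triples; this is not a real parser.
    """
    counts = {pred: 0 for pred in routes}
    outputs = {
        pred: open(path, "w", encoding="utf-8") for pred, path in routes.items()
    }
    try:
        # "rt" decompresses and decodes lazily, so only one line is held at a time.
        with gzip.open(src_path, "rt", encoding="utf-8") as src:
            for line in src:
                parts = line.split(None, 2)  # subject, predicate, rest of line
                if len(parts) == 3 and parts[1] in outputs:
                    outputs[parts[1]].write(line)
                    counts[parts[1]] += 1
    finally:
        for out in outputs.values():
            out.close()
    return counts
```

Swapping `gzip.open` for `bz2.open` should give the same behavior for .bz2 input. For Turtle or other syntaxes a line-based split is not enough, which is where a real parser such as serd comes in.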

drobilla commented 2 years ago

I see. For things like that, if you want to dig into the code, you might want to start with the aforementioned serd1 branch, even though it's not out yet. There is a lot more there around processing streams (including a utility specifically for filtering) and the API is quite a bit friendlier and more polished. You can do it with the current stable branch too, though; there have always been facilities for custom functions. Unfortunately there's not much example-based documentation (niche within a niche here, never seemed worth my time), but you should be able to figure it out from serdi or just by poking through serd.h.

If you're a Python fan, I'm working on Python bindings in the serd1 branch as well. They're not quite done yet though (I think the current tip doesn't even build, bit of a mess right now). Earlier WIP of the documentation here, for example: https://drobilla.net/files/pyserd_docs/ . I hope to finish this stuff up shortly, but have a lot of balls in the air right now... if you're interested in this I can ping this issue when they're ready(ish), feedback would be helpful.

As for the issue at hand, we can ponder whether built-in support is worth it for convenience, but you can always just throw some UNIX at the problem, e.g.

zcat mydbdump.ttl.gz | serdi -
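The same pipeline idea can be sketched in Python with only the standard library, which also covers .bz2 input. This is a rough sketch, not part of serd: the helper names `open_maybe_compressed` and `pipe_to_command` are made up for this example, and the usage comment assumes serdi is on the PATH and reads stdin when given `-`, as in the command above.

```python
import bz2
import gzip
import shutil
import subprocess
from pathlib import Path

# Map file suffixes to standard-library openers; anything else is read as-is.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_maybe_compressed(path):
    """Open path for binary reading, decompressing .gz/.bz2 on the fly."""
    opener = OPENERS.get(Path(path).suffix, open)
    return opener(path, "rb")


def pipe_to_command(path, command):
    """Stream the decompressed contents of path to the stdin of command.

    Returns the exit status of the command.
    """
    proc = subprocess.Popen(command, stdin=subprocess.PIPE)
    with open_maybe_compressed(path) as src:
        # copyfileobj copies in fixed-size chunks, so the file is never
        # loaded into memory as a whole.
        shutil.copyfileobj(src, proc.stdin)
    proc.stdin.close()
    return proc.wait()


# Hypothetical usage, assuming serdi is installed:
#   pipe_to_command("mydbdump.ttl.bz2", ["serdi", "-"])
```

Doing the decompression in the calling program keeps the dependency out of the library, which is the same trade-off as the shell pipeline.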

joelduerksen commented 2 years ago

I'm fine with C (likely faster), but Python is fine if it's easier to use and nearly as fast. I use any language as needed; if I had to pick one, I'd identify as a K&R C fan. Yes, I am interested to hear when serd1 is done. One question (maybe this is a can of worms?): why do you use a dash for stdin, instead of just reading stdin when no file is given like common file commands do, e.g. cat, sort, uniq, cut, etc.? If I'm not mistaken, using a dash is a niche behavior used only by a select few apps. It feels unnatural to need to add a dash parameter if piping data to serdi... I keep forgetting serdi needs it...

drobilla commented 2 years ago

Okay, I was just guessing from the Python libraries comment.

The - thing is a pretty universal convention for tools that are usually used with file inputs (which are friendlier in this case because then a base URI and syntax can be determined), but I suppose it could perhaps work without. In any case, please open separate tickets for unrelated issues to keep the tracker on point.

joelduerksen commented 2 years ago

You can close this ticket. Thank you for the notes. I agree this is not core, and there are more important things to work on.

drobilla commented 2 years ago

Okay. I will keep it around for now as a reminder, since I would like to make sure that at least, for example, it's easy to wire up libarchive to the read/write APIs.

I probably won't add support to the tools themselves for initial release (I'm really struggling to finally get this out, so non-API-affecting feature creep in general is out), but it should be easy enough to add as a feature in a minor release.

drobilla / serd

Add streaming support for .gz and .bz2 format input / output files #34