eBay / tsv-utils

eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.
https://ebay.github.io/tsv-utils/
Boost Software License 1.0

How to best use the code as a library? #253

Open Robert-M-Muench opened 4 years ago

Robert-M-Muench commented 4 years ago

I like TSV utils a lot. I'm wondering what's the best way to use the code as a library in my own applications?

Is it planned to separate the generic code-parts into a library? Maybe even add them to Phobos? IMO that would make a lot of sense too.

jondegenhardt commented 4 years ago

Thanks! I'm glad you like the tools.

Can you describe the types of functionality you'd like to see in a library?

At present there are no plans to separate out and release library components from the individual applications. This could be done. It's mostly a matter of whether the time investment would be worthwhile.

That said, it is possible to use the functionality in the common directory as library components. For an example, see dcat-perf. This uses the buffered IO routines in the common directory to do some performance testing. The dub.json file lists the dependency ("tsv-utils:common": "~>1.4.1") and source/app.d imports and uses the IO routines (e.g. import tsv_utils.common.utils : bufferedByLine;).
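As a rough illustration of that setup, here is a minimal sketch of a program that counts input lines with bufferedByLine. It assumes a dub project whose dub.json declares the "tsv-utils:common": "~>1.4.1" dependency quoted above, and that bufferedByLine can be called on a File with its default template arguments. The line counting itself is only a placeholder task and is not taken from dcat-perf.

```d
// Sketch only: assumes dub.json lists "tsv-utils:common": "~>1.4.1"
// under "dependencies", as described in the comment above.
import std.stdio : stdin, writeln;
import tsv_utils.common.utils : bufferedByLine;

void main()
{
    size_t lineCount = 0;

    // bufferedByLine reads the file in large blocks and yields one line
    // per iteration. The default-argument call form is an assumption here.
    foreach (line; stdin.bufferedByLine)
    {
        ++lineCount;
    }

    writeln(lineCount);
}
```

Built with dub, this could be run as something like `./app < data.tsv` (the executable and file names are hypothetical). Because these modules are not published as a stable library, pinning the dependency version as shown is probably the safer choice.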

However, I wouldn't recommend this for anything really serious, simply because these features haven't been published with the intent of being a general library. For example, if it turns out that the tsv-utils need a change, these modules may get changed in a non-backward compatible way. The features in common are well tested though and should be relatively solid.

There are some generally useful features in common, and I'd be interested in hearing whether any are useful to you. A good place to see the documentation is: tsv-utils.dpldocs.info/tsv_utils.common.

I'm guessing though that many of the more desirable features are higher up the stack: csv-to-tsv conversion, sampling routines, filtering, uniquing, etc. I'm definitely interested in hearing your thoughts on this.

jondegenhardt commented 4 years ago

Well, I'll list a couple things I thought of that would be library candidates.

One category is low-level utilities for manipulating TSV data. Here the main thing is inputFieldReordering. As it stands it is useful, but the interface is a bit rough. However, it would be especially useful in conjunction with support for named fields. There are a couple other worthwhile enhancements that could be added as well.

Another category is algorithms that could be applied to streaming data generally. Quite a lot of tsv-utils is designed to operate on indefinite or infinite length input streams: tsv-filter for example, and a number of other tools and algorithms as well. It would be helpful to try these in some alternative, representative environments prior to turning them into library utilities. That would help ensure a generalized enough API was being provided.

There are also some algorithms useful outside the context of an input stream. This is a smaller set, but there are useful things that could be done.

Robert-M-Muench commented 4 years ago

Sorry for answering late. Here are some thoughts/ideas:

jondegenhardt commented 4 years ago

Thanks, that's a useful list.