eBay / tsv-utils

eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.
https://ebay.github.io/tsv-utils/
Boost Software License 1.0
1.42k stars 80 forks source link

tsv-filter --label option #338

Closed jondegenhardt closed 3 years ago

jondegenhardt commented 3 years ago

Description

This PR adds a new feature to tsv-filter: Marking each record as either passing the filter test, or not.

Consider the following command, which identifies lines where the Color field is a primary color.

$ tsv-filter -H --or --str-eq Color:Red --str-eq Color:Yellow --str-eq Color:Blue data.tsv

The above filters out all records not satisfying the test. However, it is often desirable to keep all the records, instead marking the records to indicate the matches. The following command does this, adding a hew field, IsPrimaryColor populated with values 1 or 0 to indicate pass or not.

$ tsv-filter -H --label IsPrimaryColor --or --str-eq Color:Red --str-eq Color:Yellow --str-eq Color:Blue data.tsv

The label values can be customized using the --label-values option. To change the above to used true and false, run:

$ tsv-filter -H --label IsPrimaryColor --label-values true:false --or --str-eq Color:Red --str-eq Color:Yellow --str-eq Color:Blue data.tsv

Implementation

Adding the label field is straightforward. In the main loop, instead of choosing to output a line or not, an indicator is appended. However, the additional conditional tests in the loop caused a performance degradation. This was partly due to the recent addition of the --count option, which counts the number of records satisfying the criteria. The performance degradation was minor for wide files with long lines, but substantial for narrow files.

To regain performance the code was templatized to reduce the number of tests in the main loop. In addition, some changes to BufferedOutputRange to streamline that code. It had also added some additional checks as part of the recent --line-buffered support. Between the two changes all the original performance was regained, and possibly a bit more.