allenai / wimbd

What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets
Apache License 2.0

Add `--with-locations` flag to `wimbd search` #14

Closed by epwalsh 3 months ago

epwalsh commented 3 months ago

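Adds a `--with-locations` flag to `wimbd search` that includes the location of each match (line number plus start and end columns of every submatch) in the output. Closes #13.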

For example:

cargo run -- search test_fixtures/c4-sample.00000-of-00001.json.gz -p '\bCascara' --with-locations --json

will output:

{"count":5,"matches":{"test_fixtures/c4-sample.00000-of-00001.json.gz":[{"line_num":24,"submatches":[{"end_col":7,"start_col":0},{"end_col":16,"start_col":9},{"end_col":279,"start_col":272},{"end_col":330,"start_col":323},{"end_col":339,"start_col":332}],"text":"Cascara (Cascara sagrada) by Eagle Peak Herbals: This North American shrub is a well known laxative and colon cleanser that has been widely used by physicians as well as native peoples. Many commercial preparations intended to treat constipation contain the cured bark of Cascara sagrada. Famous for \"next morning results.\nCascara (Cascara sagrada), certified organic grain alcohol, and distilled water."}]},"pattern":"\\bCascara"}