LostRuins / datasetexplorer

Easily view and modify JSON datasets for large language models
GNU Affero General Public License v3.0
62 stars 9 forks source link

S/R by regex? #1

Open inflatebot opened 3 weeks ago

inflatebot commented 3 weeks ago

Searching and filtering by regex is nice, but what would really help is being able to replace sequences via regex. I'm currently combing through a presently 120MB dataset removing poisonous tendencies, and one of the big ones is link spam; but you basically have to do that via regex or else manually delete every one by hand, which the Explorer isn't so good at doing on a large scale. I would generally prefer to be using Dataset Explorer over vscode, not least because it's way more performant on sets of this size, but also because doing regex stuff with vscode has a risk of breaking the JSON formatting.

LostRuins commented 3 weeks ago

This is tricky for the reason you just mentioned - the search queries are currently run on the composited sample data, not individual conversation turns (which would be exceedingly slow). In order for the search and replace to work, it has to be able to guarantee that the regex does not corrupt the output JSON.

You did mention deleting tho - that is easily done in bulk. Simply do the regex search in Filter tab to filter to the unwanted results, go to the "Select" tab and select however many entries you want by clicking Select Range (defaults to ALL). Only filtered results get selected.

image

Then, click Erase Selected to remove them all at once.