DocNow / twarc

A command line tool (and Python library) for archiving Twitter JSON
https://twarc-project.readthedocs.io
MIT License

New command to apply Compliance objects to dataset #556

Open igorbrigadir opened 3 years ago

igorbrigadir commented 3 years ago

There should be a command to actually apply the compliance objects to a dataset. Currently you can grab a list of IDs and get the compliance job results, but there's no command that does the actual filtering.

I'm thinking of a command like:

twarc2 compliance apply dataset.json compliance.json result.json

That would take a dataset and the compliance results, and output a clean, compliant dataset (either full objects or just IDs).
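To make the idea concrete, here is a minimal Python sketch of what such filtering could look like. It is not twarc's implementation, and the `apply` subcommand does not exist yet. The function names `load_noncompliant_ids` and `apply_compliance` are hypothetical, and the sketch assumes both files are line-oriented JSON: one compliance object per line with an `id` key, and one tweet object per line with an `id` key. It also assumes that every ID listed in the compliance results should be dropped, regardless of the reported action.

```python
import json


def load_noncompliant_ids(compliance_lines):
    """Collect the IDs named in compliance results.

    Assumption: each non-empty line is a JSON object with an "id" key.
    """
    ids = set()
    for line in compliance_lines:
        line = line.strip()
        if line:
            ids.add(str(json.loads(line)["id"]))
    return ids


def apply_compliance(dataset_lines, noncompliant_ids, ids_only=False):
    """Yield tweets (or just their IDs) that are not in noncompliant_ids.

    Assumption: each non-empty line is a JSON tweet object with an "id" key.
    """
    for line in dataset_lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        tweet_id = str(tweet["id"])
        if tweet_id in noncompliant_ids:
            continue  # skip tweets flagged by the compliance job
        yield tweet_id if ids_only else tweet


# Hypothetical usage with the file names from the proposed command:
# with open("compliance.json") as c, open("dataset.json") as d, \
#         open("result.json", "w") as out:
#     bad = load_noncompliant_ids(c)
#     for tweet in apply_compliance(d, bad):
#         out.write(json.dumps(tweet) + "\n")
```

Streaming line by line keeps memory use proportional to the number of flagged IDs rather than to the size of the dataset.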

SamHames commented 3 years ago

Big thumbs up from me. I'd be happy to collaborate on that one, if for no other reason than to have a consistent way to do this kind of filtering (i.e., do we remove referenced tweets that are among the deleted tweets but still present in includes?).
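One possible answer to the includes question, sketched in Python: drop flagged tweets from both `data` and `includes.tweets`. This is only an illustration of one policy, not a decision made in this thread. The function name `scrub_page` is hypothetical, and the sketch assumes a response page shaped like a Twitter API v2 result, with a `data` list and an optional `includes` object. It does not handle flagged users or media, and it leaves `referenced_tweets` entries that point at removed tweets untouched.

```python
def scrub_page(page, noncompliant_ids):
    """Return a copy of a response page without flagged tweets.

    Assumption: `page` has a "data" list and optionally an "includes"
    dict whose "tweets" list holds referenced tweets.
    """

    def keep(tweet):
        return str(tweet["id"]) not in noncompliant_ids

    cleaned = dict(page)  # shallow copy, the input page is not modified
    cleaned["data"] = [t for t in page.get("data", []) if keep(t)]

    if "includes" in page:
        includes = dict(page["includes"])
        if "tweets" in includes:
            # Referenced tweets get the same treatment as top-level ones.
            includes["tweets"] = [t for t in includes["tweets"] if keep(t)]
        cleaned["includes"] = includes

    return cleaned
```

Applying the same rule to both places means a deleted tweet cannot survive in the output just because another tweet quoted or replied to it.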

igorbrigadir commented 3 years ago

Yeah, sure thing! I haven't started on this at all yet.