Add functionality to redact/filter sensitive data

ErikBjare commented 8 years ago

We need a model to filter out sensitive data by default.

For example if a window title contains: "[title] - Firefox (Private Browsing)" we should redact [title] to some magic string such as "REDACTED".

In some cases we might want to filter the window out entirely, giving zero information about which window is running; better to catch too much than too little.
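A minimal sketch of both options (redacting the title, or dropping the event entirely), assuming events are plain dicts with a `data.title` field. The marker pattern and the names `redact_private_title` and `filter_event` are hypothetical, and real browsers word their private-window suffixes differently:

```python
import re
from typing import Optional

# Hypothetical marker pattern; adjust to how each browser labels private windows.
PRIVATE_MARKERS = re.compile(r"\((Private Browsing|Incognito)\)\s*$")
REDACTED = "REDACTED"


def redact_private_title(title: str) -> str:
    """Hide the page part of a private-window title, keeping the app suffix."""
    if not PRIVATE_MARKERS.search(title):
        return title
    # Split on the last " - " so only the page title is hidden.
    _, sep, suffix = title.rpartition(" - ")
    return f"{REDACTED}{sep}{suffix}" if sep else REDACTED


def filter_event(event: dict, drop_private: bool = False) -> Optional[dict]:
    """Return the event unchanged, a redacted copy, or None if it is dropped."""
    title = event.get("data", {}).get("title", "")
    if not PRIVATE_MARKERS.search(title):
        return event
    if drop_private:
        return None  # give zero information about the window
    data = dict(event["data"], title=redact_private_title(title))
    return {**event, "data": data}
```

Returning a copy rather than mutating the event means the same function could be run over an existing database export without touching the originals.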

It should be the goal that every user has a set of "clean" data. The filtering should also be able to be run on an existing database of data, so that cleaner data can be output. Preferably, the data should be so clean that there is little (or even no) reason not to share it (which would be great since easy access to a large dataset could make research in some areas a lot easier!).

The remaining question is where this processing step should take place. We want the filtering/redacting to happen before data is sent anywhere, but it should also be enforceable on a server (for example, if the server owner doesn't trust the server's security because it runs in the cloud), with clients notified of this so that they can do the filtering on their side, removing the need to send sensitive data at all. It might therefore be prudent to write a module in aw-core that implements this functionality, since it should be usable from the server and from all clients that transmit sensitive data.

This feature should be on by default. We don't need anything advanced yet; the first priority is to redact titles from Incognito/Private Browsing, which is a good step in the right direction.

This should have a far higher priority than Zero-Knowledge storage right now, because it's a lot easier and is more user friendly (In ZK storage: if you lose your keys you lose your data).

Useful when:

We want to export data to a 3rd party service but don't need them to know all the details.
We want to do overview analysis where full detail is not necessary.
We want to redact some information in the log, such as the window titles of Incognito/Private browser-windows, Tor Browser, etc.

This issue was originally moved from https://github.com/ActivityWatch/aw-server/issues/4 because its scope turned out to be wider than aw-server alone.


ErikBjare commented 8 years ago

It might be better to do redaction by simply removing the window-title property of the event entirely.

ErikBjare commented 7 years ago

I've done some basic work in https://github.com/ActivityWatch/aw-analyser/commit/eb45bf691ae128fcda9010c5500e5bba09e59ad1

unode commented 6 years ago

Would it make more sense to implement some kind of data encryption instead of redaction? Different users might have different opinions on what constitutes sensitive data...

For instance, with some form of asymmetric cryptography (public/private key), encryption would add no requirements, but accessing the data would require a private key and/or a password.

Otto-AA commented 6 years ago

I'd also really like to have a filtering feature, so I'll add my two cents here. (Edit: Well, that grew to a bit more than two cents... o.O)

In my wanton imagination it would look something like this:

1) main idea

Via an HTML form (on the activitywatch website) the user can create filters (e.g. delete event if event.data.incognito == true). They can then apply these to (1.1) filter future events sent to the server (from all watchers) and (1.2) remove already existing events in the database.

2) Filters

Filters should consist of two parts, namely:
(2.1) Filter criteria, e.g. if event.data.incognito == true
(2.2) Filter action, e.g. then remove event.title

2.1) Filter criteria

Specifies when the filter should be applied. Following functions would be nice:

(2.1.1) if [title/data.incognito] [equals/differs/includes/regex/>/>=/</<=] [comparison]

Examples:
if event.title includes 'Private Browsing'
if event.data.incognito equals true
if event.data.nested.val differs 'abc'
if event.data.val regex i_dunno_about_regex
if event.data.count > 10

(2.1.2) logical operators

Examples:
if event.title includes 'Private Browsing' and event.data.isSensitive equals true
if event.title includes 'Private Browsing' or event.data.isSensitive equals true

(2.1.3) metadata checks [watcher_name, is_test, ...]

e.g. if watcher_name equals 'aw-watcher-vscode'

(2.1.4) time ranges

Examples:
if event.timestamp is in_time_range(7:00, 9:00)
if event.timestamp is on_day('Monday')
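The criteria in 2.1 could be sketched in Python roughly as follows. This is only an illustration under assumptions: events are plain dicts with a `datetime` under `timestamp`, and every name here (`OPERATORS`, `lookup`, `criterion`, `all_of`, `any_of`, `in_time_range`, `on_day`) is hypothetical rather than an existing ActivityWatch API.

```python
import operator
import re
from datetime import datetime, time

# Comparison names taken from (2.1.1); each takes (field_value, comparison).
OPERATORS = {
    "equals": operator.eq,
    "differs": operator.ne,
    "includes": lambda value, needle: needle in value,
    "regex": lambda value, pattern: re.search(pattern, value) is not None,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def lookup(event, path):
    """Resolve a dotted path such as 'event.data.nested.val'; None if missing."""
    keys = path.split(".")
    if keys[0] == "event":
        keys = keys[1:]
    current = event
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def criterion(path, op, comparison):
    """Build a check for one field; a missing field never matches."""
    def check(event):
        value = lookup(event, path)
        if value is None:
            return False
        try:
            return bool(OPERATORS[op](value, comparison))
        except TypeError:  # e.g. 'includes' applied to a number
            return False
    return check


def all_of(*checks):  # logical "and" from (2.1.2)
    return lambda event: all(check(event) for check in checks)


def any_of(*checks):  # logical "or" from (2.1.2)
    return lambda event: any(check(event) for check in checks)


def in_time_range(start: time, end: time):  # (2.1.4)
    return lambda event: start <= event["timestamp"].time() <= end


def on_day(day_name: str):  # (2.1.4)
    return lambda event: event["timestamp"].weekday() == DAYS.index(day_name)
```

Because every criterion is just a function from event to bool, a form-based UI could assemble them without the user ever writing code.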

2.2) Filter action

Specifies what should be done, if applied. Following functions would be nice:

(2.2.1) remove [event/event.data/event.title/...]

Examples:
remove event
remove event.title

(2.2.2) replace [target] with [val]

Examples: replace event.title with 'REDACTED'
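The two actions in 2.2 could look something like this in Python. This is a sketch under the assumption that events are plain nested dicts; `remove_event`, `remove_field`, `replace_field` and `apply_filter` are hypothetical names, not part of any ActivityWatch library.

```python
import copy


def _locate(event, path):
    """Return (parent_dict, leaf_key) for a dotted path, or (None, leaf_key)."""
    keys = path.split(".")
    if keys[0] == "event":
        keys = keys[1:]
    *parents, leaf = keys
    target = event
    for key in parents:
        target = target.get(key)
        if not isinstance(target, dict):
            return None, leaf
    return target, leaf


def remove_event():
    """Action for 'remove event': the whole event is discarded."""
    return lambda event: None


def remove_field(path):
    """Action for 'remove event.title': returns a copy without that field."""
    def act(event):
        result = copy.deepcopy(event)
        parent, leaf = _locate(result, path)
        if parent is not None:
            parent.pop(leaf, None)
        return result
    return act


def replace_field(path, value):
    """Action for 'replace [target] with [val]'; missing fields stay missing."""
    def act(event):
        result = copy.deepcopy(event)
        parent, leaf = _locate(result, path)
        if parent is not None and leaf in parent:
            parent[leaf] = value
        return result
    return act


def apply_filter(event, criteria, action):
    """Run the action only when the criteria match; otherwise pass through."""
    return action(event) if criteria(event) else event
```

For example, `apply_filter(event, lambda e: e["data"].get("incognito"), replace_field("event.data.title", "REDACTED"))` would pair a criterion with an action in the way section 2 describes.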

3) Implementation

(3.1) User interface

Filters should be creatable via an HTML interface on the localhost site (http://localhost:5600/filters)

(3.1.1) Filters page

(3.1.1.1) A list of the active filters with the options to [edit/copy/disable/delete] the filter
(3.1.1.2) Option to add a new filter

(3.1.2) Add new filter UI

Should be easy to understand for non-coders. Likely with dropdowns and predefined fields.
(3.1.2.1) Filter name
(3.1.2.2) Filter criteria (see 2.1)
(3.1.2.3) Filter action (see 2.2)

(3.2) Server part

Someone knows of a library for that...??? o.o
(3.2.1) API endpoint
(3.2.2) filter parser
(3.2.3) store filter in file/database
(3.2.4) filter incoming events by stored filters

(3.3) Standardized events format per bucket

This would be really nice, as we then can give the users a list of available options when creating filters (e.g. data.[dropdown: 1) pizza, 2) pasta, 3) ...]) and for making sure a filter is valid.
(3.3.1) Alter create_bucket method to take additional data_structure parameter
(3.3.2) On API, check if the sent event matches the data-structure
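One way the check in (3.3.2) might work, sketched under the assumption that a bucket's `data_structure` is a dict mapping field names to Python types (or to nested dicts of the same shape). The function name `matches_structure` and the strict "no unknown fields" rule are illustrative choices, not an existing API:

```python
def matches_structure(data, structure):
    """True if data has exactly the fields in structure, with matching types."""
    if not isinstance(data, dict) or set(data) != set(structure):
        return False
    for field, expected in structure.items():
        value = data[field]
        if isinstance(expected, dict):
            # Nested structure: recurse into the sub-dict.
            if not matches_structure(value, expected):
                return False
        elif not isinstance(value, expected):
            return False
    return True


# Hypothetical structure for a window bucket.
WINDOW_STRUCTURE = {"app": str, "title": str}
```

The same structure could also feed the filter UI, since its keys are exactly the fields a dropdown would offer.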

Notes

Of course, I realize that it would take time to implement this, especially if there's no library for it. But from my point of view, this would improve the tool a lot.

Also much of this is just nice-to-have and doesn't need to be implemented right from the beginning. I just thought I would write out everything, so that while developing we can keep an eye on these (and maybe code in a way these other options can be implemented easily)

From next week on, I would have more time for developing, so until then maybe we can discuss if/how we should implement this? :)

Otto-AA commented 6 years ago

Had a bit of time, so here is a quick draft showing what I mean by these filters: https://github.com/Otto-AA/aw-filter/blob/master/filter.py After trying things out a bit, it actually seems rather easy to implement these filters in Python. I thought it would be much more work O.o

Nonetheless, before getting into the details we should agree on how to implement it ^_^

Otto-AA commented 6 years ago

Any thoughts on this proposal? If not, I'd do a bit more work and then create a pull request in aw_server

johan-bjareholt commented 6 years ago

@Otto-AA I've only skimmed through it so far, but it seems to be roughly in line with what we have been thinking as well. Right now I want to prioritize editor format and visualizations, and once that's done the more important feature IMO is tagging (which would share some similarities in the datastore, making this easier later on). But even more important is making a final 0.8 release.

This task is huge (just planning and prototyping the design would probably take 2 full days of work), so I'm not sure I want to prioritize discussing the design right now. I'm sorry, I really want this feature as well.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ErikBjare commented 4 years ago

Since categorization is now done, I'd just like to throw out a suggestion: one way to do this is to have a "sensitive/to-redact" category and then wipe the title/URL/app of all the events that match the category.
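A rough sketch of that category-based idea, assuming events are plain dicts with a `data` mapping. The `SENSITIVE` pattern stands in for a user-defined category rule, and `is_sensitive`, `wipe` and `redact_category` are hypothetical names:

```python
import re

# Stand-in for a "sensitive/to-redact" category rule (hypothetical pattern).
SENSITIVE = re.compile(r"private browsing|incognito|tor browser", re.IGNORECASE)
WIPED_FIELDS = ("title", "url", "app")


def is_sensitive(event):
    """An event matches the category if any string field matches the rule."""
    data = event.get("data", {})
    return any(isinstance(v, str) and SENSITIVE.search(v) for v in data.values())


def wipe(event):
    """Return a copy with the title/URL/app fields replaced."""
    data = {
        key: ("REDACTED" if key in WIPED_FIELDS else value)
        for key, value in event["data"].items()
    }
    return {**event, "data": data}


def redact_category(events):
    """Wipe every event that matches the category, leave the rest alone."""
    return [wipe(event) if is_sensitive(event) else event for event in events]
```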

johan-bjareholt commented 4 years ago

@ErikBjare That is not a good solution in terms of security. To make it truly secure we must never add the data to the buckets in the first place, rather than filter it when querying.

We could use the new settings API to solve this: add a way in the web-ui to add regexes which should be filtered, and then let aw-watcher-window check those on startup and filter matching events before they get sent.
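Roughly, the watcher side of that could look like the sketch below. How the regexes are fetched from the settings API is left out on purpose; `build_exclude_filter` and `send_filtered` are hypothetical names, and `send` stands for whatever function the watcher already uses to submit an event:

```python
import re
from typing import Callable, Iterable


def build_exclude_filter(patterns: Iterable[str]) -> Callable[[dict], bool]:
    """Compile the user's regexes once at startup and return a predicate."""
    compiled = [re.compile(pattern) for pattern in patterns]

    def should_send(event: dict) -> bool:
        title = event.get("data", {}).get("title", "")
        return not any(regex.search(title) for regex in compiled)

    return should_send


def send_filtered(events, should_send, send) -> int:
    """Send only events that pass the filter; return how many were sent."""
    sent = 0
    for event in events:
        if should_send(event):
            send(event)
            sent += 1
    return sent
```

With this shape the server never sees the excluded events, at the cost of each watcher needing its own copy of the check.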

There's also a duplicate feature request on the forum https://forum.activitywatch.net/t/add-an-exclude-list/345

ErikBjare commented 4 years ago

> to make it truly secure we have to never even add the data to the buckets in the first place

Agreed.

> not filter it when querying.

That's not what I mean. I mean to classify & filter when a heartbeat is received.

> We could add the new settings API to solve this, add a way in the web-ui to add regexes which should be filtered and then let aw-watcher-window check those on startup and filter them before the events get sent,

That makes the watchers depend on the server settings, and also requires us to implement the same filtering in all watchers. It's a bit more secure than what I had in mind since the server would never see the sensitive info at all, but not sure if it's worth it.

It's worth mentioning that the rules themselves are sensitive information, especially if they only contain a few things, making the "anonymity set" for redacted events small. However, it would be less of a problem if we went for deleting events entirely.

In any case, I've been thinking of building a feature in aw-webui that lets you search for events matching a particular pattern, and then let you delete them or replace them with redacted versions of the events. Wouldn't take that much work to build, search would be a generally useful feature anyway, and wouldn't add any code to the server or watchers.

johan-bjareholt commented 4 years ago

> That's not what I mean. I mean to classify & filter when a heartbeat is received.

Oh, alright.

That might still be an issue, though: we would need to be aware of bucket types, so that we don't corrupt events in buckets we didn't intend to touch (for example, replacing "afk" with "redacted" or something). At that point it makes more sense, architecture-wise, for the watchers themselves to redact sensitive information in a way that matches their event format well.
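To illustrate the bucket-type concern, here is a sketch in which redaction only touches fields that are declared sensitive for a given bucket type. The type names and field lists in `SENSITIVE_FIELDS_BY_TYPE` are assumptions made for the example, as is the function name `redact_for_bucket`:

```python
# Assumed bucket types and their sensitive fields; anything not listed
# (such as an AFK-status bucket) is passed through untouched.
SENSITIVE_FIELDS_BY_TYPE = {
    "currentwindow": ("title",),
    "web.tab.current": ("title", "url"),
}


def redact_for_bucket(bucket_type: str, event: dict) -> dict:
    """Redact only the fields known to be sensitive for this bucket type."""
    fields = SENSITIVE_FIELDS_BY_TYPE.get(bucket_type, ())
    if not fields:
        return event
    data = {
        key: ("REDACTED" if key in fields else value)
        for key, value in event["data"].items()
    }
    return {**event, "data": data}
```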

> That makes the watchers depend on the server settings, and also requires us to implement the same filtering in all watchers. It's a bit more secure than what I had in mind since the server would never see the sensitive info at all, but not sure if it's worth it.

Agreed, currently that's just a few watchers (aw-watcher-window and aw-watcher-web) but in the future it might become more.

> It's worth mentioning that the rules themselves are sensitive information, especially if they only contain a few things, making the "anonymity set" for redacted events small. However, it would be less of a problem if we went for deleting events entirely.

Very good point, didn't think of that.

> In any case, I've been thinking of building a feature in aw-webui that lets you search for events matching a particular pattern, and then let you delete them or replace them with redacted versions of the events. Wouldn't take that much work to build, search would be a generally useful feature anyway, and wouldn't add any code to the server or watchers.

Definitely a good start!

I'm not sure myself which of our suggested solutions is best; both have their pros and cons.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ErikBjare commented 3 years ago

I've added a new example redact_sensitive.py to aw-client that can be used to redact sensitive events: https://github.com/ActivityWatch/aw-client/blob/master/examples/redact_sensitive.py

pcuci commented 3 years ago

Why not encrypt the data going in?

The goal should be to not leak any private information if your machine gets hacked (unfortunately very common)

Then require a 2FA to view your own data.

NathanaelA commented 2 years ago

I've been looking at making a change to the Rust server engine so that it filters out events on its side. That way, no matter which tracker is sending data to the server, the server itself is responsible for filtering events out based on regex matches.

For simplicity, I'm looking at making the configuration part of the server config file at this point, but ultimately I think a filter table built into the database would be useful, so that the front end could easily send new filters to the backend.

@ErikBjare @johan-bjareholt - Would this be a PR you would be interested in?

redactedscribe commented 1 year ago

Just to throw out a couple of ideas relating to this/window titles that I'd like to see realised (some points mentioned earlier by others):

A way to:

Encrypt all of your data.
Mark certain window titles as sensitive but still have them logged. Think something similar to Discord's Streamer Mode: Keep all the data, but have some of it obscured/omitted if you want to share your stats, e.g. via screenshots of graphs, for example. Could be useful for sanitizing exports for debugging purposes too.
Mark certain window titles as blacklisted and never have them logged. For example, some window titles are spammy and of little relevance, such as those with a countdown timer in their title. Also, private browsing, as mentioned before.
Treat certain window titles as the same thing for display purposes. E.g. for the timer example, group all those window title entries as one since they're conceptually the same. Perhaps this would be achieved through user Regex rules, or maybe a user settable "window title variance" slider that fuzzily matches on how similar a title is to others, but in either case, on a per-application basis.
Merge window titles. After a rule is made for display purposes (above), an option to move it before the logging function to permanently merge captured titles as they come in. Less data but of higher quality if done right.
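The "treat titles as the same thing" idea could be prototyped with the standard library alone. In this sketch the countdown pattern, the `<time>` placeholder and the 0.8 threshold are all assumptions; the rule-based path uses a regex, and the "variance slider" is approximated with `difflib.SequenceMatcher`:

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical user rule: collapse mm:ss or hh:mm:ss timers in a title.
COUNTDOWN = re.compile(r"\b\d{1,2}:\d{2}(:\d{2})?\b")


def normalize(title: str) -> str:
    """Replace countdown timers so otherwise identical titles compare equal."""
    return COUNTDOWN.sub("<time>", title)


def group_by_rule(titles):
    """Group titles that are identical after normalization."""
    groups = defaultdict(list)
    for title in titles:
        groups[normalize(title)].append(title)
    return dict(groups)


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy alternative: the threshold plays the role of the slider."""
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

Applying `normalize` at display time gives the grouping behaviour; applying it before events are stored would give the permanent merge described in the last point.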

I've known about ActivityWatch for many years and have probably installed it once every year or two, but the lack of any way to disable window title capturing completely has always caused me to inevitably uninstall. Until there are ways for a user to handle window title capturing, an on/off switch would be great. Excuse any ignorance as my overall experience with ActivityWatch is quite limited. I hope that will change because I'd really like to use your useful program.

Thanks.

dennisorlando commented 9 months ago

Shouldn't this be as easy as creating a filtering list on the server side, thus "if entry has a match with a filter, don't add it to the sql database"? (I mean, I suppose not... else it wouldn't be open since 2016 🥲)
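As a toy illustration of that "don't add it to the database" check, here is a sketch using an in-memory SQLite table. The schema, `open_db` and `insert_event` are invented for the example and do not reflect how aw-server actually stores events:

```python
import json
import re
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with a minimal events table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (bucket TEXT, data TEXT)")
    return conn


def insert_event(conn, bucket: str, event: dict, filters) -> bool:
    """Store the event unless its title matches a filter; return True if stored."""
    data = event.get("data", {})
    title = data.get("title", "")
    if any(regex.search(title) for regex in filters):
        return False  # matched a filter: never reaches the database
    conn.execute("INSERT INTO events VALUES (?, ?)", (bucket, json.dumps(data)))
    return True
```

The check itself is small; the harder questions raised earlier in the thread (bucket types, where the rules live, and what happens to data already stored) are what this sketch leaves out.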

I can easily create a Category with a "Private browsing" pattern which correctly identifies all my "Private Browsing" data; A really simple button named like "filter out data from this category" would work perfectly well for a lot of people.

Currently, there is no solution nor compromise which would fix / alleviate the problem, apart from using this pull or running "redact_sensitive.py" periodically.

pcuci commented 9 months ago

Doesn't Chrome already know where you've been? (Unless you turned off all settings to track you?)

I believe most AW users' expectations differ wildly from those of a https://www.qubes-os.org/ user

If you really, really can't trust yourself with what you're doing on your computer, simply use a different operating system that allows you to hide entire compute workloads from yourself.

ActivityWatch, in my mind, is not for PEPs or investigative journalists, it's for everyone else who wants more control (but not total control, as if that were even a thing...) over their digital crumbs, and trusts themselves enough with a local database, on a non-air-gapped computer, likely connected to the internet.

If you need even lower level trust, go for https://puri.sm/ with Qubes on it :-)

No need to overcomplexify AW, IMO

codermrrob commented 8 months ago

The default should be to respect private browsing, with an opt-out option if somebody wants to record it. Most people will not want to record private browsing time, which for most people is not work-related anyway.

pcuci commented 8 months ago

I do want to record private browsing time.

The reason I use private browsing and VPNs is to hide my activity from others on the web, not from myself. The reason I use AW is to surface insights into my own digital behavior (on and off the web, work and personal, both), private browsing included.

Actually, I use multiple computers (and VMs) and I'd like to track my behavior across all these (virtual) devices, not just my "main" device.

I do trust my LAN/VPNs to not be compromised... and AW fits the bill quite nicely. :-)
