Syncing - Githubissues

ErikBjare commented 7 years ago

Vote on this issue on the forum!

There are two usage issues with ActivityWatch at the moment to which syncing is a solution:

If you use more than one device, you need to check every device individually, or run one centralized instance of aw-server (not recommended!)
If a machine is lost, so is the data (the user could have exported it, but data stored after the export would still be lost). While ActivityWatch cannot replace a proper backup system, syncing could help by storing copies of the data across devices.

I know of two interesting solutions to this problem:

Centralized server which stores all data encrypted (the server is unable to decrypt)
- Issues: Centralized, single point of failure
- Done by @StandardNotes
P2P synchronization (encrypted, possibly including relays)
- Done by @Syncthing very well, perhaps we could use it in some way. Also: MPL2 licensed and written in Go.
  - Downside: Clients must be online at the same time for sync.
  - They have the ability to set some folders to "read only", useful when you want to ensure the data stays intact in its source.
- Implementing it ourselves would be an enormous effort, I assume.

--- Want to back this issue? **[Post a bounty on it!](https://www.bountysource.com/issues/43187921-syncing?utm_campaign=plugin&utm_content=tracker%2F35920020&utm_medium=issues&utm_source=github)** We accept bounties via [Bountysource](https://www.bountysource.com/?utm_campaign=plugin&utm_content=tracker%2F35920020&utm_medium=issues&utm_source=github).

ErikBjare commented 7 years ago

@calmh might know a thing or two about using Syncthing in an application-specific context like this. I haven't seen it done before so we might want to check with him before we start.

I've taken a look at the arguments to Syncthing and found -home which can be used to set a custom configuration directory. So pretty promising.

ErikBjare commented 7 years ago

I've started prototyping something small here: https://github.com/ActivityWatch/aw-syncthing/

Could be made to work both with standalone Syncthing and bundled Syncthing, but standalone would probably be preferred due to the dependency on the Python package syncthing which targets a specific version (right now targets 0.14.24 and the latest is 0.14.25).

What it does:

[x] Moves the database file to a specific location
[x] Creates a symlink from the new to the old location (so aw-server will just follow the symlink to the file)
[x] Starts Syncthing with a custom configuration directory
[x] Configures Syncthing via the REST API to add the new database folder as a synced folder
[ ] Add another device to sync the folder with

calmh commented 7 years ago

From a Syncthing point of view there's no real difference to it - you just start it up, point -home towards somewhere suitable, and configure it appropriately. You'll need to do the exchange of device IDs and so on somehow. As for the Python package, I see that it mentions 0.14.24 specifically but probably only because that was the latest when the README was written. All of 0.14.x speak the same API so there is no difference (and 0.12, 0.13 as well for the absolute most part).

ErikBjare commented 7 years ago

@calmh: Awesome! I'll let you know when we have a working release.

ErikBjare commented 7 years ago

I've started using Standard Notes recently (finally getting off Evernote) and have been impressed by the architecture. They have designed a neat data format/server called Standard File that defines how data should be encrypted and stored both client-side and server-side. Definitely something to check out.

Edit: It's interesting, but I'd rather have it distributed than just decentralized.

ErikBjare commented 7 years ago

I've been thinking about this a bit more.

My current idea is to simply configure a folder as a synced databases-folder. Basically aw-server would copy local data to this folder on a regular basis.

This folder could then be synced with Syncthing, Dropbox, or Gdrive (we should probably explicitly recommend Syncthing). The synced database files would not be allowed to be modified from another host then the one who owns them, since such changes could cause syncing conflicts.

Potential problems:

It would be nice to have the synced databases encrypted
Compressing them would also lead to huge storage savings

ErikBjare commented 6 years ago

Reddittracker turned up this today. Makes it pretty clear that sync is a vital feature for most users.

The best part is that you can put it on all computers (home and work) and on a smartphone. It'll track the software and sites you use on all of them and aggregate it to one account.

hippylover commented 6 years ago

Would be nice if this is implemented that it doesn't add to the system requirements to run the program. So for people who don't need the functionality or would rather just set up a cron job to copy it to a remote server manually can still disable the feature.

ErikBjare commented 6 years ago

@hippylover Noted! Thanks for the feedback.

brizzbane commented 6 years ago

I googled activitywatch + backup. Trying to locate where the data is stored. Would be really nice to be able to 'set' where the data is stored.

Backup solution I use--is to put important stuff I'm working on under Dropbox or MEGA. I'm under linux...and I actually add a home directory, so that I ...guess it makes me more 'aware' that it's Dropbox data.

I just read through the above comments--supposedly MEGA is end to end encrypted. I started using because more free data.. but it has the bonus of not having to mess with an encryption solution if you want to store it encrypted.

1000i100 commented 6 years ago

your sync look like auto-backup to me (or i've miss understoud)

how do you merge activity from multiples devices ?

if i was in charge, i probably use git as sync/merge tool if the data are stored in plain text files. But i've not explore your code base to make me an idea if it's a good way or not for this projet.

johan-bjareholt commented 6 years ago

@1000i100 The difference between sync and auto-backup would be that in auto-backup there's a definition of a producer and a consumer while in sync it doesn't, and by that definition we might actually refer to auto-backup yes.

Merging activity from multiple devices is not an issue as long as the one device you are requesting data from has the data for all the devices you want to view. Each kind of data is separated by activity type per each host which we call buckets.

Plaintext is simply not scalable and therefore git is out of the question. If we have 500MB of data and convert it back and forth between a database and plaintext file it would be incredibly slow.

ErikBjare commented 6 years ago

Started working on something small as an experiment: https://github.com/ActivityWatch/aw-server/pull/50

madumlao commented 5 years ago

raises hand

Just wondering - isn't the storage a database? syncthing doesn't handle database syncing.

johan-bjareholt commented 5 years ago

@madumlao I don't get that either, syncthing syncs file by file and it is near impossible to do a diff of a binary sqlite fine. The database can easily grow past 100MB and it's not viable to sync such a large file frequently.

ErikBjare commented 5 years ago

@madumlao Correct, but the database is stored in a file, which can be synced.

@johan-bjareholt Syncthing is smart enough to not sync the entire file if only parts of it have changed, see: https://forum.syncthing.net/t/noob-question-incremental-sync/1030/17

johan-bjareholt commented 5 years ago

@ErikBjare Oh nice. Googled a bit on the sqlite database files and they seem to be paged so that should be fine then. I just assumed that it was as bad as git when comparing binaries but apparently they have solved that issue.

johan-bjareholt commented 5 years ago

Would syncing with syncthing also mean that we will have multiple database files? In that case we might need a lot of refactoring.

madumlao commented 5 years ago

@ErikBjare I'm not convinced that an SQLite db will survive syncthing. At best case you'll lose transactions done on one side, at worse case you'll have a mispaired hot journal which will corrupt the whole db. Effectively, if an aw-server process is running on two machines there's going to be contention.

https://www.sqlite.org/howtocorrupt.html

The only way that syncthing, rsync, or similar process is going to be "safe" is if each transaction is a separate file, but I guarantee that that's going to be bad. You really need to implement some kind of peer to peer syncing db, such as for example, a multi-master LDAP.

ErikBjare commented 5 years ago

@johan-bjareholt Yes, each instance would write to its own file in the synced folder(s) (there are some benefits to having one Syncthing-folder per instance, as Syncthing can enforce "master copies" preventing accidental deletion/corruption on other machines). An instance would therefore have read-only access to database files from remote machines. I don't think this requires any major refactoring.

@madumlao I am aware, I'm not proposing we sync a single sqlite database file.

I thought I had mentioned it in the issue before, but I realize now that I hadn't. Hopefully this should clear things up: I'm not proposing two-way sync in the sense that you can edit remote buckets, only read them (and create copies, which you could in turn modify).

madumlao commented 5 years ago

I see. A full-on p2p system would be very much appreciated. I have a case where I have multiple laptops / devices that all move around. Unless I set up a single server and configured all clients (including firefox extensions etc) to talk to that server, my activity watchers will all have gaps in activitytracking, defeating the purpose of review.

Ideally a user who has multiple devices can transfer in between devices with little setup, and the tracking will follow them throughout.

Maybe the laziest / easiest way to do this without major rearchitecting is to use periodic "sync checkpoints", which would basically: 1) generate periodic sqlite dumps into some shared syncthing 2) upon startup (or periodically), check the shared syncthing folder for all sqlite dumps made by other nodes and import any transaction later than the "last remote transaction synced" 3) write down the "last remote transaction synced" somewhere for tracking

Could be implemented as a separate watcher-like process.

(My assumption is that tracking events are largely just additive transactions, there is little editing done)

By the way, I have no idea where the sqlite database is saved. Any pointers?

ErikBjare commented 5 years ago

@madumlao That's almost the exact design I had in mind for the MVP, nice to see we arrived at the solution independently!

We use appdirs to manage files like the database, caches, and logs. So check /home/<USER>/.local/share/activitywatch/aw-server if you're on Linux, or the appdirs documentation for user_data_dir otherwise.

x-ji commented 5 years ago

Just to be sure, there is currently no across-device syncing available yet, right? If so, once syncing available I'd gladly switch from RescueTime. I constantly switch across different computers.

johan-bjareholt commented 5 years ago

@x-ji No it's sadly not available yet.

jancborchardt commented 5 years ago

What might also be interesting is some integration with Nextcloud (disclaimer: I'm designer there :)

The ideals of the projects are quite aligned: being in control of your data.
Nextcloud is already reasonably widely adopted. That means you don't need to write an extra server, and people don't need to install something extra.
We support MySQL/MariaDB, PostgreSQL and SQLite (via some db abstraction I guess) cc @rullzer @MorrisJobke for technical questions.
There could be a server-side Nextcloud app which displays the data too. Since the desktop dashboard is already a web interface, that could be reused.

What do you think?

ErikBjare commented 5 years ago

@jancborchardt I like Nextcloud, but I don't think that's a direction we want the core project to go in (and I'm pretty excited about building a decentralized sync feature for a "localhosted" application).

I could elaborate, but I don't want to be overly critical (as I sometimes can be) so I'm just going to leave it at that :slightly_smiling_face:

However, if you're interested in making a business case out of it we're all ears! (and please let me know what you think of my reply in #257, that's really interesting for us)

zeonin commented 5 years ago

I definitely agree with not tying the core AW project to a specific sync implementation. As long as the abstraction is on the file level, it's totally application agnostic which is definitely great from a "my data, my way" perspective. It lets users choose how (or even if) they want to synchronize.

If having Nextcloud integration is a priority, AFAICT all that's needed is an instance of aw-server running on the Nextcloud box (or somewhere it can reach) and a Nextcloud webapp to interface with it.

Maistho commented 5 years ago

Personally I would much prefer having a centralized server. It seems to me like implementing some security on the communication between servers and clients would be a lot simpler than implementing some kind of p2p sync between servers.

For my use-case, where I have a single computer that runs both linux and windows with dualbooting, I will never have both servers running at once anyway, so any syncing would need to go through some 3rd host regardless. Running a single server on a seperate host seems like a much easier solution.

I'm up for implementing the security needed on the server.

What would you want to see in a PR in order to merge support for having a single server for multiple clients/devices?

ErikBjare commented 5 years ago

@Maistho Basically just HTTP authentication, preferably using OAuth in some way.

Would require password-protecting the web UI as well as adding a configuration option to aw-client to include the HTTP auth key. I'm a bit rusty on OAuth, but that's the gist of it.

Edit: Oh, and tests, lots of tests.

Edit 2: And HTTPS...

jancborchardt commented 5 years ago

I like Nextcloud, but I don't think that's a direction we want the core project to go in (and I'm pretty excited about building a decentralized sync feature for a "localhosted" application).

It’s your call of course. :) It just seems that you want to develop an activity tracking app, already have limited time for that – and then working on a sync server will take even more focus away from that?

Nextcloud could even just be one of many, by simply supporting WebDAV for syncing. Yay for open standards. ;) And another point is ease of setting up: If you want ActivityWatch to be accessible and usable by lots of people, it has to be dead simple. If for syncing you have to set up your own separate server, that’s a dealbreaker.

ErikBjare commented 5 years ago

@calmh Do you think progress on https://github.com/syncthing/syncthing/issues/4085 could help us achieve this? Looks like a really good fit for us.

kirkpsmith commented 5 years ago

I have been using ActivityWatch for a few months now and Nextcloud for a bit longer - I think it'd be best to not reinvent the wheel, and offer sync functionality along the lines of other great projects like Joplin, KeeWeb, and Zotero - I sync all of these apps and services with Nextcloud (WebDAV or pointing apps to same filespace on synced folder), but could just as easily switch to another syncing service. No Nextcloud apps involved, though that could offer extra functionality. I'd really like to just point an ActivityWatch instance to a WebDAV URL and provide a password and then forget about it.

johan-bjareholt commented 5 years ago

@kirkpsmith As far as I understand none of those will work as they sync on a file-by-file basis and do not support partial updates. In activitywatch we have one database which easily grows above 100mb and syncing such a large file back and forth is not an option.

unode commented 5 years ago

Adding to the pool of options: https://github.com/rqlite/rqlite

I also wonder if it would make more sense to simply change the underlying storage/db to one that supports replication/sync. There's also a reasonable wikipedia article listing https://en.wikipedia.org/wiki/Multi-master_replication

ErikBjare commented 5 years ago

@unode That looks pretty cool, but it's only available in Go and doesn't really make for a smooth end-to-end solution either since it would require the user to open ports, manually enter IPs, and elect a leader for each database file.

Doing it the Syncthing way would solve device pairing and NAT traversal and would work with standard SQLite available on all platforms.

I invite anyone with some time on their hands to try it though! It shouldn't take long to get something working (unless calling Go from Python/Rust is very cumbersome), but it won't work without significant effort (IP forwarding, static IP) for most of our users.

unode commented 5 years ago

@ErikBjare I'm not entirely sure what your vision for SQLite + syncthing is but from what I read above there are two independent problems being lumped together.

How to make the data reach other clients (syncthing, nextcloud, owncloud, gdrive, dropbox, NFS (why not if on a local network?), a distributed filesystem, you-name-it-sync)
How to make each ActivityWatch instance both a server (creator) and client (consumer) of the data to be syncronized.

For 1. there are plenty of solutions which have their own tradeoffs. Syncthing requires multiple online clients and ideally a star configuration (all clients talk to all clients) but some users may prefer a centralized option if, for instance, clients are never online simultaneously (extreme case, dual/multi boot on same machine). This point could very much be up to the user. There's a folder where content is created and the user is free to chose what works best.

In my opinion 2 is the harder task and one that I think might be worth either:

using a database that already implements replication
re-implementing replication inspired/based on an already existing solution.

For 1. I'm mostly familiar with PostgreSQL streaming log system which I think might fit here. Works well in an occasionally online model. It does require some mechanism to know when all clients read/consumed the log in order to release space but that's secondary. Most mainstream DBMS implement some kind of replication solution as well (MySQL, MongoDB, CouchDB, ...). However, the above discussion seems to be going in the direction of 2. Personally I'd avoid this. It's a massive project on its own with tons of edge cases and situations where users are very likely to run in to problems. Not to mention the difficulty of reproducing any kind of bug affecting this system. Took years for some DBMS to reach their current maturity.

johan-bjareholt commented 5 years ago

How to make the data reach other clients (syncthing, nextcloud, owncloud, gdrive, dropbox, NFS (why not if on a local network?), a distributed filesystem, you-name-it-sync)

This point could very much be up to the user. There's a folder where content is created and the user is free to chose what works best.

This is what our current prototype is, the ability to choose a folder where to store one database for each machine and then making the databases which the current host does not own read-only.

For 1. I'm mostly familiar with PostgreSQL streaming log system which I think might fit here. Works well in an occasionally online model. It does require some mechanism to know when all clients read/consumed the log in order to release space but that's secondary. Most mainstream DBMS implement some kind of replication solution as well (MySQL, MongoDB, CouchDB, ...).

A full database will never an option as it is too heavy, we can't have a syncing feature which requires over 100MB of RAM. On top of that we need much more than just database support to sync data, we need a way for clients to connect to each other without requiring the user to open ports on his network which is a more complicated matter when not having a centralized server.

However, the above discussion seems to be going in the direction of 2. Personally I'd avoid this. It's a massive project on its own with tons of edge cases and situations where users are very likely to run in to problems. Not to mention the difficulty of reproducing any kind of bug affecting this system. Took years for some DBMS to reach their current maturity.

This will not be as big of an issue for us as for other database solutions as we have clear owners of each bucket and can even have one database (sqlite file) for each host.

dreamflasher commented 5 years ago

This will not be as big of an issue for us as for other database solutions as we have clear owners of each bucket and can even have one database (sqlite file) for each host.

Okay, but where's the difficulty then? Merging is the only difficult part in syncing, if that's not part of it; why not simply let users sync their folders with Dropbox/Nextcloud/etc?

johan-bjareholt commented 5 years ago

@dreamflasher I though I had stated that's exactly what out current prototype is, maybe I was not clear enough.

This is what our current prototype is, the ability to choose a folder where to store one database for each machine and then making the databases which the current host does not own read-only.

It's not a perfect solution, but that's going to be our first MVP.

ErikBjare commented 5 years ago

Okay, but where's the difficulty then?

The only difficulty is for me to find the time to implement it. Which will hopefully be soon as I'll have a decent amount of free time after my exam tomorrow.

dreamflasher commented 4 years ago

@ErikBjare Do you have an update for us? Thank you! :)

ErikBjare commented 4 years ago

Hey @dreamflasher, I got caught up with working on categorization instead (which is working and released! I hope you like it).

Syncing is now definitely the next big thing (the votes for requested features on the forum are quite clear).

There is a prototype in Python here: https://github.com/ActivityWatch/aw-server/pull/50

And some initial progress on the final syncing implementation in Rust here: https://github.com/ActivityWatch/aw-server-rust/pull/71

It will be done sometime in 2020, but since I have my masters thesis coming up I can't promise when. Hopefully it's only a few months away 🙂

dioptx commented 4 years ago

@ErikBjare I'm willing to give a helping hand on the syncing! Let me know if it's possible!

2br-2b commented 4 years ago

I'm looking forward to this feature so I can replace RescueTime! Will people be able to self-host the server?

ErikBjare commented 4 years ago

@2br-2b Self-hosting the server is the only thing we support, it's not supported to host it remotely.

It's not really a "server" as much as a just a backend/node for the frontend. You only have to provide a synced folder (like Dropbox or Syncthing) for sync to work, when it's released.

miguelrochefort commented 4 years ago

Any progress on this? What's the main challenge? Is there any way I can help?

ErikBjare commented 4 years ago

The main challenge right now is that we first have to complete our migration to make aw-server-rust the default, then work can continue.

On Fri, 1 May 2020, 19:54 Miguel Rochefort, notifications@github.com wrote:

Any progress on this? What's the main challenge? Is there any way I can help?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ActivityWatch/activitywatch/issues/35#issuecomment-622492775, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKXDOQNQUMBCOHTJDNKGF3RPMEG7ANCNFSM4DEGJRJQ .

qins commented 4 years ago

Without syncing, it's really not cross platform, though AW is better than RescueTime in many respect.

update for the misunderstanding

it's really not cross platform

Please note the words: "really" and emphasized "cross". Of course, the original meaning of regular cross-platform is "running on multiple platforms"

johan-bjareholt commented 4 years ago

@qins Cross platform doesn't refer to communication between platforms, only the same application running on multiple platforms. See Wikipedia

In computing, cross-platform software (also multi-platform software or platform-independent software) is computer software that is implemented on multiple computing platforms.

https://en.m.wikipedia.org/wiki/Cross-platform_software

jtagcat commented 4 years ago

It hasn't been brought out before, Syncthing also handles compression, encryption, you can have one node on an always-on device (no need for all devices to be online).

You can also do read-only folders, but you can't have subfolders read-only (if the user uses ST for anything other, they'd end up an extra folder per-device).

I have never used the app, it seems that syncing is hard. Could you provide a directory, that is 'we are not responsible if data corruption' and can be synced between devices asap?

ActivityWatch / activitywatch

Syncing #35

update for the misunderstanding