Collaborating on the same brightway dataset

jan-eat commented 4 years ago

I'm trying to find a way for two or more people (non-developers) to collaborate on the same dataset/project.

The workflow could be as follows:

- import a project from another person
- work / perform changes
- export the project
- merge changes with someone else

For this, one might have to use text files (possibly JSON), as binaries/DBs are somewhat hard to merge.

An idea would be to use the JSONDatabase backend to store everything directly as JSON files and then simply check these into version control. Would this work? Which other files would one need to track and check into git? This also assumes that the JSONDatabase backend is currently functional. As it is not supported in backend conversion, which parts of it are currently not functional? Or would one need to rewrite it?

If there is an easier way to implement this workflow, I'm all ears!

I guess a common SQL database might work, but could be difficult to handle with locks, concurrency etc, and I don't see a good way to handle data conflicts here.

BenPortner commented 4 years ago

Hi @jan-eat,

no solution, just a comment. We had a similar question recently on the mailing list. The poster mentioned that they are using the same SQL database institute-wide, so maybe they can give you a hint on how to manage concurrency and locks.

Concerning the JSONDatabase, I have not worked with this backend so unfortunately I cannot say anything about it. Managing it with git does not seem like the worst option to me though.

jan-eat commented 4 years ago

Hi @BenPortner, I guess there is no solution for it yet, but others have shown some interest in it as well. As there seem to be some major changes/work ongoing on the master branch here, it would be interesting to know if any new developments were made in this direction. @cmutel, it would be great if you could give a quick update on this ;)

cmutel commented 3 years ago

@jan-eat @BenPortner A good discussion to have, and something that needs to be easier!

As you are no doubt well aware, any database design comes with a set of trade-offs. The current version is the way it is because of the people who have been using BW in the last ten years. While there are some advanced people, most users are beginners, and so having something simple became a priority. Unfortunately, this does make use cases such as yours more difficult.

bw2data does allow you to define your own database backend, and that can be SQL, NoSQL, flat files, etc. You have to think about what makes sense for your organization. For example, for me, the concept of projects is an absolute necessity, as each project/paper needs to have an independent database that can't be messed up by other changes. Using the default backend creates a new subdirectory for each project, though this effect could also be introduced by having a single database with another column for project name. It could be projects are not useful to you, and you could skip this layer of complexity.
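To make the "single database with another column for project name" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `activity`, its columns, and the helper `activities_for` are all hypothetical; this is not bw2data's actual schema, only an illustration of scoping every query by project.

```python
import sqlite3

# Hypothetical schema: NOT bw2data's real tables, just an illustration of
# isolating projects with an extra column instead of separate directories.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE activity (
        project  TEXT NOT NULL,
        database TEXT NOT NULL,
        code     TEXT NOT NULL,
        name     TEXT NOT NULL,
        PRIMARY KEY (project, database, code)
    )
    """
)

# The same (database, code) pair can exist independently in two projects.
rows = [
    ("paper-a", "foreground", "steel-1", "steel production"),
    ("paper-b", "foreground", "steel-1", "steel production, revised"),
]
conn.executemany("INSERT INTO activity VALUES (?, ?, ?, ?)", rows)
conn.commit()


def activities_for(project):
    """Return (code, name) pairs visible inside one project only."""
    cursor = conn.execute(
        "SELECT code, name FROM activity WHERE project = ? ORDER BY code",
        (project,),
    )
    return cursor.fetchall()
```

Because the project is part of the primary key, changes made under one project name cannot overwrite rows belonging to another, which is the isolation that separate subdirectories provide today.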

In your case, I would think that the easiest thing might be to create a new SQL backend that points to a single Postgres database. Postgres supports concurrent writes and has everything you would want in a database. You could even reuse the peewee wrapping code.

The JSONDatabase backend is not deprecated, exactly, but it is not loved, and has a number of downsides. The only real upside is versioning in plain text via git. For the workflow you have above, another reasonable option could be to export the database to JSON using wurst, and then diff/merge/etc with this. But I have to admit this feels a bit awkward - if you need a single source of truth, use a database designed to do exactly that for decades.
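For the diff/merge route, what matters most is that the exported JSON is deterministic, so that git only shows real changes. The following is a minimal sketch that does not use wurst: it assumes the inventory has already been loaded into a plain dict keyed by (database, code), and the function name `dump_for_git` and the example data are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def dump_for_git(data, target_dir):
    """Write one JSON file per activity, with sorted keys, for stable diffs."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    # Sorting makes the output order independent of dict insertion order.
    for (database, code), activity in sorted(data.items()):
        # Assumes database names and codes are safe to use in file names.
        path = target / f"{database}.{code}.json"
        text = json.dumps(activity, indent=2, sort_keys=True, ensure_ascii=False)
        path.write_text(text + "\n", encoding="utf-8")
        written.append(path.name)
    return written


# Hypothetical stand-in for exported data, keyed by (database, code).
example = {
    ("foreground", "transport-1"): {
        "name": "transport, lorry",
        "unit": "tkm",
        "exchanges": [],
    },
    ("foreground", "steel-1"): {
        "name": "steel production",
        "unit": "kg",
        "exchanges": [
            {"input": ["foreground", "transport-1"], "amount": 0.2, "type": "technosphere"}
        ],
    },
}

with tempfile.TemporaryDirectory() as folder:
    written = dump_for_git(example, folder)
    first_pass = (Path(folder) / written[0]).read_text(encoding="utf-8")
    dump_for_git(example, folder)  # exporting again must not change the file
    second_pass = (Path(folder) / written[0]).read_text(encoding="utf-8")
```

One file per activity keeps merge conflicts local to the activity that two people both edited, at the cost of a repository with many small files.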

jan-eat commented 3 years ago

@cmutel thanks for this long explanation. I think I see three options now:

1. Flat files via JSONDatabase: this seems a bit clunky. Initially I thought versioning the files in Git might be nice, but maybe the repo would just become too big, and it doesn't really feel natural.
2. Postgres backend: I like this idea a lot, it seems like the most natural thing to do. A question here would be how Brightway handles caching. E.g. when you have two people working concurrently, how would their locally run Brightway instances diverge when one performs changes? Or does Brightway operate directly on the database?
3. Running Jupyter notebook on a server (e.g. in the cloud). Data can be stored on the server, and people can simply connect to the same machine to use Brightway. This would reduce complexity for people who would usually need to set up Brightway on their local machine. And server resources are easier to increase. Downsides are:
   - still need to implement a good backup solution (e.g. with a managed SQL backend)
   - not being able to work offline (should be okay in our case)

For the third option I'd probably try to throw Brightway into a fully configured docker container for easy deployment, mount the data volume into it (or use SQL as a backend). Would be interested in building a Postgres backend, but will need to check internally if we have the resources.

Is there a mechanism to store entire projects in a database, or would that have to be developed as well?

mklarmann commented 3 years ago

We rely heavily on Activity Browser, so the question came up whether it is possible to run a local Brightway instance (with the Activity Browser) that accesses a shared database for the team. Is this how you have been doing it, @BenPortner?

@cmutel what would be the downside of the JSONDatabase? Does it currently work?

@jan-eat I was also thinking about a manual option: the user just exports the file they worked on and provides it to another user for import. @cmutel does one of these exports work properly for this purpose: https://github.com/brightway-lca/brightway2-io/tree/master/bw2io/export

BenPortner commented 3 years ago

@mklarmann Unfortunately I have no experience using brightway or ab in shared mode. Using a shared database should be possible though. It seems to me that @cmutel recently implemented a function to change the default data directory. I have not tested it myself but maybe this is a place to start?

cmutel commented 3 years ago

One option, if you are reading more than writing, is to just store the database on something like Dropbox, or a shared mounted network drive.
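A minimal sketch of pointing Brightway at such a shared folder, assuming the `BRIGHTWAY2_DIR` environment variable is what bw2data consults for its data directory, and that it is read when bw2data is first imported. The helper name `use_shared_brightway_dir` is hypothetical.

```python
import os
from pathlib import Path


def use_shared_brightway_dir(shared_dir):
    """Point Brightway at a shared folder via the BRIGHTWAY2_DIR variable.

    Assumption: this must run BEFORE `import bw2data`, because the data
    directory is resolved when the package is first imported.
    """
    path = Path(shared_dir).expanduser().resolve()
    if not path.is_dir():
        raise NotADirectoryError(f"Shared folder does not exist: {path}")
    if not os.access(path, os.W_OK):
        raise PermissionError(f"Shared folder is not writable: {path}")
    os.environ["BRIGHTWAY2_DIR"] = str(path)
    return path
```

Usage would look like `use_shared_brightway_dir("/mnt/team/brightway")` followed by `import bw2data`, where the mount path is only an example. Checking the folder up front gives a clearer error than a failure deep inside the import when the network drive is not mounted.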

You won't have problems having multiple people write inventory data - SQLite only allows one write at a time (you might have to restart your Python notebook/AB instance, however). You could have problems, though, if you are creating new databases or LCIA methods, as this data is read from JSON files once when the Python interpreter starts, and then saved (but not reread) when changes are made. So if A and B start at the same time, and both make changes, then A's changes will be overwritten by B if B saves later.
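The overwrite scenario can be reproduced with plain JSON files. This is a simulation only: the `Session` class and the file name are hypothetical stand-ins for "metadata read once at startup, then written back whole", not Brightway code.

```python
import json
import tempfile
from pathlib import Path


class Session:
    """Mimics one interpreter: read metadata once, never reread, save whole file."""

    def __init__(self, path):
        self.path = Path(path)
        self.data = json.loads(self.path.read_text())  # read once at start

    def register(self, name):
        self.data[name] = {"backend": "sqlite"}
        # The whole file is overwritten from this session's in-memory copy.
        self.path.write_text(json.dumps(self.data, indent=2))


with tempfile.TemporaryDirectory() as folder:
    metadata = Path(folder) / "databases.json"
    metadata.write_text(json.dumps({"ecoinvent": {"backend": "sqlite"}}))

    a = Session(metadata)  # A and B start at the same time
    b = Session(metadata)
    a.register("db_from_a")
    b.register("db_from_b")  # B saves later, from a stale copy

    final = sorted(json.loads(metadata.read_text()))

print(final)  # ['db_from_b', 'ecoinvent']: A's entry is gone
```

Rereading the file immediately before each save would narrow the window for this lost update, but would not close it without some form of locking.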

Sorry, I know this is not ideal!

Having truly pluggable database backends is a priority, but isn't coming tomorrow... it requires changes in many places, in addition to documentation, tests, and BTW also writing the different backends!

brightway-lca / brightway2-data

Collaborating on the same brightway dataset #76