callat-qcd / espressodb

Science database interface using Django as the content manager.
https://espressodb.readthedocs.io
BSD 3-Clause "New" or "Revised" License

[JOSS Review] Reply to referee @remram44 #50

Closed (ckoerber closed this issue 4 years ago)

ckoerber commented 4 years ago

Hello @remram44,

We would like to thank you for your detailed reply. We have split our response into two parts (according to the divider in your original post).

there is a branch v1.0.0 with the same name as the tag v1.0.0...

Good point. The branch v1.0.0 has been removed, and we will stick to the following convention: branches will consistently start with a v (like v1.0.0), and as soon as they are merged into master, a tag with the same name will be created so that the merged branch can be deleted.

the blackmagicsorcery module stands out a bit, especially now that it just re-exports Python's standard re module (since 31c69150). You should probably just import re directly.

You are the first person to find this Easter egg :sweat_smile:. We believe this is not too critical as all import statements are either

```python
from espressodb.base.utilities import blackmagicsorcery as re
```

or

```python
from espressodb.base.utilities.blackmagicsorcery import search
```

Regarding the paper markup...

Indeed: the monospace formatting of Django will be fixed, and since "Leadership computing facilities" is a name (see, e.g., https://www.olcf.ornl.gov), we will capitalize it instead.

https://github.com/openjournals/joss-reviews/issues/2007

ckoerber commented 4 years ago

This addresses the second part of your review. We believe that your critique has sparked ideas for strengthening EspressoDB. We have created corresponding issues, which we hope to address within the next week.

Detailed responses

I am not sure of the value added compared to Django, which provides a lot of the claimed features of EspressoDB, some of which being just re-exported, such as database management, signals, and automatically creating views from models (Django has the "admin" pages).

We disagree that we "just re-exported" functionality provided by Django (other than the regex Easter egg). This is partly addressed in the response to @ixjlyons, and we elaborate further here. Using ourselves as a specific example: all but one of us would not have been capable of setting up the project structure we now have, using just Django, within a reasonable time frame. One of us created EspressoDB, and a few of us were then able to help create LatteDB, meaning the barrier to setting up a project is lower than what the pre-existing software requires. We know there are many academic groups in the world similar to ours.

The goal of EspressoDB is to build upon the functionality of Django, adding enhanced features while providing a simpler user interface. We added functionality to Django to streamline the process of generating a data management platform and to promote data consistency. It was a guiding principle to stay close to Django in order to utilize community support, additional modules, and available documentation. Furthermore, these guidelines are essential to guarantee long-term support of EspressoDB. As an example, we started the development of EspressoDB using Django version 2.2 and were able to incorporate the backward-incompatible changes made by Django version 3.0 on the same day it was released.

The most significant additions we have introduced live in the Base class, which extends Django's models.Model. This class encapsulates the functionality needed to auto-detect models and to resolve recursive dependencies on other tables (ForeignKeys). This is needed to simplify the nested population of data and to spawn default views. As examples, we add the documentation view (access to table details with references), the population view (a UI for creating Python scripts to populate nested tables), and a notification interface. Furthermore, creating an EspressoDB app automatically registers models, links, and views, and provides default templates. All these steps significantly speed up and simplify the process of creating a project.
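To illustrate what this looks like from the user side, here is a minimal sketch of two tables inheriting from the Base class. The import path, model names, and fields are illustrative assumptions (loosely following the Hamiltonian example mentioned below), not verbatim project code:

```python
# models.py of an EspressoDB app (illustrative sketch, not verbatim project code)
from django.db import models
from espressodb.base.models import Base  # assumed import path; Base extends models.Model


class Hamiltonian(Base):
    """Example table describing a Hamiltonian parametrization."""

    n_sites = models.PositiveIntegerField(help_text="Number of lattice sites")
    coupling = models.DecimalField(
        max_digits=10, decimal_places=6, help_text="Coupling strength"
    )


class Eigenvalue(Base):
    """Example table storing eigenvalues computed for a given Hamiltonian."""

    # The ForeignKey encodes the dependency used for nested population and provenance
    hamiltonian = models.ForeignKey(
        Hamiltonian, on_delete=models.CASCADE, related_name="eigenvalues"
    )
    n_level = models.PositiveIntegerField(help_text="Index of the eigenvalue")
    value = models.FloatField(help_text="Numerical value of the eigenvalue")
```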

It seems to me that it might be more useful if it was easier to integrate with existing data and processes? Creating a full Django project is heavy (project folder, app folder, migrations, ... are still there and need to be dealt with manually).

Indeed, our goal was to make it simpler for users to set up a project than it is with Django directly (see above). We hope that EspressoDB lowers the barrier for other users to implement their own projects and that the documentation streamlines this process sufficiently. The main reasons why we still have the Django project/app structure are:

  1. To build upon the already great functionality of Django;
  2. To stay close to Django, so that non-major Django releases do not introduce backward-incompatible changes;
  3. To utilize existing documentation and online FAQs; e.g., if you have an EspressoDB question, there might already be a Django answer.

If this becomes more widely used, we envision creating a project setup GUI interface to make it even simpler.

Maybe it should have more features to keep track of experiment runs and data dependencies (e.g. which data was created from which version of which data), kind of like a workflow system?

We agree that meta-information on how data was generated is essential for reproducibility. Depending on how tables are implemented, the tracking of dependencies can easily be realized through ForeignKeys. To store additional (default) meta-information, we have added the mandatory user (determined by database access) and last_modified columns, which are present in each model inheriting from the Base class and are stored automatically.
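As a sketch of what such provenance tracking can look like in practice, using the illustrative models from above (the query itself is plain Django; user and last_modified are the automatic columns mentioned here):

```python
# Follow the ForeignKey chain to answer "which input produced this result?"
eigenvalue = Eigenvalue.objects.select_related("hamiltonian").first()
source = eigenvalue.hamiltonian

# Default meta-information stored automatically on every Base-derived row
print(eigenvalue.user, eigenvalue.last_modified)
print(source.user, source.last_modified)
```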

Unfortunately, we have found that users do not always work with clean code revisions. Thus, we decided to make the tag column optional. We have added new functionality which simplifies storing code revisions, added a remark in the documentation of the example project, and created a new feature documentation entry (see the pre-save feature).

This example update script has to manually check the data, counting the records to figure out if the data is up to date or not.

The example script in the documentation is intentionally simplistic; its purpose is to intuitively visualize basic usage. As such, yes, it does not demonstrate best practices when it comes to storing data. Rather, it shows the necessary steps to interface with EspressoDB and interact with data.

As a side note: in this scenario, one inserts all of the entries for a given category (hamiltonian) at once. Therefore, the counting check does make sense. Arguably, we could also add a completed column to the category, or a completed property which internally counts the existing entries (see the sketch below).
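A minimal sketch of such a completed property, extending the illustrative Hamiltonian model from above; the n_expected field is hypothetical:

```python
class Hamiltonian(Base):
    n_sites = models.PositiveIntegerField()
    coupling = models.DecimalField(max_digits=10, decimal_places=6)
    # Hypothetical field: how many eigenvalues we expect for this Hamiltonian
    n_expected = models.PositiveIntegerField()

    @property
    def completed(self) -> bool:
        """True once all expected eigenvalues have been inserted."""
        return self.eigenvalues.count() >= self.n_expected
```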

It's not clear how EspressoDB helps "centralize and guarantee" data integrity from my reading of the docs. When multiple versions of the code are around with multiple people working on it, how does this code interact with a centralized database? How does the system deal with revisions of the code?

From the perspective of a scientific project, one challenge is to provide a framework that is easy to interact with (e.g., requiring no prior knowledge of SQL or Django). This encourages collaborators to perform analyses on the same dataset and to insert results into the same database. Of course, all of this is in principle provided by any database, but it is a non-trivial hurdle to get everyone on board without a relatively "plug and play" interface, given the varying propensity towards working with databases directly.

In addition, for the specific use case of lattice QCD calculations, EspressoDB is able to define tables that exactly mirror the computational workflow (which can be thought of as a directed multigraph) through modules provided by the EspressoDB Base class. In particular, with the ability to define constraints on a ManyToMany column, EspressoDB can express a specific computational task (with unique input parameters) as unique rows in database tables connected by foreign keys. Therefore, the same computational task is never repeated if the constraints are thoroughly implemented. For typical scientific projects that involve thousands to millions of computational tasks, EspressoDB helps ensure the integrity of the data generation process.
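As a sketch of the uniqueness idea in plain Django terms (the constraint below uses ForeignKey columns and the illustrative models from above; how EspressoDB handles ManyToMany constraints specifically is not reproduced here):

```python
# Continues the illustrative models.py sketch above
class SpectrumTask(Base):
    """Illustrative computational task made unique by its input parameters."""

    hamiltonian = models.ForeignKey(Hamiltonian, on_delete=models.CASCADE)
    n_level = models.PositiveIntegerField()
    solver = models.CharField(max_length=100)

    class Meta:
        constraints = [
            # The database rejects a second row with identical input parameters,
            # so the same task cannot be booked twice.
            models.UniqueConstraint(
                fields=["hamiltonian", "n_level", "solver"],
                name="unique_spectrum_task",
            )
        ]
```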

However, this is indeed an important concern you raise, and we have created new issues (see below) to address these topics. The general answer to these questions depends on what you mean:

  1. Revisions of the EspressoDB project
  2. Revisions of the software which generates data

In general, for a centralized multi-user scenario, you will have a remote database which is linked to the remote version of the EspressoDB project.

  1. Assuming there is no ill intention of users, the most likely scenario of unwanted behavior is that users work with different commits of the EspressoDB project. The most important cross-check would be that the remote database has the same table definitions as the local project. Pushing to different table definitions will fail locally in most cases. If this is not the case, intermediate data can potentially be inserted before the script crashes. A fix for this is a cross-check of the table migration history, run when EspressoDB is imported. On top of that, one could check the version of EspressoDB and, if the user provides a project version tag, also cross-check remote versions against local versions. This would ensure that local scripts only insert if the versions agree. We have filed a more detailed issue for this proposal.
  2. Indeed, we realize that it is important for projects to ensure the availability of the code revision history; for some, it might even be mandatory. However, because we do not have control over the external software that generates data, we cannot generally enforce the presence of external revisions. As stated above, the mandatory user (by database access) and last_modified information are stored by default. If one truly wants to ensure the presence of code revisions, one can make the tag column mandatory or even add more detailed mandatory columns. To make life easy, we have added a pre_save method to the Base class which users can utilize for default population (sketched below). We have also added further statements in the documentation.
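As an illustration of such default population, here is a hedged sketch that overrides the pre_save hook to record the current code revision in the optional tag column. The exact hook signature and the git lookup are assumptions; the text above only states that pre_save and the optional tag column exist:

```python
import subprocess

from django.db import models
from espressodb.base.models import Base  # assumed import path, as above


def current_git_revision() -> str:
    """Abbreviated commit hash of the local analysis code, or an empty string on failure."""
    try:
        out = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"])
        return out.decode().strip()
    except (OSError, subprocess.CalledProcessError):
        return ""


class Eigenvalue(Base):
    # ... fields as in the earlier sketch ...

    def pre_save(self):
        """Populate defaults right before the row is written (hook name as described above)."""
        if not self.tag:
            self.tag = current_git_revision()
```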

Can it guarantee integrity in the presence of buggy code, which might for example attempt to delete everything?

In general, it depends on the database access: the more control you give the user, the more "creative" side effects might occur. For example, we use PostgreSQL for LatteDB, which allows specifying different access levels for different users (since we are talking about collaborative approaches, we believe it is fair to assume more sophisticated databases than SQLite). If a user can only insert but not delete: yes, it is possible to prevent buggy "delete all" accidents. Once a user has admin access, basically anything can happen. We generally recommend creating backups of the database (for which many tools exist) to prevent significant data loss through any means. We will add a detailed statement in the documentation regarding access levels and data loss.

It is also possible to implement checks and tests on the Python side by overloading the check_consistency method (as we document; see the sketch below). However, if data is produced incorrectly in a way that is not picked up by these tests, it will still be written. The only protection here is inspecting the user/date and other columns (see the revision comment above) to clean up faulty pushes later on.
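A hedged sketch of such a check, assuming check_consistency is invoked before a row is saved and signals problems by raising an exception (the calling convention is an assumption; the checks themselves are made up for the illustrative Eigenvalue model):

```python
import math

from espressodb.base.models import Base  # assumed import path, as above


class Eigenvalue(Base):
    # ... fields as in the earlier sketch ...

    def check_consistency(self):
        """Illustrative sanity checks; raise to prevent the row from being written."""
        if not math.isfinite(self.value):
            raise ValueError("Eigenvalue is not a finite number.")
        if self.hamiltonian.eigenvalues.filter(n_level=self.n_level).exclude(pk=self.pk).exists():
            raise ValueError(f"Level {self.n_level} already exists for this Hamiltonian.")
```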

We see that the word "guarantee" is too strong for a general scenario and have changed it to "promote" in the paper.

Also note that LatteDB is the only live example in the paper, and it doesn't let me see any data (I get a login form), though it lets me open a submission form. This doesn't seem in line with "supports [...] open-data oriented projects", although I'm sure you might have your reasons for this configuration.

LatteDB currently contains data for upcoming publications, and in accordance with Department of Energy policies, the data will be made public after the paper is accepted. This ensures that we are able to produce papers from data generated with computing grants awarded to our collaboration before the data is released to the public. However, because of how the data and analysis are centralized, this framework also makes releasing data and analysis results a straightforward task (flipping a switch). This is a big step forward towards transparency and open science compared to how other lattice QCD collaborations release (if at all) the data and analysis presented in their publications.

The word "Supports" does not mean enforces---we want to give users the option to decide how data is handled. We have added a screenshot of an example for the login-required table views to this response (dynamically filterable from a web browser). If it helps with the publication process we can create a guest account for viewing available data.

(Screenshot: table-example)

As a side note: technically, the example project and unit tests are also live examples (though of no other use than demonstrating and testing features).