elixir-luxembourg / daisy

Data Information System (DAISY) is a data bookkeeping application designed to help Biomedical Research institutions with their GDPR compliance.
GNU Affero General Public License v3.0
13 stars 10 forks source link

Add reporting feature to DAISY #99

Open pinarpink opened 5 years ago

pinarpink commented 5 years ago

Is your feature request related to a problem? Please describe.

Users of DAISY, data stewards, DPOs need to be able to create reports. E.g.
(1) The set of data that do not have a storage end date set and the storage criteria field is also empty. (2) All data that have been transferred to Partner A, or partners out of the EU

Describe the solution requested

We need a reporting solution, maybe initially suited to power users, who would build the query themselves .

Describe possible alternatives

Pick from below based on prior experience https://djangopackages.org/grids/g/reporting/

Other info context

jLebioda commented 4 years ago

After reviewing available options from https://djangopackages.org/grids/g/reporting/, I would say two options are relatively good (still mantained, claiming to be production ready):

  1. Use Django SQL Explorer (https://github.com/groveco/django-sql-explorer) a. Add it to the requirements in setup.py, and settings.py b. Restrict access, so only superuser could access the SQL panel (it'd give very powerful access to data discovery) c. Add another user to the postgres database, it must not have write access d. Construct some example reporting queries and save them for later reuse

  2. Use Django Excel Response (https://github.com/tarkatronic/django-excel-response) a. Add it to the requirements in setup.py, and settings.py b. Create some example reporting views, that would return Django's ORM results (like: return Project.objects.filter(something)) c. Wrap the views in ExcelResponse d. Add a simple page available to superusers, that would offer downloading excel/csv reports.

  3. Similar to option no. 2 - generate PDF reports ourselves

I could prepare PoC of either option (or all of them); what do you think?

pinarpink commented 4 years ago

Hi @jLebioda I prefer to go with option 1. We already have a reporter user (read-only) on the DB.

@neoflex @vildead what are your thoughts?

vildead commented 4 years ago

The first option looks really cool. What is the expected administration/coding overhead? The second option seems to be more light.

jLebioda commented 4 years ago

I think both options should be pretty straightforward and light; I'll prepare a PoC then

vildead commented 4 years ago

Here are some initial SQL commands: Projects:

WITH
    /* get projects */
    projects
    AS (SELECT id project_id,
              acronym,
              has_cner,
              cner_notes,
              has_erp,
           erp_notes,
              umbrella_project_id
       FROM   PUBLIC.core_project),
    /*get all funding sources as one value per project*/
    funding_sources
    AS (SELECT project_id,
               String_agg(NAME, '; ') funding
        FROM   PUBLIC.core_fundingsource a
               RIGHT OUTER JOIN PUBLIC.core_project_funding_sources b
                             ON b.fundingsource_id = a.id
        GROUP  BY project_id),
    /*compute number of datasets for each project*/
    n_datasets
    AS (SELECT project_id,
               count(*) number_of_datasets
        FROM   core_dataset
        WHERE  project_id IS NOT NULL
        GROUP  BY project_id),
    umbrella_projects
    as (SELECT id project_id,
               acronym umbrella_project_acronym
        FROM   PUBLIC.core_project),
   /*documents */
    ethics_approvals
    AS (SELECT DISTINCT
              object_id project_id,
              't' has_ethics_approval_document 
       FROM   core_document cd 
       INNER JOIN django_content_type ct 
              ON  cd.content_type_id = ct.id
       WHERE  domain_type = 'ethics_approval' AND model = 'project'),
    /* local custodians*/
     custodians
     AS (
       SELECT 
              project_id, 
              string_agg(username, '; ') local_custodians
       FROM 
              core_project_local_custodians pc 
       INNER JOIN core_user u 
              ON pc.user_id = u.id
       GROUP BY project_id
)

SELECT a.*,
      COALESCE(ea.has_ethics_approval_document, 'f') has_ethics_approval_document,
      cu.local_custodians,
      up.umbrella_project_acronym,
      c.funding,
      COALESCE(d.number_of_datasets, 0) number_of_datasets
FROM   projects a
      LEFT JOIN n_datasets d
             ON a.project_id = d.project_id
      LEFT JOIN funding_sources c
             ON a.project_id = c.project_id
      LEFT JOIN umbrella_projects up
             ON a.umbrella_project_id = up.project_id
      LEFT JOIN ethics_approvals ea
             ON a.project_id = ea.project_id
      LEFT JOIN custodians cu 
             ON a.project_id = cu.project_id

Datasets

WITH datasets
     AS (SELECT a.id AS dataset_id,
                title,
                project_id,
                b.name sensitivity
         FROM   core_dataset a
                LEFT JOIN core_sensitivityclass b
                       ON b.id = a.sensitivity_id),
     /* number of data declarations */
     n_declarations
     AS (SELECT dataset_id,
                Count(*) number_of_datadeclarations
         FROM   core_datadeclaration
         GROUP  BY dataset_id),
     /* number of shares*/
     n_shares
     AS (SELECT dataset_id,
                Count(*) number_of_shares
         FROM   core_share
         GROUP  BY dataset_id),
     /* number of storages */
     n_locations
     AS (SELECT dataset_id,
                Count(*) number_of_locations
         FROM   core_datalocation
         GROUP  BY dataset_id)
SELECT a.title,
       a.sensitivity,
       COALESCE(b.number_of_datadeclarations, 0) number_of_datadeclarations,
       COALESCE(c.number_of_shares, 0)           number_of_shares,
       COALESCE(d.number_of_locations, 0)           number_of_locations,
       a.dataset_id,
       a.project_id
FROM   datasets a
       LEFT JOIN n_declarations b
              ON a.dataset_id = b.dataset_id
       LEFT JOIN n_shares c
              ON a.dataset_id = c.dataset_id
       LEFT JOIN n_locations d
              ON a.dataset_id = d.dataset_id  

Data declarations

/*data declarations*/
WITH declarations 
    AS (SELECT 
               dd.title,
               da.title as dataset_title,
               dd.comments,
               dd.end_of_storage_duration,
               dd.storage_duration_criteria,
               dd.has_special_subjects,
               dd.id as datadeclaration_id,
               dd.dataset_id,
               dd.contract_id,
               dd.partner_id
        FROM   core_datadeclaration dd
        INNER JOIN core_dataset da ON dd.dataset_id = da.id
      ),
      /*number of locations*/
      n_locations 
    AS (SELECT datadeclaration_id,
               COUNT(datalocation_id) number_of_locations
          FROM core_datalocation_data_declarations
          GROUP BY datadeclaration_id)
SELECT a.*,
       COALESCE(b.number_of_locations,0) number_of_locations
FROM   declarations a LEFT JOIN n_locations b USING (datadeclaration_id)
jLebioda commented 4 years ago

How two first queries from the snippet above look like in the SQL explorer:

Screenshot 2020-04-17 at 13 25 56 Screenshot 2020-04-17 at 13 26 26

Django-excel-response works flawlessly, but choosing the data for export must be done in Python - e.g. to report the projects without ethics approval:

def projects_without_ethics_approval(request):
    objs = Project.objects.filter(has_erp=False)
    return ExcelResponse(objs)

for exporting the data that are more complex than objects from a model, custom list/dict can be created

jLebioda commented 4 years ago

exceldata.xlsx

^this comes from dummy data

jLebioda commented 4 years ago

Partially addressed in #163

Here I will only add django-sql-explorer

vildead commented 4 years ago

Dear @jLebioda , is the Django SQL Explorer feature ready? The exports looks great!

We might create a patch/minor to today's release :)

jLebioda commented 4 years ago

Dear @vildead the preview of the feature is available on a separate branch https://github.com/elixir-luxembourg/daisy/tree/99-add-reporting-sql-exporer It was done some time ago, thus it might need a rebase

pinarpink commented 3 years ago

@jLebioda this is currently only available in debug mode. Can we discuss in the next DAISY regular meeting whether we can move this to production, then we will develop a few reports.

jLebioda commented 3 years ago

@jLebioda this is currently only available in debug mode. Can we discuss in the next DAISY regular meeting whether we can move this to production, then we will develop a few reports.

@pinarpink Yes, naturally!