OxfordDemSci / ICS_Analysis

Mixed methods approach and interactive dashboard to analyse research impact through Impact Case Studies submitted to the UK's Research Excellence Framework (REF) 2021.
https://shape-impact.co.uk
GNU General Public License v3.0
5 stars 0 forks source link

Create database #17

Closed doug-leasure closed 1 year ago

doug-leasure commented 1 year ago

We need to create the dockerised PostgreSQL database that will drive the dashboard. The code and data for it will go into ./src/dashboard/sql/. A first draft of the schema is attached here: ICS_db_schema.pdf.

doug-leasure commented 1 year ago

The source data needed to support the db schema above are found primarily in two places in the repo.

ics table The "ics" table will be taken from ./data/enriched/enriched_ref_ics_data.csv. Note: I only included columns in the schema that I thought were strictly required for the dashboard graphs, but we will likely end up includig all of the colums from the enriched data in the database so that dashboard users can browse the full data. The enriched data can be generated by running the following scripts sequentially:

  1. ./src/data_wrangling/0_get_data/01_ref.py
  2. ./src/data_wrangling/1_clean_data/11_ref.py
  3. ./src/data_wrangling/2_enrich_data/23_ics.py

topic_weights The "topics_weights" table will be taken from ./data/topic_outputs/production_model/ics_data_modelling_top_5_full_text.csv which is saved directly into the repo. We will be getting a long format version of this soon from @lindali97 with the probability of each ICS belonging to EVERY topic rather than just the top 5 topics in wide format.

topics table The "topics" table will contain a single row for every topic. We will manually assign names based on the results in ./data/topic_modelling/production_model/BERT_keywords_full_text.csv. The "groups" will be assigned semi-manually by @crahal based on the results from ./data/topic_modelling/production_model/*_bert_model_hierarchy.csv.

GISRedeDev commented 1 year ago

Just leaving this here as things that still need to be done/confirmed

GISRedeDev commented 1 year ago

@doug-leasure Apologies for the delay in getting this db ready. I have just pushed the database and a basic api to 17-create-database. I had a bit of trouble transferring the db from my local env to the docker-compose, so for now I have hard-coded the passwords for the DB into the code. I will tidy this up soon. I just wanted to get it to you to have a look/play.

As usual, this can be spun up using docker-compose up --build in the ./src/dashboard directory. The API should be served to localhost:8000. I have used open-api specification, so the endpoints are defined in app/api-config.yaml (let me know if this doesn't make sense). The docs for the api, as well as a testing area for the endpoints will be served to localhost:8000/api/ui where you can test out the endpoints. You can also call them directly at (for e.g.) localhost:8000/api/init. I have used random words for the topic names, descriptions etc until we get the final data. Please take a look at the queries used and data returned to check if these are the algorithms you were intending to use. For the minute there are only 2 endpoints bit we will build more functionality soon.

Let me know if anything needs explaining.

doug-leasure commented 1 year ago

Closing this as resolved by PR #18, noting that @GISRedeDev raised some issues as tick boxes above that need to be raised as separate issues if they have not already been resolved.