OxfordDemSci / ICS_Analysis

Mixed methods approach and interactive dashboard to analyse research impact through Impact Case Studies submitted to the UK's Research Excellence Framework (REF) 2021.
https://shape-impact.co.uk
GNU General Public License v3.0

dashboard data pipeline #54

Closed doug-leasure closed 8 months ago

doug-leasure commented 8 months ago

The new data and repo reorganisation creates a few issues for the dashboard database pipeline defined in reformat_csvs_for_db.py. I added our new source data with commits 507983f12a67de3a14b8b2110f4e8958c21187ef and 62e93a9775363c721ad8ebdd49ee986fce55adcd.

I will create a checklist here, and we can follow up with more detailed discussions of each point as needed.

GISRedeDev commented 8 months ago

@doug-leasure Just to confirm: the checklist above says to delete nn2_threshold0.01_reduced.xlsx, but we were using nn3_threshold0.01_reduced.xlsx. Does this make any difference?

doug-leasure commented 8 months ago

@GISRedeDev I think these can probably all be deleted, because we no longer want to use the model-based topic probabilities now that we have manually reassigned a subset of the ICS into new topics.

The topic_weights table in the database can now simply use 1 or 0 so that each ICS is assigned to only one topic with a probability of 1 (i.e. see the next item in the checklist).

GISRedeDev commented 8 months ago

@doug-leasure I think I've fixed the script to handle these changes. Just a couple of things:

I have taken each ICS row and created one row with that ID for each row in topics.csv. Regardless of the BERT_prob or max_prob in the ICS row, I have given the row in the weights table a probability of 1 where ics.topic_id matches topics.topic_id, and 0 for the rest of the rows for that ics_id. I saw that there were many NaN values in the ics table; these are ignored. Also, there are ics rows in which the BERT_prob and max_prob are much lower than 1 (see image); these have also been assigned a probability of 1 in the weights table (I hope this makes sense!).

image
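
For reference, the assignment rule described above could be sketched roughly as follows. This is only an illustration and not the actual code in reformat_csvs_for_db.py: the function name build_topic_weights, the list-of-dicts representation, the sample values, and the assumption that the NaN values sit in the topic_id column are all hypothetical.

```python
import math


def build_topic_weights(ics_rows, topic_ids):
    """Return one weight row per (ICS, topic) pair, using 1/0 instead of model probabilities.

    Hypothetical helper: assumes each ICS row is a dict with 'ics_id' and 'topic_id' keys.
    """
    weights = []
    for ics in ics_rows:
        assigned = ics.get("topic_id")
        # Ignore ICS rows with no assigned topic (None or NaN); assumed to be the NaN case.
        if assigned is None or (isinstance(assigned, float) and math.isnan(assigned)):
            continue
        for topic_id in topic_ids:
            weights.append({
                "ics_id": ics["ics_id"],
                "topic_id": topic_id,
                # BERT_prob and max_prob are deliberately not consulted here.
                "probability": 1 if topic_id == assigned else 0,
            })
    return weights


# Made-up sample data, for illustration only.
ics_rows = [
    {"ics_id": "a1", "topic_id": 2, "BERT_prob": 0.31, "max_prob": 0.35},
    {"ics_id": "b2", "topic_id": float("nan"), "BERT_prob": 0.90, "max_prob": 0.90},
    {"ics_id": "c3", "topic_id": 1, "BERT_prob": 0.99, "max_prob": 0.99},
]
topic_ids = [1, 2, 3]
weights = build_topic_weights(ics_rows, topic_ids)
```

With this sample input, "a1" and "c3" each get three rows (one per topic) with exactly one probability of 1, and "b2" is skipped because it has no topic assigned.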

For the narrative, I decided to hardcode the HTML into the database. That is when I noticed there are no keywords in topics.csv. Should there be?

I've also made a table for the UK region lookup. I've not made the database yet or added any functionality to the API yet, but will let you know once I've made some progress.

doug-leasure commented 8 months ago

@GISRedeDev, thanks! I think this is all good.

  1. Hard coding the narrative html into the database as before is okay.
  2. Let's drop the keywords from the dashboard.
  3. You've done the right thing by assigning probability = 1 where ics.topic_id == topics.topic_id and probability = 0 otherwise.
  4. No problem that there are some rows where BERT_prob and max_prob << 1. These should still get probability = 1 for the topic_id they are assigned to.

Good work and thanks for persevering through those changes. I think we're on track.

I will close this issue. Let's open new issues for other tasks where needed.