OxfordDemSci / ICS_Analysis

Mixed methods approach and interactive dashboard to analyse research impact through Impact Case Studies submitted to the UK's Research Excellence Framework (REF) 2021.
https://shape-impact.co.uk
GNU General Public License v3.0

Missing institution codes present in results data but not in ICS data #1

Closed MarkDVerhagen closed 9 months ago

MarkDVerhagen commented 1 year ago

Some institutions in the results .xlsx are missing from the ICS .xlsx:

import os

import numpy as np
import pandas as pd

# raw_path is not defined in this snippet; it is the directory holding the raw .xlsx files
raw_ics = pd.read_excel(os.path.join(raw_path, 'raw_ics_data.xlsx'))
raw_results = pd.read_excel(os.path.join(raw_path, 'raw_results_data.xlsx'),
                            skiprows=6)

# Drop rows whose UKPRN is a blank placeholder
raw_results = raw_results[raw_results['Institution code (UKPRN)'] != ' ']
raw_ics = raw_ics[raw_ics['Institution UKPRN code'] != ' '].copy()

results_ins_ids = [int(i) for i in raw_results['Institution code (UKPRN)'].unique()]  # UKPRN in results
ics_ins_ids = [int(i) for i in raw_ics['Institution UKPRN code'].unique()]  # UKPRN in ics

np.mean([i in results_ins_ids for i in ics_ins_ids])  # 100%: every ICS institution appears in results

np.mean([i in ics_ins_ids for i in results_ins_ids])  # 98.7%: some results institutions are absent from ICS

[i for i in results_ins_ids if i not in ics_ins_ids]  # [10009315, 10005700]
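
The same comparison can be sketched with set operations, which return the missing codes directly rather than a coverage fraction. This is a minimal illustration on made-up rows, not the repository's code: the helper name `missing_ukprns` and the toy frames are assumptions; only the two column names and the two reported UKPRNs come from the snippet above.

```python
import pandas as pd


def missing_ukprns(results, ics,
                   results_col='Institution code (UKPRN)',
                   ics_col='Institution UKPRN code'):
    """Return sorted UKPRNs present in `results` but absent from `ics`.

    Hypothetical helper for illustration only. Blank placeholders such
    as ' ' are coerced to NaN and dropped before comparing.
    """
    def to_id_set(series):
        numeric = pd.to_numeric(series, errors='coerce').dropna()
        return {int(i) for i in numeric}

    return sorted(to_id_set(results[results_col]) - to_id_set(ics[ics_col]))


# Toy data: made up, apart from the two codes reported above
toy_results = pd.DataFrame(
    {'Institution code (UKPRN)': [10000001, 10009315, 10005700, ' ']}
)
toy_ics = pd.DataFrame({'Institution UKPRN code': [10000001, ' ']})

missing = missing_ukprns(toy_results, toy_ics)
print(missing)
```

A set difference avoids the repeated list membership scans and makes the direction of the mismatch (in results, not in ICS) explicit.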

doug-leasure commented 9 months ago

@MarkDVerhagen , has this issue been resolved?

MarkDVerhagen commented 9 months ago

Yes, the check below passes:

import pandas as pd
import os

enhanced_ref_data = pd.read_csv(
    os.path.join('data', 'final', 'enhanced_ref_data.csv')
)

assert enhanced_ref_data['inst_id'].isna().sum() == 0
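
If a check like this ever fails, it may help to see which rows are affected rather than only a count. A minimal sketch on a toy frame, assuming the same `inst_id` column; the helper name `rows_missing_inst_id`, the `label` column, and all values are made up for illustration:

```python
import pandas as pd


def rows_missing_inst_id(df, col='inst_id'):
    """Return the rows of `df` whose `col` is NaN (an empty frame if none)."""
    return df[df[col].isna()]


# Toy frame standing in for enhanced_ref_data.csv (values are made up)
toy = pd.DataFrame({
    'inst_id': [10000001.0, None, 10005700.0],
    'label': ['a', 'b', 'c'],
})

bad = rows_missing_inst_id(toy)
n_missing = len(bad)
print(bad)

# On real data the equivalent assertion would be: assert n_missing == 0
```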