OxfordDemSci / ICS_Analysis

Mixed methods approach and interactive dashboard to analyse research impact through Impact Case Studies submitted to the UK's Research Excellence Framework (REF) 2021.
https://shape-impact.co.uk
GNU General Public License v3.0

Adding dataset generation pipeline to repository #38

Closed: MarkDVerhagen closed this issue 9 months ago

MarkDVerhagen commented 9 months ago

Including the dataset collection scripts in the current repository so that everyone can exactly reproduce the results with minimal data sharing among collaborators.

@bz-dev Can you provide me with the Excel file that ./src/analysis/topic_modelling/bert.py reads via sys.argv[1]? Can you also confirm that sys.argv[1] refers to the data (the actual .xlsx file) and sys.argv[2] refers to the directory containing the folders models, output, and figures? Finally, can you share the two arguments you pass to the script?

MarkDVerhagen commented 9 months ago

@crahal Can you confirm that the Dimensions data is the only dataset, apart from that in ./data/manual, that would need to be shared among collaborators (since not everyone can run the Google BigQuery queries without an account, etc.)?

MarkDVerhagen commented 9 months ago

@bz-dev If possible, could you also supply me with the raw Excel file you reference in the bert.py script?

bz-dev commented 9 months ago

@MarkDVerhagen The Excel file it processes is the output of clean_ics_level from src/data_wrangling_/1_clean_data/11_ref.py, which should by now have been merged into src/generate_dataset/make_enhanced_data.py.

And yes, sys.argv[1] refers to the data (actual .xlsx file) and sys.argv[2] refers to the directory containing folders models, output, figures.
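
For anyone reproducing this, the argument convention confirmed above can be sketched roughly as follows. This is only an illustration and not the actual contents of bert.py: the helper name resolve_paths and the example file and directory names are hypothetical, and only the meaning of sys.argv[1] and sys.argv[2] and the three folder names come from this thread.

```python
from pathlib import Path

# Folder names taken from the discussion above; everything else is illustrative.
EXPECTED_SUBDIRS = ("models", "output", "figures")


def resolve_paths(argv):
    """Split a sys.argv-style list into the data file and the working folders.

    argv[1] is the .xlsx data file; argv[2] is the directory that holds
    the models, output and figures folders.
    """
    if len(argv) != 3:
        raise ValueError("expected: <script> <data.xlsx> <base_dir>")
    data_file = Path(argv[1])
    base_dir = Path(argv[2])
    if data_file.suffix.lower() != ".xlsx":
        raise ValueError(f"expected an .xlsx file, got {data_file}")
    # Build the paths only; nothing is created or read from disk here.
    subdirs = {name: base_dir / name for name in EXPECTED_SUBDIRS}
    return data_file, base_dir, subdirs


# Hypothetical invocation, equivalent to: python bert.py ics.xlsx project_dir
data_file, base_dir, subdirs = resolve_paths(["bert.py", "ics.xlsx", "project_dir"])
print(data_file, subdirs["models"])
```

In a real script the list passed in would simply be sys.argv; keeping the parsing in a function makes the convention easy to check without touching the data.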