ebmdatalab / global-trial-landscape


Level of granularity of ctgov sponsor data #17

Open ccunningham101 opened 11 months ago

ccunningham101 commented 11 months ago

Do we actually want to use ROR for ctgov sponsor names? Around 5% of sponsors get mapped to the same ROR record as another sponsor name.

Most generally make sense:
array(['Columbia University', 'Teachers College, Columbia University'], dtype=object)
array(['Second Affiliated Hospital, School of Medicine, Zhejiang University', 'Zhejiang University'], dtype=object)
array(['University of Michigan Rogel Cancer Center', 'University of Michigan'], dtype=object)
array(['National Taiwan University Hospital', 'National Taiwan University Hospital Hsin-Chu Branch'], dtype=object)

Some we might deem necessary:
array(['University of Alexandria', 'Alexandria University'], dtype=object)

But for others, maybe there was a reason to keep them separate, and we should trust the separate accounts that were created on ctgov:
array(['NYU Langone Health', 'New York University', 'NYU College of Dentistry'], dtype=object)
array(['University of North Carolina, Chapel Hill', 'UNC Lineberger Comprehensive Cancer Center'], dtype=object)
array(['Wake Forest University', 'Wake Forest University Health Sciences'], dtype=object)
array(['Mansoura University', 'Mansoura University Children Hospital'], dtype=object)

And some are incorrect (maybe because one or more of the sites does not exist in ROR):
array(['Mayo Clinic', 'Malo Clinic'], dtype=object)

In the absence of other information, such as city/country, it will be hard to check the ROR match.
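For reference, groups like the ones above could be produced with something along these lines. This is only a sketch: the `sponsor_to_ror` dict and the `ror-A`-style IDs are placeholders for illustration, not the real pipeline output or real ROR identifiers.

```python
from collections import defaultdict

# Hypothetical mapping of ctgov sponsor name -> matched ROR record.
# The IDs are placeholders, not real ROR identifiers.
sponsor_to_ror = {
    "Columbia University": "ror-A",
    "Teachers College, Columbia University": "ror-A",
    "Mayo Clinic": "ror-B",
    "Malo Clinic": "ror-B",
    "Oxford University": "ror-C",
}


def merged_groups(mapping):
    """Return the groups of sponsor names that share one ROR record."""
    by_ror = defaultdict(list)
    for name, ror_id in mapping.items():
        by_ror[ror_id].append(name)
    # Keep only records that more than one sponsor name was mapped to.
    return {
        ror_id: sorted(names)
        for ror_id, names in by_ror.items()
        if len(names) > 1
    }


groups = merged_groups(sponsor_to_ror)

# Share of sponsor names that ended up merged with at least one other name.
merged_share = sum(len(names) for names in groups.values()) / len(sponsor_to_ror)
```

On the toy data, `groups` holds the Columbia pair and the Mayo/Malo pair, which is the kind of list that then needs checking by eye.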

ccunningham101 commented 11 months ago
  1. Decision: go with the level of granularity that ROR has, e.g. if it combines NYU Langone Health and New York University, but does not combine Oxford University Hospitals NHS Trust and Oxford University, that is okay
  2. We can scan the data manually for incorrect mappings, e.g. Mayo Clinic/Malo Clinic
  3. We could maybe borrow site data to get a city/country to add to each sponsor
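To cut down the manual scan in point 2, one option might be to pre-filter the merged groups with a simple heuristic. The sketch below flags a group when two of its names share no distinctive word once generic words are removed. The `GENERIC` word list and both function names are assumptions made up for this example, not anything in the repo.

```python
import re
from itertools import combinations

# Assumed list of generic words that say nothing about which institution it is.
GENERIC = {
    "university", "hospital", "clinic", "college", "center", "centre",
    "school", "of", "the", "and", "health", "medicine",
}


def distinctive_tokens(name):
    """Lowercase alphabetic words in a name, minus the generic ones."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return {t for t in tokens if t not in GENERIC}


def needs_review(names):
    """True if any pair of names in a merged group shares no distinctive word."""
    return any(
        not (distinctive_tokens(a) & distinctive_tokens(b))
        for a, b in combinations(names, 2)
    )


print(needs_review(["Mayo Clinic", "Malo Clinic"]))
print(needs_review(["University of Alexandria", "Alexandria University"]))
```

This flags Mayo Clinic/Malo Clinic and passes University of Alexandria/Alexandria University. It also flags acronym cases such as NYU Langone Health/New York University, so it would only shorten the list to check by hand, not replace the check.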