Sasheel - Data source discrepancies

Hannah-Davies commented 3 years ago

Just wanted to check the process for updating the data sources on the Phenotype Library with the datasets registered on the Gateway. I have noted the following discrepancies, but I am unable to fix them myself:

Missing Gateway Dataset URLs:
• ID: 3 - Civil Registration - Deaths
• ID: 8 - CPRD Aurum (Same as ID: 6)
• ID: 17 - SMR01
• ID: 16 - HES APC
• ID: 19 - Office for National Statistics - Death Registration Data?
• ID: 7 - CPRD GOLD

Uncertain Datasets:
• ID: 27 - Clinical Practice Research Datalink - There is no CPRD dataset. This should be either CPRD GOLD or AURUM
• ID: 19 - Primary Care - Suggest to remove
• ID: 24 - QResearch? Which dataset?
• ID: 25 & ID: 14 - THIN are the same data source

Hannah-Davies commented 3 years ago

@shahzadmumtaz22 is there any update on this, please?

shahzadmumtaz22 commented 3 years ago

Hi Hannah-Davies,

I have replied via email.

For UoM, all fixes are done and the GitHub repository is updated.
For CALIBER, I have made a PR for Spiros to review and merge (though GitHub is not passing some tests).
As mentioned, some issues need to be fixed in the API import script.

Hannah-Davies commented 3 years ago

@ieuans are you able to look at the other issues?

Email text (as I was unable to attach the email itself):

Please find below my feedback/investigation results (marked as red text) for each of the data source related issues highlighted by Susheel. Mainly, there were three different kinds of issues:

While importing, the script missed the URL and id of some data sources even though they are present in data_sources.yml, their associated phenotype files have the right names, and the links work. @Ieuan Scanlon, would you please investigate your import script to fix this issue?
In some cases, the markdown definitions of the CALIBER phenotypes use a data source name that is not in the data_sources.yml file, so our import script treated it as a separate data source (as per my understanding; @Ieuan Scanlon, correct me if this is not the case). I think the quickest fix is to put the correct data source names in the markdown phenotype definition files (in line with data_sources.yml). I can make the changes to the CALIBER GitHub resources and will make a pull request for Spiros to review/merge.
In the third scenario, the two GitHub resources (the CALIBER GitHub resource managed by Spiros and the UoM GitHub resource managed by me) use different spellings/capitalizations for the same data sources. This caused our import script to treat them as separate data sources. For UoM, there were only a few such cases; they are fixed now and the GitHub resource is updated. If we make another import before the launch, this should be fixed automatically.
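The second and third kinds of issue both come down to how names are matched during import. Below is a minimal sketch of case-insensitive matching against a list of canonical names. It is not the actual import script: the function names `normalise`, `build_index` and `resolve` are hypothetical, and the canonical list is a small illustrative sample rather than the real contents of data_sources.yml.

```python
import re


def normalise(name: str) -> str:
    """Collapse whitespace and ignore case so spelling variants compare equal."""
    return re.sub(r"\s+", " ", name).strip().casefold()


def build_index(canonical_names):
    """Map each normalised name back to its canonical spelling."""
    return {normalise(n): n for n in canonical_names}


def resolve(name, index):
    """Return the canonical name for `name`, or None if it is unknown."""
    return index.get(normalise(name))


# Illustrative sample only (assumption), not the real data_sources.yml.
canonical = ["CPRD Aurum", "CPRD GOLD", "HES APC"]
index = build_index(canonical)

print(resolve("CPRD AURUM", index))    # CPRD Aurum
print(resolve("  hes  apc ", index))   # HES APC
print(resolve("Primary care", index))  # None
```

With matching like this, a name that resolves to `None` could be reported for manual review instead of silently becoming a new data source. It would not catch names that differ by more than case or spacing, such as one with a trailing abbreviation.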
Issues raised by Susheel and their investigation results (marked as red):

Missing Gateway Dataset URLs:
• ID: 3 - Civil Registration - Deaths (The URL for this one is available in the CALIBER GitHub repository (maintained by Spiros); we might have missed it in the import process.) A screenshot of the "data_sources.yml" file in the GitHub repository is given below.
• ID: 8 - CPRD Aurum (Same as ID: 6) (Both versions (ID: 6 and ID: 8) are the same dataset. The problem seems to be that some phenotype definitions use the capitalised AURUM (which has no link in the data source file), whereas others use Aurum (which links to a data source in data_sources.yml). I can fix that at the source because only 4 phenotype definitions in the CALIBER GitHub repository use the capitalised AURUM.)
• ID: 17 - SMR01 The GitHub repository has the id and URL for this data source as well (https://github.com/spiros/hdr-caliber-phenotype-library/blob/master/_data/data_sources.yml). Only one phenotype definition uses it, and it uses the correct data source, so I don't know why the id and URL are not showing in the data source. @Ieuan Scanlon, will you please check what is going wrong when importing this from the GitHub resource?

• ID: 16 - HES APC The problem seems to be in the source markdown file, where the data source name is not given correctly; the import appears to take that wrong name (which cannot be linked to the data sources) and create an additional data source. A quick fix is for me to change the GitHub source and make a PR for Spiros to review/merge. Note that the data_sources.yml file has this dataset, including its id and URL.
• ID: 19 - Office for National Statistics - Death Registration Data? There is only one phenotype associated with this, and its hyperlink works. The data_sources.yml file has the id and URL. Something must have gone wrong with the URL and id during import.
• ID: 7 - CPRD GOLD There are four phenotypes associated with it, and this can be fixed at the source GitHub repository by changing the data source to "Clinical Practice Research Datalink GOLD" instead of "Primary care (Clinical Practice Research Datalink GOLD)".

Uncertain Datasets:
• ID: 27 - Clinical Practice Research Datalink - There is no CPRD dataset. This should be either CPRD GOLD or AURUM. This is fixed in the source: GOLD was missing, so the import script created another data source.
• ID: 19 - Primary Care - Suggest to remove I think the id of this one should be 20, not 19. In the source it is written as "primary", so it is not clear whether we should consider it GOLD or AURUM. There is only one phenotype under this data source, and it has no associated publication. It is better to get some input on this from Spiros.
• ID: 24 - QResearch? Which dataset? Some of the UoM phenotypes are associated with this primary care dataset. I can't find this dataset in the healthdatagateway. The link to this dataset is external (i.e. https://www.qresearch.org/).

• ID: 25 & ID: 14 - THIN are the same data source. They are the same dataset, and I think they appear separately because the two GitHub repositories (CALIBER and UoM) have a slight variation in the name: one ends with the abbreviation and one does not. I have amended the UoM source, so this should not appear in the next import if we plan one before the release. I was not able to find this dataset in the healthdatagateway, and the link to it is external (i.e. https://www.the-health-improvement-network.com/).
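The kinds of problem listed above (entries without a URL, and duplicate entries that differ only in capitalisation) could be flagged automatically after an import. The following is a rough sketch under stated assumptions: the record layout (dicts with `id`, `name` and `url` keys) and the function name `audit` are hypothetical, and the sample records, including the placeholder URL, are made up for illustration and are not taken from the Phenotype Library.

```python
from collections import defaultdict


def audit(sources):
    """Return (ids with no URL, groups of ids whose names match ignoring case)."""
    missing_url = [s["id"] for s in sources if not s.get("url")]

    by_name = defaultdict(list)
    for s in sources:
        by_name[s["name"].strip().casefold()].append(s["id"])
    duplicates = [ids for ids in by_name.values() if len(ids) > 1]

    return missing_url, duplicates


# Hypothetical sample records (assumption); the URL is a placeholder.
sources = [
    {"id": 6, "name": "CPRD Aurum", "url": "https://example.org/placeholder"},
    {"id": 8, "name": "CPRD AURUM", "url": None},
    {"id": 17, "name": "SMR01", "url": None},
]

missing_url, duplicates = audit(sources)
print(missing_url)  # [8, 17]
print(duplicates)   # [[6, 8]]
```

A check like this only groups names that are identical apart from case and surrounding spaces, so a pair such as the two THIN entries, where one name carries the abbreviation and the other does not, would still need manual review.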

I hope this helps. If you have any further queries, please let me know.

Hannah-Davies commented 3 years ago

@shahzadmumtaz22 we still think there is an issue with the below:
• Cardiovascular code list - secondary data under primary
• Pneumonia - references SNOMED instead of UK Biobank
• Ethnic status - table still incorrect for ethnicity coding

SwanseaUniversityMedical / concept-library

Sasheel - Data source discrepancies #440