bridgedb / datasources

Repository with the BridgeDb data source.
Creative Commons Zero v1.0 Universal
4 stars 8 forks source link

Add data integrity tests for MIRIAM URIs/URNs #23

Closed cthoyt closed 2 years ago

cthoyt commented 2 years ago

I found that there were several URNs pointing to prefixes in MIRIAM that do not exist. This pull request adds unit tests that check that all entries in the uri column starting with urn:miriam: are indeed valid MIRIAM prefixes.

Entries for the following were updated to use the correct MIRIAM prefixes:

Entries for the following are not available in MIRIAM and were removed:

For SPRINT and Unipathway, there are alternative prefixes from the Bioregistry (https://bioregistry.io/sprint and https://bioregistry.io/upa). For BIND and IPI, we can add prefixes to the Bioregistry and mark them as "deprecated", if that helps.

egonw commented 2 years ago

ping @Finterly, you last year some issues with upper/lower case things in the datasources.csv, so please take notice of the changes in this patch. no action is needed otherwise whatsoever.

cthoyt commented 2 years ago

I'm not quite sure I understood your feedback @egonw, are there any changes needed for this PR? Or do you have to first consider the downstream impact of standardizing terminology here?

DeniseSl22 commented 2 years ago

@egonw : any time to check this PR from @cthoyt ?

egonw commented 2 years ago

cc @Finterly

egonw commented 2 years ago

@cthoyt:

E   ModuleNotFoundError: No module named 'bs4'
cthoyt commented 2 years ago

getting there!

egonw commented 2 years ago

ping @cthoyt, there was overlap with the other PR. I pulled in one patch from this PR.

Can you have a look? And salvage other bits? For the changes in the datasources.tsv, please put this in an isolated PR because I want @Finterly to have a look at that, because of PathVisio depending on that info.

cthoyt commented 2 years ago

@egonw not sure if makes sense to put the PR for tests that are going to fail - the idea would be that once you implement some tests, you update the data to follow that standard and never let the tests fail again. But if that's what you want, I could separate these.

in the mean time I resolved the diff

egonw commented 2 years ago

@Finterly, @mkutmon, please take note of the changes in upper/lower casing in the datasources.tsv. Fin, if I remember correctly, there was something about upper/lower case last year, but I cannot quite remember what it was. Are these changes going to cause any downstream issues?

cthoyt commented 2 years ago

@Finterly @mkutmon feedback would be appreciated

egonw commented 2 years ago

Okay, there are merge conflicts, so will have to do this one manually.

cthoyt commented 2 years ago

@egonw not sure what the issue is, in the most recent commit on this branch 289e54b I addressed all conflicts with the main branch

egonw commented 2 years ago

This branch cannot be rebased due to conflicts

It says "This branch cannot be rebased due to conflicts" ... i will explore, but after the meeting I am in.

cthoyt commented 2 years ago

Are you looking at the github interface for merging or trying to do this locally?

egonw commented 2 years ago

github interface

cthoyt commented 2 years ago

then I'd suggest doing "Squash and merge" which would turn all commits into this branch into a single one as it's merged. There isn't really any valuable information in the small commits made during this PR