Open cbizon opened 2 years ago
https://github.com/mapping-commons/sssom
We need to figure out a few things if we go to sssom:
@matentzn tells me that these problems can be handled with off-the-shelf sssom.
{"type": "biolink:Disease", "ic": "100", "identifiers": [{"i": "MONDO:0018670", "l": "symptomatic form of fragile X syndrome in female carrier"}, {"i": "ORPHANET:449291", "l": "Symptomatic form of fragile X syndrome in female carrier"}, {"i": "UMLS:CN237736"}]}
Assuming MONDO:0018670 is the clique leader (sssom 0.9.0, not sssom 1.0), a sssom file would look something like this:
subject_id | subject_label | subject_category | predicate_id | object_id | object_label | object_category | match_type | other |
---|---|---|---|---|---|---|---|---|
MONDO:0018670 | symptomatic form of fragile X syndrome in female carrier | biolink:Disease | skos:exactMatch | ORPHANET:449291 | Symptomatic form of fragile X syndrome in female carrier | biolink:Disease | HumanCurated | { subject_information_content: 100 } |
MONDO:0018670 | symptomatic form of fragile X syndrome in female carrier | biolink:Disease | skos:exactMatch | UMLS:CN237736 | | biolink:Disease | HumanCurated | { subject_information_content: 100 } |
There are some features for natively supporting semantic similarity measures, see https://mapping-commons.github.io/sssom/Mapping/, but I don't think subject_information_content would qualify for that.
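To make the conversion above concrete, here is a minimal sketch of turning one compendium line into rows like the ones in the table. It assumes the first identifier in the list is the clique leader, and it reuses the skos:exactMatch predicate, the HumanCurated match type, and the "other" formatting from the example. The function name `compendium_line_to_rows` is made up for illustration and is not part of Babel or sssom-py.

```python
import json


def compendium_line_to_rows(line, predicate="skos:exactMatch", match_type="HumanCurated"):
    """Turn one compendium JSON line into SSSOM-style row dicts.

    Assumption: the first identifier in the clique is the clique leader.
    """
    clique = json.loads(line)
    leader, *others = clique["identifiers"]
    category = clique["type"]
    # Carry the information content in the free-form "other" column, if present.
    other = f"{{ subject_information_content: {clique['ic']} }}" if clique.get("ic") else ""
    rows = []
    for ident in others:
        rows.append({
            "subject_id": leader["i"],
            "subject_label": leader.get("l", ""),
            "subject_category": category,
            "predicate_id": predicate,
            "object_id": ident["i"],
            "object_label": ident.get("l", ""),  # empty cell when there is no label
            "object_category": category,
            "match_type": match_type,
            "other": other,
        })
    return rows


line = (
    '{"type": "biolink:Disease", "ic": "100", "identifiers": ['
    '{"i": "MONDO:0018670", "l": "symptomatic form of fragile X syndrome in female carrier"}, '
    '{"i": "ORPHANET:449291", "l": "Symptomatic form of fragile X syndrome in female carrier"}, '
    '{"i": "UMLS:CN237736"}]}'
)
rows = compendium_line_to_rows(line)
```

Note that an identifier without a label produces an empty object_label cell rather than shifting the remaining columns.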
Thanks! Is it required to repeat the subject_labels or categories etc when they are repeated?
If we are using the ordering of the rows as information, are we abusing the format?
I would keep the information redundant with the labels, but nothing in sssom requires you to. I like that in general so that I can more easily combine different mappings sets, merge them etc.
I think expecting the row order to mean something is not very reliable.
If you wanted to be 100% reliable you could of course export all cliques as separate sssom files. This is what I think Chris does. But it would result in 5000 files. It's an interesting use case. Maybe if you could create an identifier for each clique, you could put it into the "other" column. Sorry, maybe sssom is not ideal here, but we could consider extensions to the format to cover this use case (named groups for mappings).
I suppose we could put a clique id of some sort in the other column. And perhaps an index to define the order if we don't want to rely on the row order...
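A rough sketch of that idea, in case it helps the discussion: each row gets a clique id and an explicit index in the "other" column, and a reader regroups by those instead of trusting row order. The `clique_id=...|clique_index=...` encoding and the helper names `tag_rows` and `regroup` are assumptions for illustration only, not something sssom or Babel defines.

```python
from collections import defaultdict


def tag_rows(rows, clique_id):
    """Return copies of the rows with a clique id and explicit position in 'other'."""
    return [
        {**row, "other": f"clique_id={clique_id}|clique_index={i}"}
        for i, row in enumerate(rows)
    ]


def regroup(rows):
    """Rebuild ordered cliques from tagged rows, ignoring row order in the file."""
    cliques = defaultdict(list)
    for row in rows:
        # Parse the hypothetical "key=value|key=value" encoding of the other column.
        fields = dict(part.split("=", 1) for part in row["other"].split("|"))
        cliques[fields["clique_id"]].append((int(fields["clique_index"]), row["object_id"]))
    # Sort each clique by its explicit index, then drop the index.
    return {cid: [obj for _, obj in sorted(members)] for cid, members in cliques.items()}


tagged = tag_rows(
    [{"object_id": "ORPHANET:449291"}, {"object_id": "UMLS:CN237736"}],
    clique_id="c1",
)
# Reversing the rows does not change the recovered order.
recovered = regroup(list(reversed(tagged)))
```

With this, merging or re-sorting the file would not lose the clique membership or the identifier order.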
The goal here is to have a format for storage not for sending back to clients?
In that case, is the ordering a property of the mappings themselves, or a function that NodeNormalizer applies after the fact (i.e. a priority list of prefixes from biolink)? If it's a property of the mappings themselves, maybe there is a more direct way to express this?
Same with IC value?
I've written a program to convert some of the files in the Babel compendia into SSSOM so we can look at them in my Dropbox. These files appear to pass validation on sssom-py apart from missing CURIE maps. If everybody's happy with these files, I can run my program on all the Babel compendia (which will probably take 0.5-1 days to run).
Some thoughts and questions:
- [...] ${compendium_filename}#${line_number_starting_from_zero}.
- [...] match_type because the master branch of sssom-py requires that, but once that's updated to the latest SSSOM version, I'll change that to a mapping_justification of semapv:MappingChaining ("A matching process based on the traversing of multiple mappings.") since I think that best captures how Babel is built.
- [...] mapping_set_id, mapping_set_description, mapping_set_version; see foodie-inc-2022-05-01.sssom.tsv as an example, but I can add those easily if needed.
- I don't think we have a definitive list of CURIE prefixes anywhere in Babel (the closest thing I could find was the prefixes file at https://github.com/TranslatorSRI/Babel/blob/master/src/prefixes.py). To generate CURIE maps for each of the compendia files, I'm planning to (1) hard-code the list of prefixes that are currently used in those files, and (2) make the code generate a warning if it sees a CURIE that doesn't use one of the predefined prefixes.
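
The CURIE map plan in the last bullet could look roughly like the sketch below: a hard-coded prefix table, plus a warning for any CURIE whose prefix is not in it. The table contents are illustrative only (the ORPHANET and UMLS expansions are placeholders, not the real URI prefixes), and `build_curie_map` is a hypothetical helper name.

```python
import warnings

# Hard-coded prefixes; the example.org expansions are placeholders, not real URI prefixes.
KNOWN_PREFIXES = {
    "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
    "ORPHANET": "https://example.org/ORPHANET/",
    "UMLS": "https://example.org/UMLS/",
}


def build_curie_map(curies, known=KNOWN_PREFIXES):
    """Collect the prefixes actually used, warning about any that are not predefined."""
    curie_map = {}
    for curie in curies:
        prefix, sep, _ = curie.partition(":")
        if not sep or prefix not in known:
            warnings.warn(f"CURIE {curie!r} does not use a predefined prefix")
            continue
        curie_map[prefix] = known[prefix]
    return curie_map
```

Only prefixes that actually occur in a file end up in that file's CURIE map, so each SSSOM file carries just the expansions it needs.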
There's a check in babel against the biolink prefixes for each type. So it will potentially write out anything in the biolink yaml for each type, and should not write out anything that isn't in that prefix list.
Just as we want to have Babel write SSSOM, NN will need to read it.