TranslatorSRI / NodeNormalization

Service that produces Translator compliant nodes given a curie
MIT License

Read SSSOM #111

Open cbizon opened 2 years ago

cbizon commented 2 years ago

Just as we want to have Babel write SSSOM, NN will need to read it.

cbizon commented 2 years ago

https://github.com/mapping-commons/sssom

We need to figure out a few things if we go to sssom:

  1. how to preserve the ordering of the identifiers list
  2. how to include type and information content

cbizon commented 2 years ago

@matentzn tells me that these problems can be handled with off the shelf sssom

matentzn commented 2 years ago

{"type": "biolink:Disease", "ic": "100", "identifiers": [{"i": "MONDO:0018670", "l": "symptomatic form of fragile X syndrome in female carrier"}, {"i": "ORPHANET:449291", "l": "Symptomatic form of fragile X syndrome in female carrier"}, {"i": "UMLS:CN237736"}]}

Assuming MONDO:0018670 is the clique leader (sssom 0.9.0, not sssom 1.0), a sssom file would look something like this:

| subject_id | subject_label | subject_category | predicate_id | object_id | object_label | object_category | match_type | other |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MONDO:0018670 | symptomatic form of fragile X syndrome in female carrier | biolink:Disease | skos:exactMatch | ORPHANET:449291 | Symptomatic form of fragile X syndrome in female carrier | biolink:Disease | HumanCurated | { subject_information_content: 100 } |
| MONDO:0018670 | symptomatic form of fragile X syndrome in female carrier | biolink:Disease | skos:exactMatch | UMLS:CN237736 | | biolink:Disease | HumanCurated | { subject_information_content: 100 } |

There are some features for natively supporting semantic similarity measures, see https://mapping-commons.github.io/sssom/Mapping/, but I don't think subject_information_content would qualify for that.
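
A minimal sketch of the conversion shown above, assuming the first entry in "identifiers" is the clique leader and that the information content goes into the "other" column as free text. The function names are made up for illustration and are not part of Babel or sssom-py.

```python
import json

# Column order follows the example table above (SSSOM 0.9-style, with match_type).
SSSOM_COLUMNS = [
    "subject_id", "subject_label", "subject_category", "predicate_id",
    "object_id", "object_label", "object_category", "match_type", "other",
]


def compendium_line_to_rows(line):
    """Turn one compendium JSON line into one row per non-leader identifier."""
    clique = json.loads(line)
    # Assumption: the first identifier is the clique leader.
    leader, *others = clique["identifiers"]
    ic = clique.get("ic")
    other = f"{{ subject_information_content: {ic} }}" if ic is not None else ""
    rows = []
    for ident in others:
        rows.append({
            "subject_id": leader["i"],
            "subject_label": leader.get("l", ""),
            "subject_category": clique["type"],
            "predicate_id": "skos:exactMatch",
            "object_id": ident["i"],
            "object_label": ident.get("l", ""),  # some identifiers have no label
            "object_category": clique["type"],
            "match_type": "HumanCurated",
            "other": other,
        })
    return rows


def rows_to_tsv(rows):
    """Serialize rows as tab-separated text with a header line."""
    lines = ["\t".join(SSSOM_COLUMNS)]
    for row in rows:
        lines.append("\t".join(row[col] for col in SSSOM_COLUMNS))
    return "\n".join(lines)
```

Calling rows_to_tsv(compendium_line_to_rows(line)) on the JSON line above would give a header plus two mapping rows. A clique with a single identifier would produce no rows at all in this sketch.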

cbizon commented 2 years ago

Thanks! Is it required to repeat the subject_labels or categories etc when they are repeated?

If we are using the ordering of the rows as information, are we abusing the format?

matentzn commented 2 years ago

I would keep the information redundant with the labels, but nothing in sssom requires you to. I like that in general so that I can more easily combine different mapping sets, merge them etc.

I think expecting the row order to mean something is not very reliable.

If you wanted to be 100% reliable you could of course export all cliques as separate sssom files. This is what I think Chris does. But it would result in 5000 files. It's an interesting use case. Maybe if you could create an identifier for each clique, you could put it into the "other" column. Sorry maybe sssom is not ideal here, but we could consider extensions to the format to cover this use case (named groups for mappings).

cbizon commented 2 years ago

I suppose we could put a clique id of some sort in the other column. And perhaps an index to define the order if we don't want to rely on the row order...
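
A rough sketch of that idea, assuming the "other" column holds key=value pairs separated by "|". That encoding, and the key names clique_id and clique_index, are made up here and are not defined by SSSOM.

```python
def tag_rows(clique_id, object_ids):
    """Attach a clique id and an explicit 1-based position to each mapping row."""
    return [
        {"object_id": oid, "other": f"clique_id={clique_id}|clique_index={idx}"}
        for idx, oid in enumerate(object_ids, start=1)
    ]


def parse_other(other):
    """Parse the hypothetical key=value|key=value encoding of the other column."""
    return dict(part.split("=", 1) for part in other.split("|"))


def restore_order(rows):
    """Group rows by clique id and sort by clique index, ignoring row order."""
    cliques = {}
    for row in rows:
        meta = parse_other(row["other"])
        member = (int(meta["clique_index"]), row["object_id"])
        cliques.setdefault(meta["clique_id"], []).append(member)
    return {cid: [oid for _, oid in sorted(members)]
            for cid, members in cliques.items()}
```

With the index stored per row, the rows can be shuffled or merged with other mapping sets and restore_order still recovers each clique's identifier order.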

cmungall commented 2 years ago

The goal here is to have a format for storage not for sending back to clients?

In that case, is the ordering a property of the mappings themselves, or a function that NodeNormalizer applies after the fact (i.e. a priority list of prefixes from Biolink)? If it's a property of the mappings themselves, maybe there is a more direct way to express this?

Same with IC value?
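
If the ordering is something applied after the fact, it could look roughly like the sketch below. The priority list is a made-up example; a real one would come from the Biolink model for each type.

```python
# Hypothetical prefix priority for one type; not taken from the Biolink model.
PREFIX_PRIORITY = ["MONDO", "ORPHANET", "UMLS"]


def prefix_of(curie):
    """Return the part of a CURIE before the first colon."""
    return curie.split(":", 1)[0]


def order_by_prefix(curies, priority=PREFIX_PRIORITY):
    """Sort CURIEs by prefix priority; unknown prefixes go last."""
    rank = {prefix: i for i, prefix in enumerate(priority)}
    # sorted() is stable, so CURIEs with the same prefix keep their input order.
    return sorted(curies, key=lambda c: rank.get(prefix_of(c), len(rank)))
```

In that case the stored mappings would not need to carry any ordering, since the first element of the sorted list is the preferred identifier.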

gaurav commented 2 years ago

I've written a program to convert some of the files in the Babel compendia into SSSOM so we can look at them in my Dropbox. These files appear to pass validation on sssom-py apart from missing CURIE maps. If everybody's happy with these files, I can run my program on all the Babel compendia (which will probably take 0.5-1 days to run).

Some thoughts and questions:

  1. I don't think we have a definitive list of CURIE prefixes anywhere in Babel (the closest thing I could find was the prefixes file at https://github.com/TranslatorSRI/Babel/blob/master/src/prefixes.py). To generate CURIE maps for each of the compendia files, I'm planning to (1) hard-code the list of prefixes that are currently used in those files, and (2) make the code generate a warning if it sees a CURIE that doesn't use one of the predefined prefixes.
  2. We have a large number of cliques that only consist of a single individual (e.g. CHEBI:61535 "poly(1,4-phenylene oxide) polymer" from ChemicalMixture.sssom.tsv), which we would still like to load into NodeNorm so that it can be returned as the preferred identifier. I'm currently modeling these by saying this identifier is an exactMatch to itself. Is there a more elegant way of modeling this?
  3. I'm not sure if we need a separate clique ID -- wouldn't the clique leader's ID be unique within a particular compendium file? In this run, I made up a clique ID in the format ${compendium_filename}#${line_number_starting_from_zero}.
  4. Is there any benefit to putting the synonym information into the SSSOM files as well? I don't think so, and only used the information from the compendium files for these files.
  5. I used match_type because the master branch of sssom-py requires that, but once that's updated to the latest SSSOM version, I'll change that to a mapping_justification of semapv:MappingChaining ("A matching process based on the traversing of multiple mappings.") since I think that best captures how Babel is built.
  6. I didn't fill in any of the optional metadata fields (e.g. mapping_set_id, mapping_set_description, mapping_set_version; see foodie-inc-2022-05-01.sssom.tsv as an example), but I can add those easily if needed.
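
Points 2 and 3 above can be sketched as follows. This is an illustration only, not the actual conversion program: the function names are made up, the first identifier in each clique is assumed to be the leader, and the filename in the usage note is hypothetical.

```python
def clique_id(compendium_filename, line_number):
    """Build a clique ID: filename, '#', then the zero-based line number."""
    return f"{compendium_filename}#{line_number}"


def clique_mappings(identifiers):
    """Return (subject, predicate, object) triples for one clique.

    Assumption: the first identifier is the clique leader. A clique with a
    single identifier is modeled as an exactMatch from that identifier to itself.
    """
    leader = identifiers[0]
    objects = identifiers[1:] or [leader]
    return [(leader, "skos:exactMatch", obj) for obj in objects]


def compendium_mappings(compendium_filename, cliques):
    """Yield (clique_id, subject, predicate, object) for every clique in a file."""
    out = []
    for line_number, identifiers in enumerate(cliques):
        cid = clique_id(compendium_filename, line_number)
        for triple in clique_mappings(identifiers):
            out.append((cid, *triple))
    return out
```

For example, compendium_mappings("ChemicalMixture.txt", [["CHEBI:61535"]]) gives a single self-match row tagged with the clique ID "ChemicalMixture.txt#0".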

cbizon commented 2 years ago

> 1. I don't think we have a definitive list of CURIE prefixes anywhere in Babel (the closest thing I could find was the prefixes file at https://github.com/TranslatorSRI/Babel/blob/master/src/prefixes.py). To generate CURIE maps for each of the compendia files, I'm planning to (1) hard-code the list of prefixes that are currently used in those files, and (2) make the code generate a warning if it sees a CURIE that doesn't use one of the predefined prefixes.

There's a check in Babel against the Biolink prefixes for each type. So it will potentially write out anything in the Biolink YAML for each type, and should not write out anything that isn't in that prefix list.
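
A sketch of how such a per-type prefix check with warnings might look on the converter side. The prefix table below is a made-up stand-in for what would be read from the Biolink YAML, and check_prefixes is not an actual Babel function.

```python
import warnings

# Hypothetical allowed prefixes per type; the real list would come from Biolink.
ALLOWED_PREFIXES = {
    "biolink:Disease": {"MONDO", "ORPHANET", "UMLS"},
}


def check_prefixes(biolink_type, curies, allowed=ALLOWED_PREFIXES):
    """Return the CURIEs whose prefix is allowed for the type, warning on the rest."""
    permitted = allowed.get(biolink_type, set())
    kept = []
    for curie in curies:
        prefix = curie.split(":", 1)[0]
        if prefix in permitted:
            kept.append(curie)
        else:
            warnings.warn(f"{curie}: prefix {prefix!r} is not listed for {biolink_type}")
    return kept
```

The same table could also drive the CURIE map, since any prefix that passes the check is by construction one of the predefined ones.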