Closed: erikyao closed this issue 1 year ago.
@andrewsu proposed the following policies for retired CUIs and piped CUIs:
> For the one-to-one bijective mappings, I agree on the simple replacement.

Confirmed.

> For the other injective many-to-one mappings, I think I'm also good with simple replacement.

Confirmed.

> For the one-to-many, I think I'm good with just duplicating the original record multiple times, one for each of the mapped CUI2s. It seems like this would be a pretty modest increase in size.

Discussion pending. The expansion in size is small, but it might introduce duplicate content among the expanded documents.

> For the deletions, I might be okay just leaving them in. It's a faithful statement of what's stated, and I think for BTE, it pretty much will end up being ignored since the node normalizer will not know what to do with them.

Negative. These should be deleted when parsing/uploading (and logged if necessary).

> For piped subjects/objects, I seem to recall having this discussion with Sander previously, and that he implemented a solution where the record was duplicated multiple times, similar to how I'm proposing handling the one-to-many case.

New analysis required.
File `semmedVER43_2022_R_PREDICATION.csv` contains 117,589,597 rows. After removing rows with `SUBJECT_NOVELTY == 0` or `OBJECT_NOVELTY == 0`, 81,282,024 rows remain, among which the distribution of rows with/without piped CUIs is:
| row type | count | ratio |
| --- | --- | --- |
| w/o piped CUIs | 74,288,575 | 91.4% |
| with piped CUIs | 6,993,449 | 8.6% |
The distributions of the counts of rows containing retired CUIs among the two types of rows are listed below, where the ratios are calculated against the total number of rows (81,282,024):
| status | piped or not? | count | ratio | remark |
| --- | --- | --- | --- | --- |
| retired (total) | :x: | 4,101,513 | 5.05% | |
| retired (total) | :white_check_mark: | 291,629 | 0.36% | |
| (1) deleted | :x: | 14,737 | 0.02% | |
| (1) deleted | :white_check_mark: | 3,536 | 0.004% | |
| (2) injective | :x: | 3,940,070 | 4.85% | |
| (2) injective | :white_check_mark: | 283,839 | 0.35% | |
| (2.1) bijective | :x: | 1,052,756 | 1.30% | |
| (2.1) bijective | :white_check_mark: | 155,048 | 0.19% | |
| (3) one-to-many | :x: | 150,634 | 0.19% | avg. out-degree 2.07 |
| (3) one-to-many | :white_check_mark: | 4,272 | 0.005% | avg. out-degree 2.80 |
Note that if we create a new predication for each mapped CUI, those 150,634 rows with one-to-many mapped, non-piped CUIs will expand to 311,800 documents (avg. out-degree 2.07). Similarly, the piped ones will expand to 11,935 documents (avg. out-degree 2.80).
Note that in this section, retired CUIs (or the replacement plans) are not taken into consideration.
The current splitting policies were proposed here:
- In cases where a UMLS CUI is followed by one or more numeric IDs (presumed to be NCBI Gene IDs), e.g. `C0056207|3075`, discard the numeric IDs and process as usual.
- In cases where the field consists only of one or more pipe-separated numeric IDs, create separate documents for each numeric ID using the key `ncbigene`.
Following these policies, the 6,993,449 rows with piped CUIs will produce 7,959,310 documents (1.14 docs per row). The total number of documents will be 7,959,310 + 74,288,575 = 82,247,885.

If we change the first policy and do not discard any of the numeric IDs, those 6,993,449 rows will generate 17,335,870 documents (2.48 docs per row). The total number of documents will come to 17,335,870 + 74,288,575 = 91,624,445, an 11.5% increase over the current policies.
In summary:
| Splitting Policies | Rows with Piped CUIs | Documents from Piped Rows | Docs per Piped Row | Total Documents | Remark |
| --- | --- | --- | --- | --- | --- |
| current | 6,993,449 | 7,959,310 | 1.14 | 82,247,885 | |
| new | 6,993,449 | 17,335,870 | 2.48 | 91,624,445 | 11.5% :arrow_heading_up: in total |
P.S. The current https://biothings.ncats.io/semmeddb API has 114,383,742 documents, but it includes docs with zero novelty scores.
> Great, let's handle these by group:
>
> Also, regarding piping, there are 6,993,449 predications with some piping. Can you calculate/estimate the number of predications they would turn into if you created a new predication for each ID in the pipe?
Using the same numbering as Andrew did in his post:
With piping:
@andrewsu the scope of the issue was somewhat discussed here. However, the full effect on predications wasn't clear. For example, are there cases where both the subject + object have piped IDs - and how much expansion would then happen?
I think there's still some vagueness: are there any combos of IDs in a piped thing where the IDs represent "equivalent" things, to the point where we don't want to expand to multiple records? For example: when there's 1 Entrez ID and 1 CUI, are those two IDs "equivalent" enough that we just want a record with 1 of the IDs (probably the Entrez one)? Maybe one way to tell "equivalent" is when it's easy to find a cross-mapping between the Entrez ID and the CUI (in MyGene for instance)?
On the other hand, I'm starting to be less concerned about the chance of having "duplicated information" from expanding piped IDs that are basically equivalent into multiple records (each record = 1 combo of subject ID and object ID). At least, I think BTE can kinda handle it.
For example, semmeddb currently has 3 records corresponding to the exact same triple + pmid. But when BTE is queried for that triple (see query details below), the edge only has one instance of that PMID (8959933) in its `biolink:publications` array. This means BTE runs set-like operations to get only unique values (maybe here). To some extent, I think BTE will process the API's response and merge/take only unique values.
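The set-like merging described above is simple to illustrate. This is only a toy sketch of the behavior, not BTE's actual code; the record shape is invented for the example.

```python
# Toy sketch: three duplicate records for the same triple collapse into
# one edge whose publications list holds only unique values.
records = [
    {"triple": ("A", "affects", "B"), "pmid": "8959933"},
    {"triple": ("A", "affects", "B"), "pmid": "8959933"},
    {"triple": ("A", "affects", "B"), "pmid": "8959933"},
]
# A set comprehension deduplicates before building the final list
publications = sorted({f"PMID:{r['pmid']}" for r in records})
# publications == ["PMID:8959933"]
```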
response from querying only semmeddb through BTE (POST to http://localhost:3000/v1/smartapi/1d288b3a3caf75d541ffaae3aab386c8/query locally):
Great, let's handle these by group:
- Deleted: Let's delete the 18,273 predications referencing deleted IDs
- Injective: The 4,223,909 predications here with old IDs would essentially map to the same number of predications with new IDs (not accounting for piping). That seems reasonable, so let's do this. (This is also the biggest group, so this decision likely solves 98% of the issue here...)
- One-to-many: So there are 154,906 predications with retired IDs that map to multiple new IDs. Can you easily calculate/estimate the number of predications they would turn into if you created new predications for every new ID? (E.g., if semmed says A -> B_ret and MRCUI says B_ret -> C and B_ret -> D, then we would create 2 predications for A->C and A->D). Again, ignore piping for now... (I'm guessing the number here will still be very low as a percentage of semmeddb overall, so I lean toward just doing this.)
Also, regarding piping, there are 6,993,449 predications with some piping. Can you calculate/estimate the number of predications they would turn into if you created a new predication for each ID in the pipe?
@andrewsu @colleenXu, please find my updated comments above.
Fantastic, I think we are very close here. @erikyao, In this comment, you mention there are three classes of piped IDs:
1. 1 UMLS + 1 Entrez
2. 1 UMLS + N Entrez
3. N Entrez
Can you post a sampling (maybe 20 examples) of the "1 UMLS + N Entrez" group? I'd just like to understand that group a bit better...
'C0074479|4489|4490|4493|4494|4495|4496|4498|4499|4500|4501|4543|56052|644314'
'Antigens,CD43|MT1A|MT1B|MT1E|MT1F|MT1G|MT1H|MT1JP|MT1M|MT1L|MT1X|MTNR1A|ALG1|MT1IP'
'C0682972|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'G-Protein-Coupled Receptors|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'
'C0597298|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'Protein Isoforms|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'
'C0079427|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'Tumor Suppressor Genes|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'
'C0017968|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'Glycoproteins|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'
'C0033684|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'Proteins|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'
'C0033371|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'Prolactin|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'
'C0002210|250|470|6590|10850|26033|27295|55226|80150'
'alpha-Fetoproteins|ALPP|ATHS|SLPI|CCL27|ATRNL1|PDLIM3|NAT10|ASRGL1'
'C0212691|1523|4791|4940|6490|9733|22974|27044|84164'
'lyt-10 protein|CUX1|NFKB2|OAS3|PMEL|SART3|TPX2|SND1|ASCC2'
'C0126732|250|470|6590|10850|26033|27295|55226|80150'
'I Kappa B-Alpha|ALPP|ATHS|SLPI|CCL27|ATRNL1|PDLIM3|NAT10|ASRGL1'
'C0600251|250|470|6590|10850|26033|27295|55226|80150'
'Interleukin-1 alpha|ALPP|ATHS|SLPI|CCL27|ATRNL1|PDLIM3|NAT10|ASRGL1'
'C0085828|2353|2354|3725|3726|3727'
'Transcription Factor AP-1|FOS|FOSB|JUN|JUNB|JUND'
'C0083957|3854|3872|5126|5311|8535'
'Proprotein Convertase 2|KRT6B|KRT17|PCSK2|PKD2|CBX4'
'C0135615|3853|5122|7832|10120|57332'
'Proprotein Convertase 1|KRT6A|PCSK1|BTG2|ACTR1B|CBX8'
'C1141639|1081|3342|93659'
'Human Chorionic Gonadotropin|CGA|HTC2|CGB5'
'C0007082|1048|1084|5670'
'Carcinoembryonic Antigen|CEACAM5|CEACAM3|PSG2'
'C0968902|2167|2971|7020'
'Transcription Factor AP-2 Alpha|FABP4|GTF3A|TFAP2A'
'C1335440|100616102|100862685|100862688'
'Polymerase Gene|ERVK-9|ERVK-19|ERVK-11'
'C1335439|100616102|100862685|100862688'
'Polymerase|ERVK-9|ERVK-19|ERVK-11'
'C0035681|100616102|100862685|100862688'
'DNA-Directed RNA Polymerase|ERVK-9|ERVK-19|ERVK-11'
'C0012892|100616102|100862685|100862688'
'DNA-Directed DNA Polymerase|ERVK-9|ERVK-19|ERVK-11'
For "1 UMLS + N Entrez", it seems like the UMLS ID and the Entrez IDs are not equivalent. Then maybe we want to change the current splitting policy: "In cases where a UMLS CUI is followed by one or more numeric IDs (presumed to be NCBI Gene IDs) e.g., C0056207|3075, discard the numeric IDs and process as usual"? Change to not discarding the numeric IDs?
Has "Point B" above been explored? I was wondering if the "1 UMLS + 1 Entrez" are equivalent.
Perhaps a generic way of handling the case of "1 UMLS + N Entrez" (including "1 UMLS + 1 Entrez") is to keep all Entrez IDs and create multiple records unless an Entrez ID also maps to the UMLS ID according to the Node Normalizer. Thoughts?
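The proposal above could be sketched as follows. This is an assumed implementation (the function name, the `(key, id)` output shape, and the `entrez_to_cui` lookup are inventions for illustration), not settled behavior.

```python
# Hypothetical sketch of the proposed policy: keep every Entrez ID and
# emit one record per ID, unless an Entrez ID cross-maps to the piped
# UMLS CUI (per Node Normalizer or MyGene), in which case the pair
# collapses into a single record.

def split_umls_plus_entrez(cui, entrez_ids, entrez_to_cui):
    """`entrez_to_cui` is a lookup of known Entrez -> UMLS cross-mappings."""
    equivalent = [g for g in entrez_ids if entrez_to_cui.get(g) == cui]
    if equivalent:
        # Equivalent pair: one record, preferring the Entrez ID
        return [("ncbigene", equivalent[0])]
    # No cross-mapping: one record per ID, the UMLS CUI included
    return [("umls", cui)] + [("ncbigene", g) for g in entrez_ids]

# Cross-mappings as in the MyGene response discussed below (assumed)
lookup = {"100616102": "C3147204", "100862688": "C3147206"}

# None of the Entrez IDs map to C0012892, so all IDs are kept
split_umls_plus_entrez("C0012892", ["100616102", "100862688"], lookup)
# -> [("umls", "C0012892"), ("ncbigene", "100616102"), ("ncbigene", "100862688")]

# An equivalent pair collapses to the Entrez ID alone
split_umls_plus_entrez("C3147204", ["100616102"], lookup)
# -> [("ncbigene", "100616102")]
```

Whether the non-equivalent case should also emit the UMLS record (as sketched here) or drop it is one detail the policy would still need to pin down.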
I think it's an interesting idea. Would we want to use MyGene, rather than Node Normalizer?
For example, one can query either the entrezgene field and then look at the umls field or vice versa...
Here's an example using the pair `'C0012892|100616102|100862685|100862688'` / `'DNA-Directed DNA Polymerase|ERVK-9|ERVK-19|ERVK-11'` from the samples above. POST to https://mygene.info/v3/query?fields=entrezgene,umls,symbol,name,taxid:
```json
{
  "q": "100616102,100862685,100862688",
  "scopes": "entrezgene"
}
```
Response. Notice that none of the `umls` IDs returned match C0012892 / DNA-Directed DNA Polymerase:
```json
[
  {
    "query": "100616102",
    "_id": "100616102",
    "_score": 26.72278,
    "entrezgene": "100616102",
    "name": "endogenous retrovirus group K member 9",
    "symbol": "ERVK-9",
    "taxid": 9606,
    "umls": {
      "cui": "C3147204"
    }
  },
  {
    "query": "100862685",
    "notfound": true
  },
  {
    "query": "100862688",
    "_id": "100862688",
    "_score": 25.927315,
    "entrezgene": "100862688",
    "name": "endogenous retrovirus group K member 11",
    "symbol": "ERVK-11",
    "taxid": 9606,
    "umls": {
      "cui": "C3147206"
    }
  }
]
```
> Would we want to use MyGene, rather than Node Normalizer?
I think we should use Node Normalizer (assuming we can figure out batch querying via POST). Unless there is any other discussion or dissent, @erikyao please implement this behavior that I described in this comment.
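Batch querying the SRI Node Normalizer is done via a POST to its `get_normalized_nodes` endpoint with a list of CURIEs. The sketch below shows one way the payload could be built and the response mined for equivalent CUIs; it is an assumption-laden sketch (the trimmed response is canned, and the payload/response shapes should be confirmed against the service's OpenAPI docs), not the agreed implementation.

```python
# Hypothetical sketch of batch-querying the SRI Node Normalizer
# (POST /get_normalized_nodes with a {"curies": [...]} body).

def build_payload(entrez_ids):
    """Build the POST body for a batch of NCBI Gene IDs."""
    return {"curies": [f"NCBIGene:{g}" for g in entrez_ids]}

def equivalent_cuis(response, curie):
    """Extract the UMLS CUIs listed among a node's equivalent identifiers."""
    entry = response.get(curie) or {}
    ids = [eq["identifier"] for eq in entry.get("equivalent_identifiers", [])]
    return [i.split(":", 1)[1] for i in ids if i.startswith("UMLS:")]

payload = build_payload(["100616102"])
# payload == {"curies": ["NCBIGene:100616102"]}

# A trimmed, canned response of the shape the service returns:
canned = {
    "NCBIGene:100616102": {
        "equivalent_identifiers": [
            {"identifier": "NCBIGene:100616102"},
            {"identifier": "UMLS:C3147204"},
        ]
    }
}
equivalent_cuis(canned, "NCBIGene:100616102")  # ["C3147204"]
```

If that extracted CUI matches the piped UMLS CUI, the pair would be treated as equivalent per the policy Andrew described.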
Noting how RTX-KG2 is doing it:
@andrewsu @erikyao did we ever decide on the one-to-many retired ID issue (andrew's post, my post)? This is before we get into pipes, where 1 retired ID is mapped to multiple current IDs.
Hi @colleenXu , I think @andrewsu suggested replacement with all the mapped new IDs. Quote:
> E.g., if semmed says A -> B_ret and MRCUI says B_ret -> C and B_ret -> D, then we would create 2 predications for A->C and A->D
Given the small expansion in triples based on Yao's updated comment, yes, I think we proceed with the plan @erikyao quoted above:

> E.g., if semmed says A -> B_ret and MRCUI says B_ret -> C and B_ret -> D, then we would create 2 predications for A->C and A->D
File `semmedVER43_2022_R_PREDICATION.csv` contains 117,589,597 rows. After removing rows with `SUBJECT_NOVELTY == 0` or `OBJECT_NOVELTY == 0`, 81,282,024 rows remain. Among those rows, there are 303,080 unique subject CUIs and 262,268 unique object CUIs (piped CUIs decomposed and counted).

Following the MRCUI.RRF data analysis, we found that, for subject CUIs, the counts and ratios of retired CUIs are:
and for object CUIs,
It's a safe bet to consider only the deleted and bijectively mapped CUIs. It's also worth considering only mappings with the `SY` relationship.
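Reading MRCUI.RRF for just the deletions and `SY` mappings could look like the sketch below. The pipe-delimited column order (`CUI1|VER|REL|RELA|MAPREASON|CUI2|MAPIN|`) follows the UMLS documentation but should be verified against the release in use; the function and sample rows are illustrative.

```python
# Hypothetical sketch: collect deleted CUIs and SY (synonymous)
# retirement mappings from MRCUI.RRF.
import csv
import io

def parse_mrcui(lines):
    """Return (deleted CUIs, {retired CUI: [current CUIs]}) from MRCUI rows."""
    deleted, sy_map = set(), {}
    for row in csv.reader(lines, delimiter="|"):
        cui1, _ver, rel, _rela, _reason, cui2 = row[:6]
        if rel == "DEL":
            deleted.add(cui1)
        elif rel == "SY" and cui2:
            sy_map.setdefault(cui1, []).append(cui2)
    return deleted, sy_map

# Two made-up rows in the documented MRCUI.RRF layout
sample = io.StringIO(
    "C0000002|2022AB|DEL|||||\n"
    "C0000003|2022AB|SY|||C0000005|Y|\n"
)
deleted, sy_map = parse_mrcui(sample)
# deleted == {"C0000002"}; sy_map == {"C0000003": ["C0000005"]}
```

Rows with `RB`, `RN`, or `RO` relationships would simply be skipped under this restriction.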