Closed: erikyao closed this issue 1 year ago.
@andrewsu proposed the following policies for retired CUIs and piped CUIs:
> For the one-to-one bijective mappings, I agree on the simple replacement.

Confirmed.

> For the other injective many-to-one mappings, I think I'm also good with simple replacement.

Confirmed.

> For the one-to-many, I think I'm good with just duplicating the original record multiple times, one for each of the mapped CUI2s. It seems like this would be a pretty modest increase in size.

Discussion pending. The expansion in size is small, but it might introduce duplicate content among the expanded documents.

> For the deletions, I might be okay just leaving them in. It's a faithful statement of what's stated, and I think for BTE, it pretty much will end up being ignored since the node normalizer will not know what to do with them.

Negative. These should be deleted when parsing/uploading (and logged if necessary).

> For piped subjects/objects, I seem to recall having this discussion with Sander previously, and that he implemented a solution where the record was duplicated multiple times, similar to how I'm proposing handling the one-to-many case.

New analysis required.
File `semmedVER43_2022_R_PREDICATION.csv` contains 117,589,597 rows. After removing rows with `SUBJECT_NOVELTY == 0` or `OBJECT_NOVELTY == 0`, 81,282,024 rows remain, among which the distribution of rows with/without piped CUIs is:
| row type | count | ratio |
| --- | --- | --- |
| w/o piped CUIs | 74,288,575 | 91.4% |
| with piped CUIs | 6,993,449 | 8.6% |
The distributions of the counts of rows containing retired CUIs among the two types of rows are listed below, where the ratios are calculated against the total number of rows (81,282,024):
| status | piped or not? | count | ratio | remark |
| --- | --- | --- | --- | --- |
| retired (total) | :x: | 4,101,513 | 5.05% | |
| retired (total) | :white_check_mark: | 291,629 | 0.36% | |
| (1) deleted | :x: | 14,737 | 0.02% | |
| (1) deleted | :white_check_mark: | 3,536 | 0.004% | |
| (2) injective | :x: | 3,940,070 | 4.85% | |
| (2) injective | :white_check_mark: | 283,839 | 0.35% | |
| (2.1) bijective | :x: | 1,052,756 | 1.30% | |
| (2.1) bijective | :white_check_mark: | 155,048 | 0.19% | |
| (3) one-to-many | :x: | 150,634 | 0.19% | avg. out-degree 2.07 |
| (3) one-to-many | :white_check_mark: | 4,272 | 0.005% | avg. out-degree 2.80 |
Note that if we create a new predication for each mapped CUI, those 150,634 rows with one-to-many mapped, non-piped CUIs will expand to 311,800 documents (avg. out-degree 2.07). Similarly, the piped ones will expand to 11,935 documents (avg. out-degree 2.80).
Note that in this section, retired CUIs (or the replacement plans) are not taken into consideration.
The current splitting policies were proposed here:
- In cases where a UMLS CUI is followed by one or more numeric IDs (presumed to be NCBI Gene IDs), e.g. `C0056207|3075`, discard the numeric IDs and process as usual.
- In cases where the field consists only of one or more pipe-separated numeric IDs, create separate documents for each numeric ID using the key `ncbigene`.
Following these policies, the 6,993,449 rows with piped CUIs will produce 7,959,310 documents (1.14 docs per row). The total number of documents will be 7,959,310 + 74,288,575 = 82,247,885.

If we change the first policy and do not discard any of the numeric IDs, those 6,993,449 rows will generate 17,335,870 documents (2.48 docs per row). The total number of documents will come to 17,335,870 + 74,288,575 = 91,624,445, an 11.5% increase over the current policies.
In summary:
| Splitting Policies | Rows with Piped CUIs | Documents from Piped Rows | Docs per Piped Row | Total Documents | Remark |
| --- | --- | --- | --- | --- | --- |
| current | 6,993,449 | 7,959,310 | 1.14 | 82,247,885 | |
| new | 6,993,449 | 17,335,870 | 2.48 | 91,624,445 | 11.5% :arrow_heading_up: in total |
P.S. The current https://biothings.ncats.io/semmeddb API has 114,383,742 documents, but it includes docs with zero novelty scores.
> Great, let's handle these by group:
>
> Also, regarding piping, there are 6,993,449 predications with some piping. Can you calculate/estimate the number of predications they would turn into if you created a new predication for each ID in the pipe?
Using the same numbering as Andrew did in his post:
With piping:
@andrewsu the scope of the issue was somewhat discussed here. However, the full effect on predications wasn't clear. For example, are there cases where both the subject + object have piped IDs - and how much expansion would then happen?
I think there's still some vagueness: are there any combos of IDs in a piped thing where the IDs represent "equivalent" things, to the point where we don't want to expand to multiple records? For example: when there's 1 Entrez ID and 1 CUI, are those two IDs "equivalent" enough that we just want a record with 1 of the IDs (probably the Entrez one)? Maybe one way to tell "equivalent" is when it's easy to find a cross-mapping between the Entrez ID and the CUI (in MyGene for instance)?
On the other hand, I'm starting to be less concerned about the chance of having "duplicated information" from expanding piped IDs that are basically equivalent into multiple records (each record = 1 combo of subject ID and object ID). At least, I think BTE can kinda handle it.
For example, semmeddb currently has 3 records corresponding to the exact same triple + pmid. But when BTE is queried for that triple (see query details below), the edge only has one instance of that PMID (8959933) in its `biolink:publications` array. This means BTE runs set-like operations to get only unique values (maybe here). To some extent, I think BTE will process the API's response and merge/take only unique values.
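The set-like merging described above is simple to illustrate. This is only a toy sketch of the behavior, not BTE's actual code; the record shape is invented for the example.

```python
# Toy sketch: three duplicate records for the same triple collapse into
# one edge whose publications list holds only unique values.
records = [
    {"triple": ("A", "affects", "B"), "pmid": "8959933"},
    {"triple": ("A", "affects", "B"), "pmid": "8959933"},
    {"triple": ("A", "affects", "B"), "pmid": "8959933"},
]
# A set comprehension deduplicates before building the final list
publications = sorted({f"PMID:{r['pmid']}" for r in records})
# publications == ["PMID:8959933"]
```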
response from querying only semmeddb through BTE (POST to http://localhost:3000/v1/smartapi/1d288b3a3caf75d541ffaae3aab386c8/query locally):
Great, let's handle these by group:
- Deleted: Let's delete the 18,273 predications referencing deleted IDs
- Injective: The 4,223,909 predications here with old IDs would essentially map to the same number of predications with new IDs (not accounting for piping). That seems reasonable, so let's do this. (This is also the biggest group, so this decision likely solves 98% of the issue here...)
- One-to-many: So there are 154,906 predications with retired IDs that map to multiple new IDs. Can you easily calculate/estimate the number of predications they would turn into if you created new predications for every new ID? (E.g., if semmed says A -> B_ret and MRCUI says B_ret -> C and B_ret -> D, then we would create 2 predications for A->C and A->D). Again, ignore piping for now... (I'm guessing the number here will still be very low as a percentage of semmeddb overall, so I lean toward just doing this.)
Also, regarding piping, there are 6,993,449 predications with some piping. Can you calculate/estimate the number of predications they would turn into if you created a new predication for each ID in the pipe?
@andrewsu @colleenXu, please find my updated comments above.
Fantastic, I think we are very close here. @erikyao, In this comment, you mention there are three classes of piped IDs:
1. 1 UMLS + 1 Entrez
2. 1 UMLS + N Entrez
3. N Entrez
Can you post a sampling (maybe 20 examples) of the "1 UMLS + N Entrez" group? I'd just like to understand that group a bit better...
'C0074479|4489|4490|4493|4494|4495|4496|4498|4499|4500|4501|4543|56052|644314'
'Antigens,CD43|MT1A|MT1B|MT1E|MT1F|MT1G|MT1H|MT1JP|MT1M|MT1L|MT1X|MTNR1A|ALG1|MT1IP'
'C0682972|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'G-Protein-Coupled Receptors|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'
'C0597298|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'Protein Isoforms|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'
'C0079427|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'Tumor Suppressor Genes|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'
'C0017968|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'Glycoproteins|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'
'C0033684|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'Proteins|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'
'C0033371|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'Prolactin|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'
'C0002210|250|470|6590|10850|26033|27295|55226|80150'
'alpha-Fetoproteins|ALPP|ATHS|SLPI|CCL27|ATRNL1|PDLIM3|NAT10|ASRGL1'
'C0212691|1523|4791|4940|6490|9733|22974|27044|84164'
'lyt-10 protein|CUX1|NFKB2|OAS3|PMEL|SART3|TPX2|SND1|ASCC2'
'C0126732|250|470|6590|10850|26033|27295|55226|80150'
'I Kappa B-Alpha|ALPP|ATHS|SLPI|CCL27|ATRNL1|PDLIM3|NAT10|ASRGL1'
'C0600251|250|470|6590|10850|26033|27295|55226|80150'
'Interleukin-1 alpha|ALPP|ATHS|SLPI|CCL27|ATRNL1|PDLIM3|NAT10|ASRGL1'
'C0085828|2353|2354|3725|3726|3727'
'Transcription Factor AP-1|FOS|FOSB|JUN|JUNB|JUND'
'C0083957|3854|3872|5126|5311|8535'
'Proprotein Convertase 2|KRT6B|KRT17|PCSK2|PKD2|CBX4'
'C0135615|3853|5122|7832|10120|57332'
'Proprotein Convertase 1|KRT6A|PCSK1|BTG2|ACTR1B|CBX8'
'C1141639|1081|3342|93659'
'Human Chorionic Gonadotropin|CGA|HTC2|CGB5'
'C0007082|1048|1084|5670'
'Carcinoembryonic Antigen|CEACAM5|CEACAM3|PSG2'
'C0968902|2167|2971|7020'
'Transcription Factor AP-2 Alpha|FABP4|GTF3A|TFAP2A'
'C1335440|100616102|100862685|100862688'
'Polymerase Gene|ERVK-9|ERVK-19|ERVK-11'
'C1335439|100616102|100862685|100862688'
'Polymerase|ERVK-9|ERVK-19|ERVK-11'
'C0035681|100616102|100862685|100862688'
'DNA-Directed RNA Polymerase|ERVK-9|ERVK-19|ERVK-11'
'C0012892|100616102|100862685|100862688'
'DNA-Directed DNA Polymerase|ERVK-9|ERVK-19|ERVK-11'
For "1 UMLS + N Entrez", it seems like the UMLS ID and the Entrez IDs are not equivalent. Then maybe we want to change the current splitting policy: "In cases where a UMLS CUI is followed by one or more numeric IDs (presumed to be NCBI Gene IDs) e.g., C0056207|3075, discard the numeric IDs and process as usual"? Change to not discarding the numeric IDs?
Has "Point B" above been explored? I was wondering if the "1 UMLS + 1 Entrez" are equivalent.
Perhaps a generic way of handling the case of "1 UMLS + N Entrez" (including "1 UMLS + 1 Entrez") is to keep all Entrez IDs and create multiple records unless an Entrez ID also maps to the UMLS ID according to the Node Normalizer. Thoughts?
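The proposal above could be sketched as follows. This is an assumed implementation (the function name, the `(key, id)` output shape, and the `entrez_to_cui` lookup are inventions for illustration), not settled behavior.

```python
# Hypothetical sketch of the proposed policy: keep every Entrez ID and
# emit one record per ID, unless an Entrez ID cross-maps to the piped
# UMLS CUI (per Node Normalizer or MyGene), in which case the pair
# collapses into a single record.

def split_umls_plus_entrez(cui, entrez_ids, entrez_to_cui):
    """`entrez_to_cui` is a lookup of known Entrez -> UMLS cross-mappings."""
    equivalent = [g for g in entrez_ids if entrez_to_cui.get(g) == cui]
    if equivalent:
        # Equivalent pair: one record, preferring the Entrez ID
        return [("ncbigene", equivalent[0])]
    # No cross-mapping: one record per ID, the UMLS CUI included
    return [("umls", cui)] + [("ncbigene", g) for g in entrez_ids]

# Cross-mappings as in the MyGene response discussed below (assumed)
lookup = {"100616102": "C3147204", "100862688": "C3147206"}

# None of the Entrez IDs map to C0012892, so all IDs are kept
split_umls_plus_entrez("C0012892", ["100616102", "100862688"], lookup)
# -> [("umls", "C0012892"), ("ncbigene", "100616102"), ("ncbigene", "100862688")]

# An equivalent pair collapses to the Entrez ID alone
split_umls_plus_entrez("C3147204", ["100616102"], lookup)
# -> [("ncbigene", "100616102")]
```

Whether the non-equivalent case should also emit the UMLS record (as sketched here) or drop it is one detail the policy would still need to pin down.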
I think it's an interesting idea. Would we want to use MyGene, rather than Node Normalizer?
For example, one can query either the entrezgene field and then look at the umls field or vice versa...
Here's an example using the pair `'C0012892|100616102|100862685|100862688'` / `'DNA-Directed DNA Polymerase|ERVK-9|ERVK-19|ERVK-11'` from the samples above. POST to https://mygene.info/v3/query?fields=entrezgene,umls,symbol,name,taxid:
```json
{
  "q": "100616102,100862685,100862688",
  "scopes": "entrezgene"
}
```
Response. Notice that none of the `umls` IDs returned match C0012892 / DNA-Directed DNA Polymerase:
```json
[
  {
    "query": "100616102",
    "_id": "100616102",
    "_score": 26.72278,
    "entrezgene": "100616102",
    "name": "endogenous retrovirus group K member 9",
    "symbol": "ERVK-9",
    "taxid": 9606,
    "umls": {
      "cui": "C3147204"
    }
  },
  {
    "query": "100862685",
    "notfound": true
  },
  {
    "query": "100862688",
    "_id": "100862688",
    "_score": 25.927315,
    "entrezgene": "100862688",
    "name": "endogenous retrovirus group K member 11",
    "symbol": "ERVK-11",
    "taxid": 9606,
    "umls": {
      "cui": "C3147206"
    }
  }
]
```
> Would we want to use MyGene, rather than Node Normalizer?
I think we should use Node Normalizer (assuming we can figure out batch querying via POST). Unless there is any other discussion or dissent, @erikyao please implement this behavior that I described in this comment.
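Batch querying the SRI Node Normalizer is done via a POST to its `get_normalized_nodes` endpoint with a list of CURIEs. The sketch below shows one way the payload could be built and the response mined for equivalent CUIs; it is an assumption-laden sketch (the trimmed response is canned, and the payload/response shapes should be confirmed against the service's OpenAPI docs), not the agreed implementation.

```python
# Hypothetical sketch of batch-querying the SRI Node Normalizer
# (POST /get_normalized_nodes with a {"curies": [...]} body).

def build_payload(entrez_ids):
    """Build the POST body for a batch of NCBI Gene IDs."""
    return {"curies": [f"NCBIGene:{g}" for g in entrez_ids]}

def equivalent_cuis(response, curie):
    """Extract the UMLS CUIs listed among a node's equivalent identifiers."""
    entry = response.get(curie) or {}
    ids = [eq["identifier"] for eq in entry.get("equivalent_identifiers", [])]
    return [i.split(":", 1)[1] for i in ids if i.startswith("UMLS:")]

payload = build_payload(["100616102"])
# payload == {"curies": ["NCBIGene:100616102"]}

# A trimmed, canned response of the shape the service returns:
canned = {
    "NCBIGene:100616102": {
        "equivalent_identifiers": [
            {"identifier": "NCBIGene:100616102"},
            {"identifier": "UMLS:C3147204"},
        ]
    }
}
equivalent_cuis(canned, "NCBIGene:100616102")  # ["C3147204"]
```

If that extracted CUI matches the piped UMLS CUI, the pair would be treated as equivalent per the policy Andrew described.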
Noting how RTX-KG2 is doing it:
@andrewsu @erikyao did we ever decide on the one-to-many retired ID issue (andrew's post, my post)? This is before we get into pipes, where 1 retired ID is mapped to multiple current IDs.
Hi @colleenXu , I think @andrewsu suggested replacement with all the mapped new IDs. Quote:
> E.g., if semmed says A -> B_ret and MRCUI says B_ret -> C and B_ret -> D, then we would create 2 predications for A->C and A->D
Given the small expansion in triples based on Yao's updated comment, yes, I think we proceed with the plan @erikyao quoted above:

> E.g., if semmed says A -> B_ret and MRCUI says B_ret -> C and B_ret -> D, then we would create 2 predications for A->C and A->D
File `semmedVER43_2022_R_PREDICATION.csv` contains 117,589,597 rows. After removing rows with `SUBJECT_NOVELTY == 0` or `OBJECT_NOVELTY == 0`, 81,282,024 rows remain. Among those rows, there are 303,080 unique subject CUIs and 262,268 unique object CUIs (piped CUIs decomposed and counted).

Following the MRCUI.RRF data analysis, we found that, for subject CUIs, the counts and ratios of retired CUIs are:
and for object CUIs,
It's a safe bet to consider only the deleted and bijectively mapped CUIs. It's also worth considering only mappings with the `SY` relationship.
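Reading MRCUI.RRF for just the deletions and `SY` mappings could look like the sketch below. The pipe-delimited column order (`CUI1|VER|REL|RELA|MAPREASON|CUI2|MAPIN|`) follows the UMLS documentation but should be verified against the release in use; the function and sample rows are illustrative.

```python
# Hypothetical sketch: collect deleted CUIs and SY (synonymous)
# retirement mappings from MRCUI.RRF.
import csv
import io

def parse_mrcui(lines):
    """Return (deleted CUIs, {retired CUI: [current CUIs]}) from MRCUI rows."""
    deleted, sy_map = set(), {}
    for row in csv.reader(lines, delimiter="|"):
        cui1, _ver, rel, _rela, _reason, cui2 = row[:6]
        if rel == "DEL":
            deleted.add(cui1)
        elif rel == "SY" and cui2:
            sy_map.setdefault(cui1, []).append(cui2)
    return deleted, sy_map

# Two made-up rows in the documented MRCUI.RRF layout
sample = io.StringIO(
    "C0000002|2022AB|DEL|||||\n"
    "C0000003|2022AB|SY|||C0000005|Y|\n"
)
deleted, sy_map = parse_mrcui(sample)
# deleted == {"C0000002"}; sy_map == {"C0000003": ["C0000005"]}
```

Rows with `RB`, `RN`, or `RO` relationships would simply be skipped under this restriction.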