Closed andrewsu closed 1 year ago
It's feasible with a customized RootKeyMergerStorage.
Note that {"on_duplicates" : "merge"}
config in the manifest.json
(as shown below) is NOT capable of solving this problem because it only finds duplicates on _id
.
{
"uploaders" : [
{
"parser" : "<path.to.module>:<function_name_1>",
"on_duplicates" : "ignore|error|merge" # optional, default to "error"
}
]
}
At the 2023-03-01 team meeting, we agreed that the proposed structure would be better. We may need to put in some caps on the number of items in the evidence
array, since I seem to remember that the count can go 10,000+.
So, let's move forward with this ticket and #563 in parallel. Assigning to @erikyao...
Q: Does it mean we shall group all the documents by the triple of (object.umls, predicate, subject.umls)
? Since object.name
and subject.name
are not standardized across the dataset.
A: I think the answer is yes that we want to "group all the documents by the triple of (object.umls, predicate, subject.umls)
". Beyond that, I think we want to do what's easiest during this exploratory period. So I think it'd be great to add all unique values in an array for object.name
and subject.name
, but choosing one arbitrarily is fine for the moment too.
Q: Would you prefer taking the semtypes into account? I.e. grouping by (object.umls, object.semtpye, predicate, subject.umls, subject.semtype)
? Since one UMLS CUI may be mapped to multiple semtypes.
A: My gut says that we can also just combine unique values of semtypes in an array as well, but I'd like @colleenXu to chime in on whether that's an issue for the SmartAPI annotation. If that is a problem, then perhaps we need to aggregate based on unique records for (object.umls, object.semtype, predicate, subject.umls, subject.semtype)
.
{
'_id': 'C0033011-PART_OF-C0032105',
'predication_id': [
27232377,
31238107,
40075886,
72239153,
72300825,
72303926,
72715980
],
'pmid': [27753038, 1268148, 4276847, 23690098, 26287784, 26287791, 24659861],
'predicate': 'PART_OF',
'subject': {
'umls': 'C0033011',
'name': 'Plasma',
'semantic_type_abbreviation': 'bdsu',
'semantic_type_name': 'Body Substance',
'novelty': 1
},
'object': {
'umls': 'C0032105',
'name': 'Pregnant Women',
'semantic_type_abbreviation': 'humn',
'semantic_type_name': 'Human',
'novelty': 1
}
}
{
'_id': 'C0012524-PART_OF-C0012854',
'predication_id': [72324048, 72324366],
'pmid': 16626110,
'predicate': 'PART_OF',
'subject': {
'umls': 'C0012524',
'name': 'DNA',
'semantic_type_abbreviation': 'bacs',
'semantic_type_name': 'Biologically Active Substance',
'novelty': 1
},
'object': {
'umls': 'C0012854',
'name': 'Catechol Oxidase',
'semantic_type_abbreviation': 'aapp',
'semantic_type_name': 'Amino Acid, Peptide, or Protein',
'novelty': 1
}
},
So I'm guessing that most records will be like the first example (7 Predications with 7 PMIDs), or at least most people will assume that is the more common pattern. And I get that it can get even more confusing than the second example -- when we have 3 predications and 2 PMIDs, then we lose the mappings between predications and PMIDs. Having said all that, I think this structure is fine. For now, the pmid
field will be interpreted as "supporting articles", and just showing the set of unique values from which the predications are drawn will be just fine.
For the second example, we can also make it
{
"predicate": "PART_OF"
"predication": [
{
"predication_id": 72324048,
"pmid": "16626110",
# "sentence_id": "omitted"
# "sentence": "omitted"
},
{
"predication_id": 72324366,
"pmid": "16626110",
# "sentence_id": "omitted"
# "sentence": "omitted"
},
],
"subject": { ... }
"object": { ... }
}
Oh right, that's even better. That's similar to what I proposed in my initial comment, but apparently that flash of brilliance quickly escaped my brain... ;)
Code is forked to my own repo: https://github.com/erikyao/semmeddb2. An initial fix is committed already. Will deploy as a new API semmeddb2
tonight.
Considering ITRB might not sync the code change over the weekend. I will redirect access to semmeddb2
to the old biothings.ncats.io
server (some NGINX config change needed).
I've made a separate SmartAPI yaml registration for this API: https://smart-api.info/registry?q=5634bdb9a1315f85f595979eaf90a504
It has the same operations/MetaEdges as the current SEMMEDDB API.
The main differences are:
requestBody
values will generate queries with a pmid_count:>5
filter (which requires an empty-scopes format)infores:biothings-semmeddb-2
(which isn't a true infores) in the yaml's info.x-translator.infores
field. This was meant to avoid confusion with the current semmeddb parameters.fields
values and response-mapping are modified so the following fields would be ingested by BTE
pmid
field)requestBody
values don't include subject.novelty
/ object.novelty filtering because according to Yao it's unnecessary (his parser already removes records where novelty != 1). closing this one as complete
Updated the notebook / yamls https://github.com/NCATS-Tangerine/translator-api-registry/commit/794ffc25f5f6cfbb3c27b5fd87dcb485d19ac610 + registration so BTE will ingest the supporting sentences and so BTE will use the filter >1 (agreed upon by Translator (Slack link)).
Next step is moving BTE from using the older "semmeddb" to using "semmeddb2"...
The separate registration linked above has been DELETED. From now on, use the semmeddb registration ID to access "semmeddb2" functionality https://smart-api.info/registry?q=1d288b3a3caf75d541ffaae3aab386c8.
This is because we decided that we liked this refactored document structure, and we want BTE to use it. The simplest way to switch BTE over to using semmeddb2 is to modify the registration of "semmeddb" that BTE is currently hooked up to.
This commit was done by replacing the contents of the "semmeddb" files:
Then the registration for "semmedb" was refreshed, and the old semmeddb registration was deleted.
Right now, we have a distinct record for every row in the PREDICATIONS file. For example, we currently have 73 documents for "imatinib TREATS leukemia", each differing by the
pmid
andpredication_id
: https://biothings.transltr.io/semmeddb/query?q=subject.umls:C0935989%20AND%20object.umls:C0023418%20AND%20predicate:TREATSWe could also consider a design in which rows are combined based on unique subjects, predicates, and objects, collapsing the evidence into an array. For the example above:
That would result in fewer documents, but some documents that are potentially very big (especially if we add sentences as is being explored in https://github.com/biothings/BioThings_Explorer_TRAPI/issues/563). BTE does this collapsing eventually anyway, and making that change at the API level would give us the ability to add a
pmid_count
field (which we could use and dynamically adjust in queries). Creating this issue to track other pros/cons, feasibility analysis, etc.