Implement automatic version checking, but keep manual file dumping
(Closes #46) Automated dumping is not possible because the files are behind an authorization portal. Summary of workflow:
Scrape the UMLS site to find the latest version and dump a dummy file (the release notes), which triggers the uploader.
The uploader fails with a message instructing the user to download the zip file manually.
Relevant files are extracted from the zip file using the open_anyfile utility.
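The version-check step might be sketched like this (the URL and the release-string pattern are assumptions for illustration, not the plugin's actual code):

```python
import re
import urllib.request

# Hypothetical page listing UMLS releases; the real download page may differ.
UMLS_PAGE = "https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html"

def find_latest_version(html: str) -> str:
    """Extract the newest release identifier (e.g. '2021AB') from page HTML."""
    versions = re.findall(r"\b(20\d{2}A[AB])\b", html)
    if not versions:
        raise ValueError("no UMLS version string found on page")
    # Lexicographic order matches release order: 2021AA < 2021AB < 2022AA.
    return max(versions)

def check_version() -> str:
    with urllib.request.urlopen(UMLS_PAGE) as resp:
        return find_latest_version(resp.read().decode("utf-8", errors="replace"))
```

If the scraped version is newer than the last dumped one, the dumper writes the dummy release-notes file, which is what kicks off the (intentionally failing) upload.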
Revise document merging strategy
Previously, documents were merged with on_duplicates set to "ignore", which silently drops duplicates and is not the best merging strategy here.
Duplicate _ids happen because we query mydisease.info to fetch the _id of documents, and UMLS can have a many-to-many relationship with the primary key.
For example:
MONDO:0005160 is mapped to multiple CUIs: ['C0003486', 'C0265010', 'C0265012', 'C0741160', 'C1305122'], which results in multiple UMLS documents with _id = MONDO:0005160
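The duplication shows up directly in the batch-query hits: several CUIs resolve to one _id, and grouping the hits exposes it. A sketch (the hit shape follows the BioThings batch-query response; the CUI values are from the example above):

```python
def group_hits(hits):
    """Group batch-query hits by _id; any _id with more than one query term
    is a many-to-one mapping that will yield duplicate documents."""
    mapping = {}
    for hit in hits:
        if "_id" in hit:  # "notfound" hits lack an _id
            mapping.setdefault(hit["_id"], []).append(hit["query"])
    return mapping

# Hits as returned by a batch query against mydisease.info (shape assumed):
hits = [
    {"query": "C0003486", "_id": "MONDO:0005160"},
    {"query": "C0265010", "_id": "MONDO:0005160"},
    {"query": "C0265012", "_id": "MONDO:0005160"},
]
# group_hits(hits) -> {"MONDO:0005160": ["C0003486", "C0265010", "C0265012"]}
```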
Implemented solution
Print logging messages to inform when documents have duplicate _ids
Resolve the duplicate _id issue by using MergerStorage, which combines individual fields.
In the above case, the merged document looks like this:
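MergerStorage comes from the BioThings SDK; a simplified sketch of the field-combining behavior that replaces "ignore", using hypothetical flat documents (the real UMLS documents have more fields):

```python
def merge_docs(docs):
    """Combine documents sharing an _id: values for the same key are
    collected into a de-duplicated list, mimicking field-level merging."""
    merged = {"_id": docs[0]["_id"]}
    for doc in docs:
        for key, value in doc.items():
            if key == "_id":
                continue
            existing = merged.setdefault(key, [])
            values = value if isinstance(value, list) else [value]
            for v in values:
                if v not in existing:
                    existing.append(v)
    return merged

# Two hypothetical per-CUI documents sharing one _id:
doc_a = {"_id": "MONDO:0005160", "cui": "C0003486"}
doc_b = {"_id": "MONDO:0005160", "cui": "C0265010"}
# merge_docs([doc_a, doc_b])
# -> {"_id": "MONDO:0005160", "cui": ["C0003486", "C0265010"]}
```

With this strategy no document is dropped; the per-CUI fields accumulate under the single MONDO:0005160 record instead.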
Additional fixes:
Added disgenet.xrefs.umls as an additional scope when querying for _id.
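Adding the second scope amounts to extending the scopes parameter of the batch query. A sketch of building that request payload (the endpoint and parameter names follow the public mydisease.info batch-query API; the helper itself is hypothetical):

```python
import urllib.parse

MYDISEASE_QUERY = "https://mydisease.info/v1/query"

def build_query_payload(cuis, scopes=("umls.cui", "disgenet.xrefs.umls")):
    """Form-encode the POST body for a batch _id lookup: match each CUI
    against every listed scope, returning only the _id field."""
    return urllib.parse.urlencode({
        "q": ",".join(cuis),
        "scopes": ",".join(scopes),
        "fields": "_id",
    })

# POSTing build_query_payload(["C0003486", "C0265010"]) to MYDISEASE_QUERY
# returns one hit per CUI, now matched via either umls.cui or disgenet.xrefs.umls.
```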