Georgetown-IR-Lab / QuickUMLS

System for Medical Concept Extraction and Linking
MIT License

Installing takes too much memory #64

Open fschlatt opened 3 years ago

fschlatt commented 3 years ago

When running python -m quickumls.install on an MRCONSO.RRF file with about 7M rows, the memory footprint grows continuously and at some point the process is killed for using too much memory. The two main culprits I could find are the processed (https://github.com/Georgetown-IR-Lab/QuickUMLS/blob/c0b5db059fbef8d70681626a34456ab3d906e5e7/quickumls/install.py#L66) and simstring (https://github.com/Georgetown-IR-Lab/QuickUMLS/blob/c0b5db059fbef8d70681626a34456ab3d906e5e7/quickumls/install.py#L113) sets.

I assume they are there to prevent duplicate entries in the SimString and CuiSemType DBs. When using the unqlite database, a check for duplicate entries is implemented on the insert call, so duplicate entries are a non-issue there. However, I am not sure whether the same is true for the SimString database. Is it safe to add duplicate terms/n-grams to the SimString database, or will that break anything? If it is safe, the large sets could be removed, eliminating their memory overhead for large UMLS subsets.
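To illustrate the trade-off raised here: a set that remembers every (term, CUI) pair grows with the number of distinct rows. One generic way to shrink such a set, sketched below, is to keep a fixed-size digest of each pair instead of the strings themselves. This is not code from quickumls/install.py; the function name iter_unseen and the 8-byte digest size are assumptions made for illustration, and a hash collision would (very rarely) cause a genuinely new pair to be skipped.

```python
import hashlib


def iter_unseen(pairs, digest_size=8):
    """Yield each (term, cui) pair the first time it appears.

    Hypothetical helper: only a small integer fingerprint per pair is
    retained in memory, not the original strings.
    """
    seen = set()
    for term, cui in pairs:
        # The NUL separator keeps ("ab", "c") distinct from ("a", "bc").
        digest = hashlib.blake2b(
            f"{term}\x00{cui}".encode("utf-8"), digest_size=digest_size
        ).digest()
        fingerprint = int.from_bytes(digest, "big")
        if fingerprint not in seen:
            seen.add(fingerprint)
            yield term, cui
```

The set still grows with the number of distinct pairs, only with a smaller per-entry cost, so this reduces the footprint rather than bounding it.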

CatalinaZ16 commented 3 years ago

Hi! Have you solved that problem? I have the same one :(

fschlatt commented 3 years ago

Sort of. At the cost of including some duplicates in the SimString database, I was able to reduce the RAM footprint significantly. It now runs for the whole UMLS on my machine with 16 GB of RAM. Take a look at my fork of the repository for the fixes.
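A minimal sketch of this kind of trade-off, not necessarily what the fork does: instead of one global set, de-duplicate only within each run of rows that share a CUI, so the set never holds more than one concept's terms. It assumes the standard pipe-delimited MRCONSO.RRF layout (CUI in field 0, STR in field 14) and that rows for the same CUI are adjacent in the file. The name iter_unique_terms is hypothetical.

```python
from itertools import groupby


def iter_unique_terms(lines):
    """Yield (term, cui) pairs, de-duplicating only within each CUI group.

    Assumes rows with the same CUI are adjacent in the input.
    """
    rows = (line.rstrip("\n").split("|") for line in lines)
    for cui, group in groupby(rows, key=lambda row: row[0]):
        seen = set()  # small: holds only the terms of a single concept
        for row in group:
            term = row[14]  # STR column in the assumed MRCONSO layout
            if term not in seen:
                seen.add(term)
                yield term, cui
```

A term that appears under two different CUIs is yielded twice, so a consumer that inserts every yielded term into a term index will see some duplicates. That is the cost being accepted in exchange for the lower memory use.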

soldni commented 3 years ago

Hi Ferdinand,

Thank you so much for following up on this! Would you be willing to make a pull request for this? I would be happy to review it and merge it into the core package.

Best, Luca

fschlatt commented 3 years ago

Hey Luca,

Sure thing. I've also changed it to return the preferred term and applied black formatting to the repo, so there are a couple of additional changes. I'll create a pull request with my entire fork, and we can discuss there which parts are necessary and which are superfluous.

Best, Ferdinand

soldni commented 3 years ago

Great, I'll try to review over the weekend!

Best, Luca

jimhavrilla commented 3 years ago

It seems this commit from @fschlatt may be the fix: https://github.com/Georgetown-IR-Lab/QuickUMLS/commit/76513933f5a311b2d2c4da06b16314f65c646e22 (I had to drastically increase my RAM for the install as well.)

jmugan commented 3 years ago

I ran into this as well. I have 16 GB of memory. Is the recommended approach implementing the changes from the comment above?

jmugan commented 3 years ago

I got it to work by being more selective about what I extracted from UMLS.
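For anyone wondering what being more selective could look like in practice, here is one possible sketch: filter MRCONSO.RRF down to the languages and source vocabularies you need before running the installer. This is an assumption about the approach, not a description of what was actually done above. It relies on the standard pipe-delimited MRCONSO layout (LAT in field 1, SAB in field 11), and the name filter_mrconso is hypothetical.

```python
def filter_mrconso(lines, languages=("ENG",), sources=None):
    """Yield only the MRCONSO rows in the requested languages.

    If `sources` is given, also require the row's source vocabulary (SAB)
    to be in it. Rows are yielded unchanged.
    """
    for line in lines:
        fields = line.split("|")
        lat, sab = fields[1], fields[11]
        if lat not in languages:
            continue
        if sources is not None and sab not in sources:
            continue
        yield line
```

A typical use would be to open the original file for reading and a new file for writing, then pass the result through writelines, for example out.writelines(filter_mrconso(src, sources={"SNOMEDCT_US"})). The source abbreviation here is only an example value.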