Georgetown-IR-Lab / QuickUMLS

System for Medical Concept Extraction and Linking
MIT License

Any suggestions on how to scale QuickUMLS? #54

Closed: svjan5 closed this issue 4 years ago

svjan5 commented 4 years ago

I encounter the error described in issue #16 when I query QuickUMLS from multiple threads. If I restrict the number of threads to 1, everything works perfectly. Do you have any suggestions for scaling QuickUMLS so that it can process multiple documents simultaneously?

Currently, I am using a very ad hoc solution: I create multiple copies of the QuickUMLS folder and assign each thread to a different copy. This works, but it wastes a lot of disk space. Any better solution would be highly appreciated.

Thanks and regards,

soldni commented 4 years ago

Hi Shikhar,

Can you explain the use case for querying QuickUMLS from multiple threads? Are you trying to call QuickUMLS repeatedly to annotate a large number of documents, or is QuickUMLS a step in a longer document processing pipeline that needs to run in parallel?

If it is the former, QuickUMLS is unfortunately unable to run in a multi-processing mode due to its dependency on the current leveldb client. One option would be to replace the client with another implementation that supports read-only snapshots, like Plyvel, or to switch to another database that supports read-only access, like RocksDB. I can look into that, and we welcome PRs if you can get it up and running, too!

If the issue is just querying QuickUMLS from multiple processes, I would take advantage of the existing client-server architecture instead. That’s documented here, but in short: you can spin up a server instance of QuickUMLS, and query that from multiple clients.
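As a rough illustration of the client-server route, here is a minimal sketch that fans documents out to an already running server from a pool of client threads. The function name annotate_via_server and the get_client parameter are hypothetical helpers, not part of QuickUMLS; the sketch only assumes that get_client returns an object with a match(text) method, which is how the client returned by get_quickumls_client is used in the README.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate_via_server(docs, get_client, workers=4):
    """Annotate docs by querying a running QuickUMLS server from several threads.

    get_client is any zero-argument callable returning an object with a
    .match(text) method. With QuickUMLS installed and a server running,
    quickumls.get_quickumls_client is a candidate (assumption: creating a
    client per document is cheap enough for your workload).
    """
    def annotate(text):
        client = get_client()      # one short-lived client per document
        return client.match(text)  # the matching itself happens on the server

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input documents
        return list(pool.map(annotate, docs))


# Hypothetical usage, with the server started separately:
# from quickumls import get_quickumls_client
# results = annotate_via_server(list_of_texts, get_quickumls_client)
```

Whether this gives any speedup depends on how many requests the server can handle concurrently, so it is worth measuring before relying on it.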

Looking forward to hearing more about your use case!

Best, Luca

svjan5 commented 4 years ago

Hi @soldni, thanks for your response. I am currently working on implementing something like bert-as-service for medical entity linking. I am including your model as one of the options, but currently it crashes as soon as I increase the number of workers above 1. It would be great if you could add that feature to your code so that I can include it in my work directly.

The code will be released here within a few days from now. I hope you will like the work.

Regarding the client-server architecture, I think the server will be capable of handling only one request at a time. Hence, it will also become a bottleneck when scaling.

soldni commented 4 years ago

Hi @svjan5,

I ended up including an alternative database option (unqlite, which I discovered today!) that supports multi-threaded and multi-process reads. Could you please test it from the soldni/conc branch? It does require you to rebuild the QuickUMLS index using python -m quickumls.install -d unqlite, but there are no changes to the APIs.
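For reference, a minimal sketch of what concurrent use could look like once the index supports parallel reads: each worker thread builds its own matcher once and reuses it for every document it handles. The names annotate_parallel and make_matcher are hypothetical, and the index path in the usage comment is a placeholder; the only assumption about QuickUMLS is that a matcher exposes match(text).

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def annotate_parallel(docs, make_matcher, workers=4):
    """Annotate docs with one matcher per worker thread.

    make_matcher is a zero-argument callable that builds a matcher, i.e. any
    object with a .match(text) method. Loading an index is assumed to be
    expensive, so each thread builds its matcher once and then reuses it.
    """
    local = threading.local()  # per-thread storage for the matcher

    def annotate(text):
        if not hasattr(local, "matcher"):
            local.matcher = make_matcher()  # first document seen by this thread
        return local.matcher.match(text)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # results come back in input order
        return list(pool.map(annotate, docs))


# Hypothetical usage against an index rebuilt with the unqlite backend:
# from quickumls import QuickUMLS
# results = annotate_parallel(texts, lambda: QuickUMLS("/path/to/index"))
```

Threads in CPython share the GIL, so for CPU-bound matching a process pool built along the same lines (one matcher per process) may scale better.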

Please let me know how it goes!

Best, Luca

svjan5 commented 4 years ago

Thanks, @soldni. I will definitely test it and let you know how it works.

soldni commented 4 years ago

Released 1.4.0