AstraZeneca / KAZU

Fast, world class biomedical NER
https://AstraZeneca.github.io/KAZU/
Apache License 2.0

Extending Kazu #20

Open raylite opened 9 months ago

raylite commented 9 months ago

I am new to Kazu and quite fascinated by it. Is Kazu flexible enough that a developer can bring in an additional (or custom) ontology/knowledge base alongside what's already in use for a given entity type, or even disable what's built into Kazu and use a different one instead?

EFord36 commented 9 months ago

oops - just hit the 'comment and close issue button' by accident midway through writing a reply, sorry! real reply pending

EFord36 commented 9 months ago

Yes, Kazu is very flexible - the downside is that it's so flexible, we haven't yet done a great job of documenting all that flexibility.

In order to bring an additional (custom) ontology/kb, you would need to build your own model pack. Doing this is something we have had on our backlog to document for a while, but don't have anything good yet unfortunately.

One note is that we are currently in the process of releasing a new version of Kazu, 2.0. This doesn't change much for a user of the default model pack, but it changes some of the details of providing config for 'curating' knowledge bases, e.g. to filter out bad synonyms for NER. How urgently are you looking at this? If I waited until the new version is out sometime next week to give you a proper guide, would that be ok for you, or would you rather have something to get you started sooner, even if it means some re-work if you want to upgrade to 2.0 later?

EFord36 commented 9 months ago

If you only want to disable some of the existing ontologies, there is a considerably simpler way of doing it. The downside is that the string matching facilities of Kazu will still have the disabled ontology 'baked in' (which will take up memory, but shouldn't affect compute much) unless the model pack is rebuilt. Is this something you're interested in, or is it mainly adding additional ontologies, and therefore building a custom model pack?

raylite commented 9 months ago

Yes, I can wait until the new version is out, so that I am working with the latest version once and for all. Next week is fine for me. I work day to day in this domain and have built a similar tool for my org. I see common approaches, themes and packages like ahocorasick, but Kazu appears more mature and more robust at fuzzy matching, particularly when terms overlap. So I am thinking: why re-invent the wheel if I can build on and extend Kazu for my local needs?

raylite commented 9 months ago

At some point I may need to do both, but I will give priority to adding a custom kb.

EFord36 commented 9 months ago

Sounds good - in which case I think waiting for the new model pack and release is best.

Incidentally, Kazu actually uses pyahocorasick "under the hood" for its exact string matching in the MemoryEfficientStringMatchingStep, so it provides functionality on top of it.
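For readers unfamiliar with the technique, here is a minimal pure-Python sketch of Aho-Corasick style exact matching of many terms in a single pass over the text. It is only an illustration of the idea: it is not Kazu's code and it does not use pyahocorasick's API, and the names `build_automaton` and `find_terms` as well as the example terms are made up for this sketch.

```python
from collections import deque


def build_automaton(terms):
    """Build a trie over the terms and add failure links (hypothetical helper)."""
    goto = [{}]   # goto[node][char] -> child node id
    fail = [0]    # fail[node] -> node for the longest proper suffix in the trie
    out = [[]]    # out[node] -> terms that end when this node is reached
    for term in terms:
        node = 0
        for ch in term:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(term)

    # Breadth-first pass so that shallower nodes are finished before deeper ones.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit matches from the suffix node, e.g. "cancer" in "lung cancer".
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_terms(text, automaton):
    """Return (start, end, term) for every occurrence, overlaps included."""
    goto, fail, out = automaton
    matches = []
    node = 0
    for end, ch in enumerate(text):
        # Follow failure links until some suffix can be extended by ch.
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for term in out[node]:
            matches.append((end - len(term) + 1, end + 1, term))
    return matches


automaton = build_automaton(["egfr", "egf", "lung cancer", "cancer"])
for start, end, term in find_terms("egfr mutations in lung cancer", automaton):
    print(start, end, term)
```

Note that the sketch reports overlapping and nested hits (both "lung cancer" and "cancer"); deciding which of those to keep is a separate step that this example does not attempt.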

EFord36 commented 8 months ago

To keep you in the loop: it's taken me a little longer this week to progress the next release, but we're making good progress, and it should be out sometime next week. Sorry for the delay!

raylite commented 8 months ago

Thanks for the update.

On Fri, 9 Feb 2024, 09:22, Elliot Ford wrote:

To keep you in the loop, it's taken me a little longer this week to progress the next release, but we're making good progress, should be sometime next week. Sorry for the delay!
