CogStack / MedCAT

Medical Concept Annotation Tool

CU-8696nbm9j: Add module to convert vocab vectors #504

Closed mart-r closed 3 hours ago

mart-r commented 1 week ago

Adds a module to convert the vocab vectors from the default length (or really any length) to a smaller one.

The default vocab vector length is 300. However, we don't really make use of all this information. Experiments show that we can use considerably shorter vocab vectors and retain the same performance. See e.g.: https://gist.github.com/mart-r/e9db909cde1922464bcc753f54006994 Or (somewhat more comprehensively): https://gist.github.com/mart-r/21460286466d17b9f23719ba3f4dc938
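For illustration only, here is a minimal sketch of one way to shrink 300-length vectors to a smaller length, using a PCA-style projection computed with NumPy's SVD. This is an assumption about a reasonable technique, not necessarily what the module in this PR does. The function name `reduce_vectors`, the random stand-in data and the `small_vocab` dict are all hypothetical and not part of MedCAT.

```python
from typing import Tuple

import numpy as np


def reduce_vectors(
    vectors: np.ndarray, target_dim: int
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Project row vectors onto their top `target_dim` principal components.

    Hypothetical helper (not MedCAT API). Returns the reduced vectors,
    the projection matrix and the mean that was subtracted.
    """
    if not 0 < target_dim <= vectors.shape[1]:
        raise ValueError("target_dim must be between 1 and the original length")
    mean = vectors.mean(axis=0)
    centered = vectors - mean
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projection = vt[:target_dim].T  # shape: (original_dim, target_dim)
    return centered @ projection, projection, mean


# Random stand-in for a real vocab: 1000 words with 300-length vectors.
rng = np.random.default_rng(0)
words = [f"word{i}" for i in range(1000)]
vectors = rng.normal(size=(1000, 300))

reduced, projection, mean = reduce_vectors(vectors, target_dim=50)
small_vocab = dict(zip(words, reduced))
```

Any other vectors that live in the same space as the vocab vectors would presumably need the same `mean` and `projection` applied, which is why the sketch returns them alongside the reduced vectors.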

The benefits of using a smaller vocab vector size mainly boil down to (examples at a vector size of 50):

NOTE: There might be improvements we could do here:

tomolopolis commented 1 week ago

Task linked: CU-8696nbm9j Support changing of vocab vector size