CogStack / MedCAT

Medical Concept Annotation Tool

`cat.cdb.print_stats()` returns empty #464

Closed LWserenic closed 1 month ago

LWserenic commented 1 month ago

Hello @w-is-h, I am following the MedCAT Tutorial and trying to run unsupervised training with my own data. When I run cat.cdb.print_stats() it returns nothing, just empty output. Is this normal when using a custom CDB and a spaCy model, or not? An extra question: is the training supposed to be very slow? Is there a way to speed it up? Thank you
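For reference, this is roughly what I'm running (a minimal sketch based on the tutorial; the CDB/Vocab paths and the texts are placeholders for my own data):

```python
from medcat.cat import CAT
from medcat.cdb import CDB
from medcat.vocab import Vocab

# Placeholder paths for my custom CDB and Vocab built per the tutorial
cdb = CDB.load("custom_cdb.dat")
vocab = Vocab.load("vocab.dat")

# Build the annotator; the spaCy model is taken from the config
cat = CAT(cdb=cdb, config=cdb.config, vocab=vocab)

# Unsupervised (self-supervised) training over an iterable of raw texts
texts = ["Patient presents with type 2 diabetes mellitus.",
         "No evidence of myocardial infarction."]
cat.train(texts)

# This is the call that produces no visible output for me
cat.cdb.print_stats()
```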

mart-r commented 1 month ago

In recent versions of MedCAT the method prints the stats through the logger. So if you want to see them in the stream output, you need to add a stream handler, or you can use the built-in method to do so.

You can do this manually, just for the CDB logger:

```python
import logging

from medcat.cdb import logger as cdb_logger

# Attach a stream handler so the CDB stats are printed to the console
cdb_logger.addHandler(logging.StreamHandler())
```

Or, in general:

```python
from medcat import add_default_log_handlers

# Add MedCAT's default log handlers for the whole package
add_default_log_handlers()
```
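With a handler in place, calling the method again should show the stats in the console (a quick sketch, assuming the handler was added before the call):

```python
# The stats are logged, so they now reach the attached handler
cat.cdb.print_stats()
```
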
LWserenic commented 1 month ago

Hello @mart-r, thank you so much for the answer. Could you please update the MedCAT Tutorials, since they still use the cat.cdb.print_stats() function? Also, slightly off the topic of the title: what type of model is created by the cat.create_model_pack() function?

mart-r commented 1 month ago

> Hello @mart-r, thank you so much for the answer. Could you please update the MedCAT Tutorials, since they still use the cat.cdb.print_stats() function?

The print_stats method is still the way to go at that point in the tutorials. And while we do have a tutorial about logging, we need to make sure it's either used in that section or (at the very least) mentioned where the method is used. On top of that, it could make sense to return this result as well as logging it.

> Also, slightly off the topic of the title: what type of model is created by the cat.create_model_pack() function?

I'm not sure what the "type" is referring to. This method is used to save a MedCAT model pack. That is, once you've saved your model, you can load it again and reuse it.

LWserenic commented 1 month ago

Hi @mart-r, what I meant was the model type of the model created with the cat.create_model_pack() function, e.g. is it a BiLSTM, a Transformer model, etc.? But I guess it is meant as an annotator, right, so I don't need MedCATtrainer to manually annotate things? Maybe I'm still confused about the MedCAT module in general (or rather NER+L in general), for example: what is the purpose of the unsupervised learning?

mart-r commented 1 month ago

Perhaps the create_model_pack name isn't the best at conveying what it does. It does not create anything new in the sense of creating a model. All it does is save the existing model to disk at the location you wish, so that you can load it back up and reuse it after having trained it.
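As a rough sketch (the directory path is a placeholder, and I'm assuming a recent MedCAT version where create_model_pack returns the pack name and load_model_pack accepts the resulting .zip):

```python
from medcat.cat import CAT

# Save the current model (CDB, Vocab, config, any MetaCATs) as a model pack on disk
pack_name = cat.create_model_pack("/path/to/models")

# Later: load the saved pack back and reuse the model
cat = CAT.load_model_pack(f"/path/to/models/{pack_name}.zip")
```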

> Hi @mart-r, what I meant was the model type of the model created with the cat.create_model_pack() function, e.g. is it a BiLSTM, a Transformer model, etc.?

So to answer this: it saves the MedCAT model, one that has (potentially) been trained to perform the NER+L task. The internals of the model depend on the configuration. The MetaCATs included could be Bi-LSTM or BERT based, but not all models necessarily contain MetaCATs.

> But I guess it is meant as an annotator, right, so I don't need MedCATtrainer to manually annotate things? Maybe I'm still confused about the MedCAT module in general (or rather NER+L in general), for example: what is the purpose of the unsupervised learning?

You do indeed need a MedCAT model to start annotating using MedCATtrainer. At the very least, the model provides MedCATtrainer with the ontology / terminology used.

As for the rest, I would suggest you read the original paper if you haven't done so already: https://doi.org/10.1016/j.artmed.2021.102083. The rationale and procedure for self-supervised training haven't really changed.

LWserenic commented 1 month ago

Hello @mart-r, thank you for your response. After reading a bit of the paper I am still a bit confused about the difference between using the Meta Annotations inside MedCATtrainer and using supervised learning via the cat.train_supervised_from_json() function. Are they fundamentally the same, or do you need to do both in the pipeline? I ask this in relation to Fig. 2 and Fig. 3 of the paper, which show that after the model is trained unsupervised (or self-supervised) it goes straight to MedCATtrainer, where it can already be used, with the experts adding their annotations and the model adapting to them. Or is it that after the model is fine-tuned it goes through another round of training with cat.train_supervised_from_json()?

mart-r commented 1 month ago

Once a model has been trained without supervision, it will go to MedCATtrainer. In the trainer there are generally many different projects where documents are annotated by different domain experts. But MedCATtrainer isn't really used as a trainer, rather as an annotation tool. You can read more about MedCATtrainer on Read the Docs: https://medcattrainer.readthedocs.io/en/latest/

So once the expert annotations are done, the trainer export(s) is/are downloaded from MedCATtrainer, and those can be used to fine-tune a model using the cat.train_supervised_from_json method.
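A minimal sketch of that step (the export path is a placeholder, and exact keyword arguments may differ between MedCAT versions):

```python
# Fine-tune the model on a MedCATtrainer export (downloaded as JSON)
cat.train_supervised_from_json("path/to/medcattrainer_export.json")

# Save the fine-tuned model as a model pack for later reuse
cat.create_model_pack("/path/to/models")
```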

Now, there are tools within MedCATtrainer that allow you to save the model state in the trainer, but that is generally not advised. See e.g: https://medcattrainer.readthedocs.io/en/latest/project_admin.html#save-models

LWserenic commented 1 month ago

So we don't train the model in MedCATtrainer, but rather give the model new (or better) labels to track (hence where the "supervised" in supervised training comes from). I may be wrong with this conclusion, but I think that's how it works? Or maybe the better way to put it is that MedCATtrainer creates the training dataset used for supervised training of the MedCAT model. After that (assuming it trains perfectly), the model can handle any unseen medical document and annotate it better.

mart-r commented 1 month ago

> Or maybe the better way to put it is that MedCATtrainer creates the training dataset used for supervised training of the MedCAT model. After that (assuming it trains perfectly), the model can handle any unseen medical document and annotate it better.

Yes, MedCATtrainer (generally) creates the dataset(s) that will be used to fine-tune using supervised training.

LWserenic commented 1 month ago

Okay, thank you very much for the explanation. Sorry if I went off topic.