asyml / ForteHealth

The project is in the incubation stage and still under development. ForteHealth is a flexible and powerful ML workflow builder for biomedical and clinical scenarios. This is part of the CASL project: http://casl-project.ai/
Apache License 2.0

Update the legends in Stave to show Disease, Medical, etc. #53

Open Leolty opened 2 years ago

Leolty commented 2 years ago

As mentioned in the meeting.

Leolty commented 2 years ago

Besides Disease and Medical, what other annotations can we add?

BioBERT: https://github.com/dmis-lab/biobert

hunterhector commented 2 years ago

It should be based on the ner_type values we can predict; we need to study the model outputs.

Leolty commented 2 years ago

Yeah, here is the problem. In the config of this example, ner_type is set to Disease, so every model output is labelled Disease. If I remove this setting and run the pipeline, every output is labelled "BioEntity" instead; see the default configuration here (Line 235).

I could not find any instructions on how to change the entity type so that different kinds of entities get different labels, instead of all of them being labelled "BioEntity".

Leolty commented 2 years ago

@hunterhector I think I found the problem. In the following file:

https://github.com/asyml/forte-wrappers/blob/main/src/huggingface/fortex/huggingface/bio_ner_predictor.py

I checked the source code of BioBERTProcessor and noticed that the relationship between Line 235 and Line 228 does not seem to make sense. It just labels every entity, whatever its type, as "BioEntity". If I change the configuration to "DISEASE", every entity is labelled "DISEASE" instead, so I can actually set it to whatever I want.

For example, I changed the configuration to "APPLE", like this: ner_type: "APPLE". All the entities were then labelled "APPLE".
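The behaviour described above can be pictured with a minimal sketch. This is not the actual BioBERTProcessor code: `tag_entities` and the example spans are hypothetical, and the only detail taken from the thread is that a single `ner_type` config value (defaulting to "BioEntity") ends up on every entity.

```python
from typing import Dict, List, Optional

# Default label mentioned in the thread; everything else here is illustrative.
DEFAULT_NER_TYPE = "BioEntity"


def tag_entities(
    spans: List[str], config: Optional[Dict[str, str]] = None
) -> List[Dict[str, str]]:
    """Attach the one configured label to every detected span (hypothetical helper)."""
    label = (config or {}).get("ner_type", DEFAULT_NER_TYPE)
    # The label comes from the config, not from the model, so all spans share it.
    return [{"text": span, "ner_type": label} for span in spans]


spans = ["hydrocephalus", "Lexapro"]
print(tag_entities(spans))
print(tag_entities(spans, {"ner_type": "APPLE"}))
```

Under this reading, the config value is a display label rather than a prediction, which is why any string is accepted.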

hunterhector commented 2 years ago

  1. We have more NER detectors; I think @Piyush13y knows where they are.
  2. The adjustable label type is just a simple way to change the label based on the model. It is not the best solution, but it kind of works for now.

Leolty commented 2 years ago

Got it. I remember I once worked on an issue to support bio NER using stanza; I will try that.

Leolty commented 2 years ago

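I tried stanza, and the ner_type of the outputs are as follows:

  • TEST: oxygen saturation / MRI of the head
  • PROBLEM: an underlying restrictive ventilatory defect / hydrocephalus / shift of the normal midline strictures
  • TREATMENT: Lexapro / sublingual nitroglycerin

We could change Disease and Medical to Test, Problem and Treatment.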

hunterhector commented 2 years ago

I tried stanza, and the ner_type of the outputs are as follows:

  • TEST: oxygen saturation / MRI of the head
  • PROBLEM: an underlying restrictive ventilatory defect / hydrocephalus / shift of the normal midline strictures
  • TREATMENT: Lexapro / sublingual nitroglycerin

We could change Disease and Medical to Test, Problem and Treatment.

Yeah, double check with @Piyush13y, since I am sure we also have more spaCy models.

Piyush13y commented 2 years ago

Yes, we have more scispacy models that we can use, and they produce different kinds of NER labels.

Ref: https://allenai.github.io/scispacy/

@Leolty I feel we can't just change the label type, for the reason I mentioned to you guys on the call. We want users to see terms they understand in the legend, not NLP jargon; they wouldn't know what EntityMentions/MedicalEntityMentions mean. Also, adding more attributes (ner_type) to the same annotation would still require changes to the ontology file. We might as well create new annotations for each of the NER types for a smoother demo. At least, that's what I think, especially since it might not take much more time than the adjustable label type approach.
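As a rough illustration of the "user-friendly legend" idea, the sketch below maps raw model labels to display names and groups entities under them. The stanza labels and example mentions come from this thread; `LEGEND_NAMES`, `legend_name` and `group_by_legend` are hypothetical, and how Stave actually builds its legend from the ontology is not shown here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Raw NER labels (as reported for stanza above) -> names shown to users.
LEGEND_NAMES: Dict[str, str] = {
    "TEST": "Test",
    "PROBLEM": "Problem",
    "TREATMENT": "Treatment",
}


def legend_name(ner_type: str) -> str:
    """Return the display name for a label, falling back to the raw label."""
    return LEGEND_NAMES.get(ner_type.upper(), ner_type)


def group_by_legend(entities: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (text, ner_type) pairs under their legend name."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, ner_type in entities:
        grouped[legend_name(ner_type)].append(text)
    return dict(grouped)


entities = [
    ("MRI of the head", "TEST"),
    ("hydrocephalus", "PROBLEM"),
    ("Lexapro", "TREATMENT"),
]
print(group_by_legend(entities))
```

Creating one annotation type per NER type, as suggested above, would amount to having one ontology entry per key of such a mapping.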

Leolty commented 2 years ago

@hunterhector @Piyush13y I found a bug related to Stave. I will describe it here; it is quite small, but it had me stuck for hours.

We have a JSON file like this one: https://github.com/asyml/ForteHealth/blob/50_streamlit_to_stave/examples/search_engine_to_stave/default_onto_project.json

And in the code, we usually create a new project with: session.create_project(project_json)

The project is created successfully, but I cannot open the documents in it; the page keeps loading. So I went through .stave/db.sqlite3 and compared the ontology and config in the table stave_backend_project:

  1. First, I found that the JSON file uses double quotation marks, but in the database they have become single quotation marks. (I changed them back with an SQL statement, which did not help.)
  2. Then, comparing carefully, I found that the config in the JSON file uses true and false, but once stored in the database they have become True and False; JSON requires true and false. (I changed them with an SQL statement, and it works perfectly fine!) I think that's the point: the source code of create_project() should be modified.
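The manual comparison in the two points above could be automated with a small check that tries to parse each stored value as JSON. This is only a sketch under assumptions: the table name comes from the thread, but the column names `ontology` and `config` are inferred from it and the real schema may differ; `find_invalid_json_rows` is a hypothetical helper, and the demo uses an in-memory database rather than .stave/db.sqlite3.

```python
import json
import sqlite3
from typing import List, Tuple


def find_invalid_json_rows(
    conn: sqlite3.Connection,
    table: str = "stave_backend_project",
    columns: Tuple[str, ...] = ("ontology", "config"),  # assumed column names
) -> List[Tuple[int, str]]:
    """Return (rowid, column) pairs whose stored text is not valid JSON."""
    bad: List[Tuple[int, str]] = []
    # Identifiers are interpolated directly, so only pass trusted names.
    query = f"SELECT rowid, {', '.join(columns)} FROM {table}"
    for row in conn.execute(query):
        rowid, values = row[0], row[1:]
        for name, value in zip(columns, values):
            try:
                json.loads(value)
            except (TypeError, ValueError):  # JSONDecodeError is a ValueError
                bad.append((rowid, name))
    return bad


# Demo: one valid JSON value and one Python-repr-style value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stave_backend_project (ontology TEXT, config TEXT)")
conn.execute(
    "INSERT INTO stave_backend_project VALUES (?, ?)",
    ('{"name": "demo"}', "{'flag': True}"),
)
print(find_invalid_json_rows(conn))
```

Pointing `sqlite3.connect` at the real database file would report which projects carry values that a JSON parser rejects.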

hunterhector commented 2 years ago

Hi, @Leolty. Thanks for exploring this. It seems you have found an interesting bug, and I believe it is related to this function. Would you mind creating an issue on Stave to discuss the bug?

The fix could be simple (correcting the quotation marks and case before storing the value in the database). But I am still wondering about the cause and the best solution:

  1. Double vs. single quotation marks: you mentioned that changing this does not fix the problem. I think that's because it is only part of the problem, but it should also be fixed, right?
  2. "True" vs. "true": similar to the above, the JSON spec requires "true". But where does the conversion go wrong in both cases? The JSON file we provided seems to be correct, and create_project simply sends the data via POST. IMO, the best solution is to find out which conversion step causes this, and then find a principled solution from there. Post-fixing the data inside the create_project function should be our last resort.

Leolty commented 2 years ago

Hi, @hunterhector. After checking the function you sent me, I think I know where the bug is. As you mentioned, create_project is correct and the JSON file is correct. The bug occurs when loading the JSON file.

In Python, we usually load a JSON file like this:

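import json

# json.load parses the file into a Python dict, not a JSON string.
with open(file_path) as file_obj:
    project_json = json.load(file_obj)

# Passing the dict directly is what triggers the bug.
create_project(project_json)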

I then passed project_json directly to create_project. project_json is a dict, which results in the single quotation marks and "True".
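A quick way to see the symptom is to compare Python's own text form of a dict with its JSON serialization. The keys in `project` are made up for illustration, and whether Stave stringifies the dict in exactly this way is an assumption; the output simply matches what was observed in the database.

```python
import json

# Hypothetical project content; only the shape matters here.
project = {"name": "demo", "config": {"multi_ontology": True}}

as_repr = str(project)         # Python repr: single quotes, capitalized True
as_json = json.dumps(project)  # JSON text: double quotes, lowercase true

print(as_repr)
print(as_json)

# The JSON text round-trips; the repr text is rejected by a JSON parser.
assert json.loads(as_json) == project
try:
    json.loads(as_repr)
except json.JSONDecodeError as err:
    print("not valid JSON:", err)
```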

Actually, I just need to use the dumps function to fix this, for example:

import json

with open(file_path) as file_obj:
    project_json = json.load(file_obj)

# Serialize the dict back to a JSON string before handing it over.
create_project(json.dumps(project_json))

So I think there is no need to modify the source code. We just need to make sure the argument to the function create_project is a JSON-formatted string (I mean double quotation marks and "true"/"false") instead of a dict.
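One way to make that requirement hard to get wrong is a small guard on the caller's side. `ensure_json_string` below is a hypothetical helper, not part of Stave or ForteHealth; it only relies on the standard json module.

```python
import json
from typing import Any, Mapping, Union


def ensure_json_string(project: Union[str, Mapping[str, Any]]) -> str:
    """Return a JSON string for a dict, or validate and pass through a string."""
    if isinstance(project, str):
        json.loads(project)  # raises ValueError if the text is not valid JSON
        return project
    return json.dumps(project)


print(ensure_json_string({"multi_ontology": True}))
print(ensure_json_string('{"multi_ontology": true}'))

# Intended use, assuming the session object from the thread:
# session.create_project(ensure_json_string(project_json))
```

With this, passing either the loaded dict or the raw file contents yields the same valid JSON payload, and a Python-repr-style string fails early instead of reaching the database.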

hunterhector commented 2 years ago

Sounds good, thanks!