foxlf823 / sodner

Ner or Rel in jsonnet #10

Open shashank140195 opened 1 year ago

shashank140195 commented 1 year ago

Hi, can you please explain what the target "rel" means in the config files? In other words, what is the difference between setting the target to "ner" and setting it to "rel"?

foxlf823 commented 1 year ago

Hello Shashank, you can find "display_metrics: display_metrics[p.target]," in template.libsonnet. "display_metrics" is used by the function "get_metrics" in sodner.py to control which evaluation metrics are shown. If you set the target to "ner", only the results of identifying overlapped entities are printed. If you set the target to "rel", the results for both overlapped and discontinuous entities are printed. This is because identifying overlapped entities doesn't need the relation extraction module. Therefore, "ner" is suitable for datasets that don't contain discontinuous entities, such as ACE05, while "rel" is suitable for datasets that do contain them, such as CADEC.
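
To make the lookup above concrete, here is a minimal sketch of how a target-keyed table could filter the metrics that get reported. It is not the code from template.libsonnet or sodner.py: the table contents, the metric names, and the helper select_metrics are assumptions chosen for illustration.

```python
# Hypothetical table: maps a target to the metric names to display.
# The real lists live in template.libsonnet and may differ.
DISPLAY_METRICS = {
    "ner": ["ner_precision", "ner_recall", "ner_f1"],
    "rel": ["ner_precision", "ner_recall", "ner_f1",
            "rel_precision", "rel_recall", "rel_f1"],
}


def select_metrics(all_metrics, target):
    """Keep only the metrics listed for the given target (illustrative helper)."""
    wanted = DISPLAY_METRICS[target]
    return {name: value for name, value in all_metrics.items() if name in wanted}
```

With this sketch, select_metrics(metrics, "ner") drops every rel_* entry, while select_metrics(metrics, "rel") keeps both groups.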

shashank140195 commented 1 year ago

Thank you for the information. It was very useful. However, I am running into another issue: I have provided the path to the BERT model and am getting the error below. I am running the code on Google Colab.

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp1pstljwv/bert_config.json'

It looks like the BERT config and model are being extracted to a temporary path that the code is not able to access.

PS: I am using "PubMedBERT". Can you tell me how and where to change the path to which the code unarchives and extracts the model and BERT config?

foxlf823 commented 1 year ago

I guess the BERT checkpoint you used is not in AllenNLP format. Please refer to the following page: https://github.com/allenai/scibert. As you can see, SciBERT is distributed in several different formats. Download one of the "PyTorch AllenNLP Models" and it should work.

shashank140195 commented 1 year ago

Thank you for the information once again. However, I want to use PubMedBERT, which is available on Hugging Face (https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext). How can I convert its checkpoint to the AllenNLP format?

foxlf823 commented 1 year ago

After you have downloaded BiomedNLP, you will see files like config.json, pytorch_model.bin, vocab.txt, etc. If you have downloaded SciBERT, you will see files like vocab.txt and weights.tar.gz. If you extract weights.tar.gz, you will find bert_config.json and pytorch_model.bin inside. Therefore, it is not hard to build a checkpoint in the AllenNLP format. Follow the steps below:

1) Rename the config.json file in BiomedNLP to bert_config.json.
2) Pack bert_config.json and pytorch_model.bin into a gzipped tar archive named weights.tar.gz.
3) Put vocab.txt and weights.tar.gz into one directory.

All set!
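
The three steps above can be sketched in Python using only the standard library. This is a minimal sketch, not part of the repository: the function name make_allennlp_bert_dir and both directory arguments are hypothetical, and it assumes the downloaded directory contains exactly the files config.json, pytorch_model.bin and vocab.txt mentioned above.

```python
import shutil
import tarfile
from pathlib import Path


def make_allennlp_bert_dir(hf_dir, out_dir):
    """Repackage a Hugging Face BERT download as vocab.txt + weights.tar.gz.

    hf_dir and out_dir are hypothetical paths chosen by the caller.
    """
    hf_dir, out_dir = Path(hf_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Step 3: vocab.txt sits next to the archive, not inside it.
    shutil.copy(str(hf_dir / "vocab.txt"), str(out_dir / "vocab.txt"))

    # Steps 1 and 2: store config.json under the name bert_config.json and
    # bundle it with pytorch_model.bin into a gzipped tarball.
    archive_path = out_dir / "weights.tar.gz"
    with tarfile.open(str(archive_path), "w:gz") as tar:
        tar.add(str(hf_dir / "config.json"), arcname="bert_config.json")
        tar.add(str(hf_dir / "pytorch_model.bin"), arcname="pytorch_model.bin")
    return archive_path
```

Calling make_allennlp_bert_dir on the downloaded directory leaves the original files untouched and writes the two output files into out_dir, which can then be given as the BERT path in the config.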

shashank140195 commented 1 year ago

Thank you, it was really helpful.

In the following screenshot, the ner precision/recall/f1 are for overlapped entities and the rel precision/recall/f1 are for discontinuous entities, but I am not sure about real_ner. Does real_ner combine overlapped + discontinuous + flat entities, or is it just flat?

[Screenshot: evaluation output with ner, rel and real_ner metrics]

foxlf823 commented 1 year ago

The entities in our task may be overlapped or discontinuous, so one entity may consist of more than one text span. The ner precision/recall/f1 measure how well text spans are detected, and the rel precision/recall/f1 measure how well the relations between spans are detected. Both can be considered intermediate results.

After the text spans and their relations are determined, we use them to compose flat/overlapped/discontinuous entities. Therefore, you can consider real_ner_* as the final NER performance, covering overlapped + discontinuous + flat entities.
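
As a rough illustration of composing entities from spans and relations, the sketch below groups spans that are linked by a predicted relation. It is deliberately simplified and is not the repository's decoding procedure: it merges linked spans with a union-find over connected components, so it cannot represent a span shared by two overlapped entities. The function name compose_entities and the (start, end) span encoding are assumptions, and every span named in a relation is assumed to appear in the span list.

```python
def compose_entities(spans, relations):
    """Group spans connected by relations into entities (simplified sketch).

    spans: list of (start, end) tuples; relations: list of (span, span) pairs.
    """
    parent = {span: span for span in spans}

    def find(span):
        # Follow parent links to the group representative, compressing the path.
        while parent[span] != span:
            parent[span] = parent[parent[span]]
            span = parent[span]
        return span

    for left, right in relations:
        parent[find(left)] = find(right)

    groups = {}
    for span in spans:
        groups.setdefault(find(span), []).append(span)
    # A group with one span is a contiguous entity; more spans means discontinuous.
    return sorted(sorted(group) for group in groups.values())


spans = [(0, 1), (4, 5), (7, 7)]
relations = [((0, 1), (4, 5))]
print(compose_entities(spans, relations))
# → [[(0, 1), (4, 5)], [(7, 7)]]
```

Here the first two spans form one discontinuous entity and the third span stays a single-span entity.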

shashank140195 commented 1 year ago

Thank you for all the explanations so far. What I have observed is that SODNER does not create the prediction file in the serialization directory. How can I enable it? I just see the performance metrics JSON file. But what if I want to see the predictions made by the model on the test dataset in some output file? How can we achieve this? I see no argument to turn this on/off. Can you help me with this?

foxlf823 commented 1 year ago

Hi,

The file "my_predictor.py" in the directory "sodner/predictors" can meet your requirement. The command to use it is as follows:

allennlp predict path_of_your_model.tar.gz path_of_your_data.json --include-package sodner --predictor my_predictor --output-file path_of_your_prediction.txt

shashank140195 commented 1 year ago

Hi, when I preprocess my dataset, I do not see any information under the "dep" key. In your preprocessed data, the "dep" key is filled in and carries the "nodes" information. How is this information useful, and why is it not included in my preprocessing? Could you please explain its purpose?

foxlf823 commented 1 year ago

"dep" means dependency and "nodes" means the nodes (tokens) in the dependency tree. Each node corresponds to a token and each node contains a list of key-value pairs. The key is the index of the token and the value is the indices of its adjacent tokens in the dependency tree. You can refer to ie_json.py to see how these data are read.
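
To illustrate the idea of mapping each token index to the indices of its neighbours in the dependency tree, here is a minimal sketch. It assumes the parse is available as a list of head indices (with -1 for the root), which is an assumption about the parser output rather than something stated in this thread, and build_adjacency is a hypothetical helper. The exact JSON layout expected under "dep" should be checked against ie_json.py.

```python
def build_adjacency(heads):
    """Map each token index to the sorted indices of its tree neighbours.

    heads[i] is the index of token i's head, or -1 if token i is the root.
    """
    adjacency = {index: [] for index in range(len(heads))}
    for child, head in enumerate(heads):
        if head >= 0:
            # A dependency edge makes head and child adjacent to each other.
            adjacency[child].append(head)
            adjacency[head].append(child)
    return {index: sorted(neighbours) for index, neighbours in adjacency.items()}


# "She has severe pain": She <- has (root), severe <- pain, pain <- has
print(build_adjacency([1, -1, 3, 1]))
# → {0: [1], 1: [0, 3], 2: [3], 3: [1, 2]}
```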
