dmis-lab / KAZU-NER-module

EMNLP 2022: Biomedical NER for the Enterprise with Distillated BERN2 and the Kazu Framework

Error in How to eval KAZU-NER model #1

Open Kik099 opened 2 months ago

Kik099 commented 2 months ago

I got this error "FileNotFoundError: Unable to find '\Users\kkiko\KAZU-NER-exp\BC5CDR_test\dev.prob_conll' at C:\Users\kkiko\KAZU-NER-exp\BC5CDR_test\prob_conll".

I found that following the steps in the README.md does not create the file dev.prob_conll.

What do I need to do?

wonjininfo commented 2 months ago

Hi @Kik099

Thank you for your interest in our work.

The label2prob.py script can create the file for any split: train.prob_conll, dev.prob_conll, or test.prob_conll.

However, I just realized that I only wrote the sample commands for the test split. The same commands can be applied to all splits.

export DATA_DIR=${HOME}/KAZU-NER-exp/BC5CDR_test # Please use the absolute path to avoid unexpected errors 

If you have set DATA_DIR, you can run the following commands:

ls ${DATA_DIR}

This should display the train.tsv, dev.tsv, and test.tsv files.

Please run the following code for each of the train.tsv, dev.tsv, and test.tsv splits:

export IS_IO="" # Set this if you are using IO tagging.

python label2prob.py --label ${DATA_DIR}/labels.txt --file_path ${DATA_DIR}/test.tsv --output_path ${DATA_DIR}/test.prob_conll ${IS_IO}
python label2prob.py --label ${DATA_DIR}/labels.txt --file_path ${DATA_DIR}/train.tsv --output_path ${DATA_DIR}/train.prob_conll ${IS_IO}
python label2prob.py --label ${DATA_DIR}/labels.txt --file_path ${DATA_DIR}/dev.tsv --output_path ${DATA_DIR}/dev.prob_conll ${IS_IO}
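
The three commands above differ only in the split name, so they can be generated in a loop. The following is a minimal Python sketch, not part of the repository: the helper names build_commands and convert_all are hypothetical, and it assumes label2prob.py sits in the current working directory and accepts exactly the flags shown above. The IS_IO value is passed through unchanged.

```python
import subprocess
import sys
from pathlib import Path

SPLITS = ("train", "dev", "test")


def build_commands(data_dir, is_io=""):
    """Return one label2prob.py command (as an argument list) per split."""
    data_dir = Path(data_dir)
    commands = []
    for split in SPLITS:
        cmd = [
            sys.executable, "label2prob.py",
            "--label", str(data_dir / "labels.txt"),
            "--file_path", str(data_dir / f"{split}.tsv"),
            "--output_path", str(data_dir / f"{split}.prob_conll"),
        ]
        if is_io:  # append the optional IO-tagging argument only when set
            cmd.extend(is_io.split())
        commands.append(cmd)
    return commands


def convert_all(data_dir, is_io=""):
    """Run the conversion for every split, stopping at the first failure."""
    for cmd in build_commands(data_dir, is_io):
        subprocess.run(cmd, check=True)


# Example (assumes DATA_DIR is set as described above):
#   import os
#   convert_all(os.environ["DATA_DIR"], os.environ.get("IS_IO", ""))
```

Building the paths with pathlib instead of string concatenation also avoids hand-written path separators, which may matter on Windows.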

Please let me know if this does not work. I will update the README as well.

Thanks, WonJin

Kik099 commented 2 months ago

Hi @wonjininfo

Your suggestion fixed that error. As you can see, it started running:

Running tokenizer on prediction dataset #0: 100%|████████████████████████████████████████| 1/1 [00:00<00:00, 2.10ba/s]
Running tokenizer on prediction dataset #3: 100%|████████████████████████████████████████| 1/1 [00:00<00:00, 2.61ba/s]
Running tokenizer on prediction dataset #1: 100%|████████████████████████████████████████| 1/1 [00:00<00:00, 2.23ba/s]
Running tokenizer on prediction dataset #2: 100%|████████████████████████████████████████| 1/1 [00:00<00:00, 2.24ba/s]
Running tokenizer on prediction dataset #6: 100%|████████████████████████████████████████| 1/1 [00:00<00:00, 2.57ba/s]
Running tokenizer on prediction dataset #5: 100%|████████████████████████████████████████| 1/1 [00:00<00:00, 2.24ba/s]
Running tokenizer on prediction dataset #4: 100%|████████████████████████████████████████| 1/1 [00:00<00:00, 2.14ba/s]
Running tokenizer on prediction dataset #7: 100%|████████████████████████████████████████| 1/1 [00:00<00:00, 2.33ba/s]

But a few minutes later an error appeared. I put the output in the attached file so that it is easy to read.

Do you know what I need to do?

error_file.json

And thanks for your help

wonjininfo commented 2 months ago

Hi @Kik099

Thank you for sharing the log file.

I found it a bit challenging to trace the issue at the moment, but I suspect it may be related to the number of labels. Let's start with the simplest approach. If you’re only trying to evaluate the model, and not training it, we can copy the test.prob_conll file and use it to create train.prob_conll and dev.prob_conll. This will give us three identical files with different names. Please try running the process again with these files.
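
For reference, the copy step described above could look like the following minimal Python sketch. The function name clone_test_split is hypothetical, and it assumes test.prob_conll already exists in the data directory; any existing train.prob_conll or dev.prob_conll would be overwritten.

```python
import shutil
from pathlib import Path


def clone_test_split(data_dir):
    """Copy test.prob_conll to train.prob_conll and dev.prob_conll."""
    data_dir = Path(data_dir)
    source = data_dir / "test.prob_conll"
    if not source.is_file():
        raise FileNotFoundError(f"Expected {source} to exist before copying")
    created = []
    for split in ("train", "dev"):
        target = data_dir / f"{split}.prob_conll"
        shutil.copyfile(source, target)  # overwrites an existing target
        created.append(target)
    return created
```

This is only meant for evaluation-only runs, since the three files end up identical.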

Additionally, could you please share the exact command line you used in the shell and the version of transformers you're working with? This will help me to replicate and trace the error.
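
One way to collect the version information is a short script like the sketch below. The package list and the function name environment_report are illustrative choices, not something the repository provides; on Python 3.7 it falls back to the importlib-metadata backport.

```python
import platform

try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:  # Python 3.7 needs the importlib-metadata backport
    from importlib_metadata import PackageNotFoundError, version

PACKAGES = ("torch", "transformers", "tokenizers", "datasets", "seqeval")


def environment_report(packages=PACKAGES):
    """Return one line for the interpreter/OS plus one line per package."""
    lines = [f"python {platform.python_version()} on {platform.system()}"]
    for name in packages:
        try:
            lines.append(f"{name} {version(name)}")
        except PackageNotFoundError:
            lines.append(f"{name} (not installed)")
    return lines


# Example: print("\n".join(environment_report()))
```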

If you’re uncomfortable sharing this information here (as this is a public space), feel free to email me at wonjin.info (at) gmail.com

Kik099 commented 2 months ago

Hi @wonjininfo,

I followed your instructions but encountered the same error, even with three identical files under different names.

Regarding the libraries I have installed, here is the list:

datasets 1.18.3 pypi_0 pypi
torch 1.8.1 pypi_0 pypi
transformers 4.30.2 pypi_0 pypi
seqeval 1.2.2 pypi_0 pypi
accelerate 0.20.3 pypi_0 pypi
aiohttp 3.8.6 pypi_0 pypi
aiosignal 1.3.1 pypi_0 pypi
async-timeout 4.0.3 pypi_0 pypi
asynctest 0.13.0 pypi_0 pypi
attrs 24.2.0 pypi_0 pypi
ca-certificates 2024.7.2 haa95532_0
certifi 2024.7.4 pypi_0 pypi
charset-normalizer 3.3.2 pypi_0 pypi
colorama 0.4.6 pypi_0 pypi
dill 0.3.7 pypi_0 pypi
filelock 3.12.2 pypi_0 pypi
frozenlist 1.3.3 pypi_0 pypi
fsspec 2023.1.0 pypi_0 pypi
huggingface-hub 0.16.4 pypi_0 pypi
idna 3.7 pypi_0 pypi
importlib-metadata 6.7.0 pypi_0 pypi
joblib 1.3.2 pypi_0 pypi
multidict 6.0.5 pypi_0 pypi
multiprocess 0.70.15 pypi_0 pypi
numpy 1.21.6 pypi_0 pypi
openssl 1.1.1w h2bbff1b_0
packaging 24.0 pypi_0 pypi
pandas 1.3.5 pypi_0 pypi
pip 22.3.1 py37haa95532_0
psutil 6.0.0 pypi_0 pypi
pyarrow 12.0.1 pypi_0 pypi
python 3.7.13 h6244533_1
python-dateutil 2.9.0.post0 pypi_0 pypi
pytz 2024.1 pypi_0 pypi
pyyaml 6.0.1 pypi_0 pypi
regex 2024.4.16 pypi_0 pypi
requests 2.31.0 pypi_0 pypi
safetensors 0.4.4 pypi_0 pypi
scikit-learn 1.0.2 pypi_0 pypi
scipy 1.7.3 pypi_0 pypi
setuptools 65.6.3 py37haa95532_0
six 1.16.0 pypi_0 pypi
sqlite 3.45.3 h2bbff1b_0
threadpoolctl 3.1.0 pypi_0 pypi
tokenizers 0.13.3 pypi_0 pypi
tqdm 4.66.5 pypi_0 pypi
typing-extensions 4.7.1 pypi_0 pypi
urllib3 2.0.7 pypi_0 pypi
vc 14.40 h2eaa2aa_0
vs2015_runtime 14.40.33807 h98bb1dd_0
wget 3.2 pypi_0 pypi
wheel 0.38.4 py37haa95532_0
wincertstore 0.2 py37haa95532_2
xxhash 3.5.0 pypi_0 pypi
yarl 1.9.4 pypi_0 pypi
zipp 3.15.0 pypi_0 pypi

When I attempted to install the following libraries:

torch==1.8.2
transformers==4.9.2
datasets==1.18.3
seqeval>=1.2.2

I encountered an error indicating that the installation of transformers==4.9.2 and datasets==1.18.3 led to a conflict.

Here’s the error message:

pip install torch==1.8.2 transformers==4.9.2 datasets==1.18.3 seqeval>=1.2.2
ERROR: Could not find a version that satisfies the requirement torch==1.8.2 (from versions: 1.7.0, 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2, 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1)
ERROR: No matching distribution found for torch==1.8.2

To resolve this, I installed torch==1.8.1 instead. However, this led to another error:

pip install torch==1.8.1 transformers==4.9.2 datasets==1.18.3 seqeval>=1.2.2
ERROR: Cannot install datasets==1.18.3 and transformers==4.9.2 because these package versions have conflicting dependencies.
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts

Please let me know how I can resolve this issue.

Best regards, Rodrigo Saraiva

wonjininfo commented 2 months ago

Thanks for sharing. I will work on that and get back to you soon.

Kik099 commented 2 months ago

I just have another question, i already i trained the model. How do i run it know? I am developing a thesis and i will talk about your article there.

I have the following files that are in the dir "_tmp\output\MultiLabelNER-test":

all_results.json
config.json
multi_label_seq_eval (folder)
pytorch_model.bin
special_tokens_map.json
tokenizer.json
tokenizer_config.json
trainer_state.json
training_args.bin
train_results.json
vocab.txt

wonjininfo commented 2 months ago

Hi @Kik099 ,

> i already i trained the model.

Does this mean that the previous error is no longer occurring?

Regarding your other question:

> How do i run it know?

Could you please clarify what you mean? I noticed there might be some typos, so I want to make sure I understand your question correctly.

Kik099 commented 2 months ago

Hi @wonjininfo

> Does this mean that the previous error is no longer occurring?

Unfortunately, the error is still appearing during the evaluation phase.

> Could you please clarify what you mean?

Additionally, after training the model, I would like to use it for token classification. How can I input text to the model to obtain token classifications?

Kik099 commented 2 months ago

Hi @wonjininfo, did you understand what I explained?

Best,

Rodrigo Saraiva

wonjininfo commented 2 months ago

Hi Rodrigo,

I spent a few hours resolving the dependency issues and identified some points that needed updating because they no longer worked. I have updated them in the README.


Recommended Solution: Python v3.7.13 is suggested for compatibility.

Install the dependencies with: pip install transformers==4.10.3 datasets==1.18.3 seqeval==1.2.2


If you use a higher version of Python, a different version of tokenizers is needed.

If you encounter the error "error: can't find Rust compiler" while installing transformers, this may be related to your Python version. As noted in this comment, older versions of tokenizers are not compatible with newer versions of Python.


Alternative Solution (Not Recommended): Alternatively, you can use Python v3.10.12 and install the libraries using the following command:

# Tested torch version: torch==2.1.0 CUDA 12.1
#pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121

pip install transformers==4.16.2 tokenizers==0.12 datasets==1.18.3 seqeval==1.2.2

After that, I followed my README and, for testing purposes, copied the test file to the development file using cp ${DATA_DIR}/test.prob_conll ${DATA_DIR}/dev.prob_conll. This let me run the code without any errors, so unfortunately I couldn't reproduce your issue. I tested it on Ubuntu 22.04.4 LTS with Python 3.10.12. The problem might be due to differences in the OS, but I'm not entirely sure.

wonjininfo commented 2 months ago

> How can I input text to the model to obtain token classifications?

To input text into the model for token classification, you need to preprocess your text to match the format of the test.prob_conll file.

  1. Preprocess Your Text:

    • Ensure your text has the same format as the test.prob_conll file.
    • Use this file as a replacement for --test_file ${DATA_DIR}/test.prob_conll.
  2. Specify Your Model:

    • If you are not using our pre-trained model ("dmis-lab/KAZU-NER-module-distil-v1.0"), set the location of your model using the $BERT_MODEL variable:
      --model_name_or_path $BERT_MODEL
  3. Update Labels if Needed:

    • If your model is trained on a different set of entities, make sure to update ${DATA_DIR}/labels.txt accordingly.
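
As a sanity check for step 3, the labels in labels.txt can be compared with the labels stored in the model folder. The sketch below rests on two assumptions that this thread does not confirm: that labels.txt holds one label per line, and that the model's config.json contains the usual Hugging Face id2label mapping. The function names are hypothetical.

```python
import json
from pathlib import Path


def read_labels(labels_path):
    """Read labels.txt, assuming one label per non-empty line."""
    text = Path(labels_path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def compare_labels(labels_path, model_dir):
    """Return (only_in_labels_txt, only_in_model_config) as sorted lists."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # id2label maps class indices to label names in Hugging Face configs;
    # an absent key is treated as an empty label set (assumption).
    model_labels = set(config.get("id2label", {}).values())
    file_labels = set(read_labels(labels_path))
    return sorted(file_labels - model_labels), sorted(model_labels - file_labels)
```

Two empty lists would mean both sources name the same labels; anything else points to a label mismatch worth checking before evaluation.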

Kik099 commented 2 months ago

Hi @wonjininfo

Thank you so much for your reply—I really appreciate it.

To summarize, it seems that there was a demo website for testing KAZU (http://kazu.korea.ac.kr/), which is currently not working. On that website, we could input a simple phrase and obtain the token classifications.

To achieve this in the KAZU project (https://github.com/AstraZeneca/KAZU), we can run the following code:

def kazu_test(cfg):
    pipeline: Pipeline = instantiate(cfg.Pipeline)
    text = "EGFR mutations are often implicated in lung cancer"
    doc = Document.create_simple_document(text)
    pipeline([doc])
    print(f"{doc.get_entities()}")

Running this script produces the token classifications for the words in the text.

Are you saying that to achieve this in this project, we need to convert the phrase to the test.prob_conll format?

If so, which code should I run—the evaluation code?

wonjininfo commented 2 months ago

The demo website is currently managed by my former colleagues, who are postgraduate students at Korea University. I have asked them to reboot the server.

Regarding the KAZU project, that’s a good point—I was primarily focused on this repository. You can certainly use the KAZU repository (https://github.com/AstraZeneca/KAZU), but please note that KAZU is designed for industrial use. It includes additional matching algorithms using ontologies and various features, including preprocessing (from plain text to final output in JSON format). Some of these features might not be suitable for other domains (i.e. non-biomedical/clinical domain), and removing them could be challenging due to the large codebase.

In contrast, this repository focuses exclusively on the core module, emphasizing the neural model aspect of NER (without linking). It is more academically oriented and does not offer end-to-end processing from plain text to final output, so users will need to manage preprocessing and post-processing themselves.

Our label2prob.py script provides a conversion from CoNLL format to our prob_conll format. Once you have your data in CoNLL format, you can use this script to convert it into the required input format. For converting plain text to CoNLL format, other researchers might have shared scripts online, but we did not include such scripts here as our full pipeline is intended to be used with the KAZU repository.

Kik099 commented 2 months ago

Hi @wonjininfo

So you are saying that I cannot use this model to predict on plain text directly, and that I first need to convert the plain text to the .prob_conll format?

If this is the case, how can I run prediction on that plain text? Do I need to set all token values to 0.0?

Or did I misunderstand?

wonjininfo commented 2 months ago

You can use this model with the training and evaluation codes to predict any text, but you'll need to write or find some preprocessing code to convert plain text into a CoNLL-like format.

So, the short answer is no—you can't use it as-is. You'll need to write a few dozen lines of code to get it working. I haven't used these myself, but you might find these resources useful: spacy-conll or this Stack Overflow answer. Still, a few tweaks are required.
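
To illustrate the kind of preprocessing meant here, the following is a minimal Python sketch that turns sentences into a CoNLL-like layout. Several points are assumptions rather than facts about this repository: the layout "token, tab, label" with a blank line between sentences, the placeholder label "O" for unlabeled text, and the simple regex tokenizer, which is a stand-in and not the tokenization used for the biomedical corpora. The function name text_to_conll is hypothetical.

```python
import re

# Words (letters, digits, underscore) or single punctuation characters.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def text_to_conll(sentences, placeholder="O"):
    """Render sentences as 'token<TAB>label' lines with blank-line separators."""
    blocks = []
    for sentence in sentences:
        tokens = TOKEN_PATTERN.findall(sentence)
        if tokens:  # skip sentences that contain no tokens at all
            blocks.append("\n".join(f"{tok}\t{placeholder}" for tok in tokens))
    return "\n\n".join(blocks) + "\n"


# Example:
#   text_to_conll(["EGFR mutations are often implicated in lung cancer."])
```

The resulting text could be saved as a .tsv file and passed to label2prob.py as shown earlier in the thread; whether the placeholder labels are acceptable to the evaluation code would still need to be checked.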

Kik099 commented 1 week ago

Hi @wonjininfo

I have trained the model. Can I now run the evaluation on the trained model? If yes, how? Can I set the BERT model to the folder '_tmp/output/MultiLabelNER-test'?

Will this be enough? I did that and this error appeared: errorEval.json