Quickstart instruction issues

DanielSchauerTakeda commented 1 year ago

I was reading through the Quickstart documentation's Processing Your First Document section, but I ran into an issue.

Setup steps, run from the Windows command prompt:

cd C:\nlp-spacy-prodigy
mkdir kazu
cd C:\nlp-spacy-prodigy\kazu
conda create --name kazu python=3.8 pip
conda activate kazu
git clone https://github.com/AstraZeneca/KAZU.git
cd KAZU
pip install jupyterlab
pip install -U jupyter
pip install -e .[all]
jupyter-lab

Running a modified version of the sample script (modified because I don't have admin rights to set an environment variable) does not work:

from hydra import initialize_config_dir, compose
from hydra.utils import instantiate
from kazu.data.data import Document
from kazu.pipeline import Pipeline
from pathlib import Path
import os

# the hydra config is kept in the model pack; the path is hard-coded here
# because the environment variable could not be set. A raw string keeps
# "\n" in the path from being read as a newline.
cdir = Path(r"C:\nlp-spacy-prodigy\kazu\KAZU\kazu_model_pack_public-v0.0.16").joinpath('conf')
with initialize_config_dir(config_dir=str(cdir)):
    cfg = compose(
        config_name="config",
        overrides=[],
    )
    pipeline: Pipeline = instantiate(cfg.Pipeline)
    text = "EGFR mutations are often implicated in lung cancer"
    doc = Document.create_simple_document(text)
    pipeline([doc])
    print(f"{doc.sections[0].text}")

Results from that script:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[1], line 4
      2 from hydra.utils import instantiate
      3 from kazu.data.data import Document
----> 4 from kazu.pipeline import Pipeline
      5 from pathlib import Path
      6 import os

File C:\nlp-spacy-prodigy\kazu\KAZU\kazu\pipeline\__init__.py:1
----> 1 from kazu.pipeline.pipeline import (
      2     load_steps_and_log_memory_usage,
      3     Pipeline,
      4     FailedDocsHandler,
      5     FailedDocsFileHandler,
      6     FailedDocsLogHandler,
      7 )

File C:\nlp-spacy-prodigy\kazu\KAZU\kazu\pipeline\pipeline.py:13
     10 from omegaconf import DictConfig
     12 from kazu.data.data import Document, PROCESSING_EXCEPTION
---> 13 from kazu.steps import Step
     14 from datetime import datetime
     16 logger = logging.getLogger(__name__)

File C:\nlp-spacy-prodigy\kazu\KAZU\kazu\steps\__init__.py:3
      1 from kazu.steps.step import Step, document_iterating_step, document_batch_step
      2 from kazu.steps.document_post_processing.abbreviation_finder import AbbreviationFinderStep
----> 3 from kazu.steps.joint_ner_and_linking.explosion import ExplosionStringMatchingStep
      4 from kazu.steps.linking.dictionary import DictionaryEntityLinkingStep
      5 from kazu.steps.linking.sapbert import SapBertForEntityLinkingStep

File C:\nlp-spacy-prodigy\kazu\KAZU\kazu\steps\joint_ner_and_linking\explosion.py:13
      5 from kazu.data.data import (
      6     CharSpan,
      7     Document,
   (...)
     10     SynonymTermWithMetrics,
     11 )
     12 from kazu.modelling.database.in_memory_db import SynonymDatabase
---> 13 from kazu.modelling.ontology_matching.ontology_matcher import OntologyMatcher
     14 from kazu.steps import Step, document_batch_step
     15 from kazu.utils.utils import PathLike

File C:\nlp-spacy-prodigy\kazu\KAZU\kazu\modelling\ontology_matching\ontology_matcher.py:15
     12 import srsly
     14 from kazu.data.data import SynonymTerm
---> 15 from kazu.modelling.ontology_preprocessing.base import OntologyParser
     16 from kazu.utils.grouping import sort_then_group
     17 from kazu.utils.utils import PathLike

File C:\nlp-spacy-prodigy\kazu\KAZU\kazu\modelling\ontology_preprocessing\base.py:7
      5 import sqlite3
      6 from abc import ABC
----> 7 from functools import cache
      8 from pathlib import Path
      9 from typing import List, Tuple, Dict, Any, Iterable, Set, Optional, FrozenSet, Union

ImportError: cannot import name 'cache' from 'functools' (C:\Users\ytt2404\.conda\envs\kazu\lib\functools.py)
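
On the environment-variable point above: a variable can be set for the current Python process only, which does not touch system settings and so should not need admin rights. A minimal sketch, assuming the quickstart reads the model pack location from KAZU_MODEL_PACK; the path is the one used in this report and is only illustrative:

```python
import os
from pathlib import Path

# Illustrative location; point this at wherever the model pack was unzipped.
model_pack = Path("C:/nlp-spacy-prodigy/kazu/KAZU/kazu_model_pack_public-v0.0.16")

# This only changes the environment of the running Python process
# (and any child processes it starts), not the system-wide settings.
os.environ["KAZU_MODEL_PACK"] = str(model_pack)

# The quickstart-style lookup now works unchanged.
cdir = Path(os.environ["KAZU_MODEL_PACK"]).joinpath("conf")
print(cdir.name)
```

Forward slashes are used in the path so that no backslash escapes need handling.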

wonjininfo commented 1 year ago

Dear Daniel Schauer,

Thank you very much for expressing interest in the KAZU framework!!

functools.cache was added in Python 3.9 (https://docs.python.org/3.9/library/functools.html#functools.cache), so importing it raises an ImportError on any earlier version, including the Python 3.8 environment created above.

Could you try again with a fresh environment that uses Python 3.9 or newer?

I apologize for the missing python version information in the Readme. I will promptly request an update to include this information. I will also discuss whether we can use alternative functions that are compatible with lower versions.
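
For reference, one possible shape of such a compatible alternative is sketched below. It is not what KAZU does (the project ended up requiring Python 3.9); it only relies on the documented fact that functools.cache behaves like lru_cache(maxsize=None). The normalise function is a hypothetical helper used purely for illustration:

```python
import sys
from functools import lru_cache

# functools.cache exists only on Python 3.9+; on older interpreters,
# lru_cache(maxsize=None) gives the same unbounded memoisation.
if sys.version_info >= (3, 9):
    from functools import cache
else:
    cache = lru_cache(maxsize=None)


@cache
def normalise(term: str) -> str:
    # Hypothetical helper, here only to show the decorator in use.
    return term.strip().lower()


print(normalise("  EGFR "))
```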

Best regards, WonJin

DanielSchauerTakeda commented 1 year ago

@wonjininfo thanks for the reply and for following up. I tried again, this time specifying python=3.9 when creating my Anaconda environment, but I got an error because there is no OS environment variable named JAVA_HOME on my computer.

Since I'm working in an enterprise environment, I will need to reach out to my IT desk to get that set up.

I think the quickstart installation instructions for Kazu should call out the need for this OS environment variable or, if the underlying requirement is that Kazu expects a Java SDK to be installed, say that instead.
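
A small pre-flight check along these lines would surface both requirements found so far before any Kazu import runs. This is only a sketch: preflight_problems is a hypothetical helper, not part of Kazu, and it assumes the only prerequisites are the two seen in this thread (Python 3.9+ and JAVA_HOME):

```python
import os
import sys
from typing import List, Mapping, Tuple


def preflight_problems(environ: Mapping[str, str], version: Tuple[int, int]) -> List[str]:
    """Return human-readable descriptions of any unmet prerequisites."""
    problems = []
    if version < (3, 9):
        # functools.cache, which Kazu imports, needs Python 3.9+
        problems.append("Python 3.9 or newer is required, found %d.%d" % version)
    if not environ.get("JAVA_HOME"):
        # Assumption: the pipeline being run needs a Java installation
        problems.append("JAVA_HOME is not set")
    return problems


for problem in preflight_problems(os.environ, sys.version_info[:2]):
    print(problem)
```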

I can run code like this:

from hydra import initialize_config_dir, compose
from hydra.utils import instantiate
from kazu.data.data import Document
from pathlib import Path

cdir = Path("C:/nlp-spacy-prodigy/KAZU/kazu_model_pack_public-v0.0.16").joinpath('conf')
with initialize_config_dir(config_dir=str(cdir)):
    cfg = compose(
        config_name="config",
        overrides=[],
    )
    text = "EGFR mutations are often implicated in lung cancer. Epidermal Growth Factor Receptor (EGFR) is a gene."
    doc = Document.create_simple_document(text)
    print(f"{doc.sections[0].text}")
    #>>EGFR mutations are often implicated in lung cancer. Epidermal Growth Factor Receptor (EGFR) is a gene.

but any code that instantiates a Kazu pipeline throws the JAVA_HOME error mentioned above:

from hydra import initialize_config_dir, compose
from hydra.utils import instantiate
from kazu.data.data import Document
from kazu.pipeline import Pipeline
from pathlib import Path

# the hydra config is kept in the model pack
# get the model pack from kazu's release page https://github.com/astrazeneca/kazu/releases, then unzip to the working folder
cdir = Path("C:/nlp-spacy-prodigy/KAZU/kazu_model_pack_public-v0.0.16").joinpath('conf')
print(cdir)
with initialize_config_dir(config_dir=str(cdir)):
    cfg = compose(
        config_name="config",
        overrides=[],
    )
    pipeline: Pipeline = instantiate(cfg.Pipeline)
    text = "EGFR mutations are often implicated in lung cancer"
    doc = Document.create_simple_document(text)
    pipeline([doc])
    print(f"{doc.sections[0].text}")

EFord36 commented 1 year ago

Hi Daniel,

Thanks for your patience here - yes, we should be calling out that the default pipeline expects a Java SDK installation. Sorry for that, and we'll work on the best way to do that.

Actually, only a single 'step' in the pipeline depends on Java: the 'SethStep', which recognises gene mutations. So removing this step from the pipeline should let you try out the rest of Kazu. Inserting the line del cfg.Pipeline.steps[5] after the config is created but before the pipeline is loaded will do this:

from hydra import initialize_config_dir, compose
from hydra.utils import instantiate

from kazu.data.data import Document
from kazu.pipeline import Pipeline
from pathlib import Path
import os

# the hydra config is kept in the model pack
cdir = Path(os.environ["KAZU_MODEL_PACK"]).joinpath('conf')  
with initialize_config_dir(config_dir=str(cdir)):
    cfg = compose(
        config_name="config",
        overrides=[],
    )
    ### NEW KEY LINE ###
    del cfg.Pipeline.steps[5]
    pipeline: Pipeline = instantiate(cfg.Pipeline)
    text = "EGFR mutations are often implicated in lung cancer"
    doc = Document.create_simple_document(text)
    pipeline([doc])
    print(f"{doc.get_entities()}")

Try running this instead of the code in the quickstart - I've tried it (with JAVA_HOME not set) and it works for me. I've also checked there weren't any other KAZU-related environment variables when running it, and checked the environment variables we use elsewhere in the config for KAZU.
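
Because the index 5 depends on the step order in a particular model pack version, a name-based filter may be less brittle. The sketch below is an assumption-heavy illustration: it assumes each step config carries a Hydra-style `_target_` string naming its class, uses plain dicts as a stand-in for cfg.Pipeline.steps, and has not been run against a real model pack. steps_without and the `example.module.SethStep` target are hypothetical.

```python
from typing import Any, List, Mapping, Sequence


def steps_without(steps: Sequence[Mapping[str, Any]], fragment: str) -> List[Mapping[str, Any]]:
    """Keep only the step configs whose `_target_` does not contain `fragment`."""
    return [step for step in steps if fragment not in str(step.get("_target_", ""))]


# Hypothetical stand-in for cfg.Pipeline.steps; real targets may differ.
example_steps = [
    {"_target_": "kazu.steps.linking.dictionary.DictionaryEntityLinkingStep"},
    {"_target_": "example.module.SethStep"},
    {"_target_": "kazu.steps.linking.sapbert.SapBertForEntityLinkingStep"},
]

kept = steps_without(example_steps, "SethStep")
print([step["_target_"].rsplit(".", 1)[-1] for step in kept])
```

If the assumption holds, the filtered list would be assigned back to cfg.Pipeline.steps before calling instantiate, in place of the del line.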

EFord36 commented 1 year ago

Hi Daniel,

How are you getting on? Did the workaround above work for you?

We've also just released v0.0.24, which removes the 'SethStep' from the default pipeline, so you will now be able to use it without having JAVA_HOME set.

We've also added the requirement of python 3.9 in the installation instructions as well as the project metadata in pyproject.toml (which shows up in the ‘Meta’ section on pypi.org). Thanks again for letting us know about these issues and sorry for the pain you suffered with them.

EFord36 commented 1 year ago

Closing this issue because we've updated Kazu to address the two key problems:

Python 3.9 or higher was required, and this wasn't mentioned before. We now have this in the README, the 'quickstart' page of the docs, and the PyPI metadata for KAZU.
A JAVA_HOME was previously required to run the default KAZU pipeline, but this wasn't documented. We have now removed the step that imposed this requirement from the default pipeline, so it is no longer necessary.

Please do re-open this issue or open a new one if you run into other problems!

AstraZeneca / KAZU

Quickstart instruction issues #1