dotimplement / HealthChain

Simplify testing and validating AI and NLP applications in a healthcare context 💫 🏥
https://dotimplement.github.io/HealthChain/
Apache License 2.0

Add pipeline framework #61

Closed jenniferjiangkells closed 1 month ago

jenniferjiangkells commented 2 months ago

Description

Addresses #54

Introduces the concept of Pipeline, Component, and DataContainers.

These are the building blocks of the pipelining component of HealthChain. We can give users three levels of control:

  1. Build your own pipeline using inline functions - this is the easiest and most flexible level, good for quick experiments
  2. Build your own pipeline using Component classes - this adds an extra layer of abstraction and is especially useful for wrapping specific models such as MedCAT, ClinicalBERT, or LLMs.
  3. Use prebuilt pipelines, e.g. MedicalCodingPipeline - a prebuilt pipeline is a pre-configured set of components for a specific use case and has the highest level of abstraction. This is the easiest way to get up and running with something functional.
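Whatever the level, the unifying idea is that every component is "a callable that takes a Document and returns a Document". Here's a minimal self-contained sketch of that idea (toy code with made-up names, not the HealthChain API):

```python
from dataclasses import dataclass, field

# Toy stand-in for healthchain.io.containers.Document
@dataclass
class Doc:
    text: str
    tokens: list = field(default_factory=list)

# Level 1: an inline function is already a valid component
def tokenize(doc: Doc) -> Doc:
    doc.tokens = doc.text.split()
    return doc

# Level 2: a Component class hides state (e.g. a loaded model) behind __call__
class Lowercaser:
    def __call__(self, doc: Doc) -> Doc:
        doc.tokens = [t.lower() for t in doc.tokens]
        return doc

# Level 3: a "prebuilt pipeline" is just a pre-configured sequence of the above
def build_pipeline(components):
    def run(doc: Doc) -> Doc:
        for component in components:
            doc = component(doc)
        return doc
    return run

nlp = build_pipeline([tokenize, Lowercaser()])
result = nlp(Doc("Myocardial Infarction confirmed"))
print(result.tokens)  # ['myocardial', 'infarction', 'confirmed']
```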

Extra - loading pipeline integrations from other libraries such as spaCy, Hugging Face, etc. #27
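One way those integrations could hook in is a thin adapter that calls out to the external library and copies results back onto the container. A rough sketch of the shape (the "external model" here is a dummy callable standing in for a spaCy or Hugging Face pipeline; all names are hypothetical):

```python
class ExternalModelComponent:
    """Adapter: wraps any third-party callable so it fits the
    Document-in / Document-out contract of the pipeline."""

    def __init__(self, external_model, attr: str):
        self.external_model = external_model
        self.attr = attr  # which Document attribute to write results to

    def __call__(self, doc):
        setattr(doc, self.attr, self.external_model(doc.text))
        return doc

# Dummy "third-party model": pretends to be e.g. a spaCy NER pipeline
def fake_ner(text: str) -> list:
    return [word for word in text.split() if word[0].isupper()]

class SimpleDoc:
    def __init__(self, text):
        self.text = text

doc = ExternalModelComponent(fake_ner, "entities")(SimpleDoc("Aspirin lowers Fever"))
print(doc.entities)  # ['Aspirin', 'Fever']
```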

Went a little over-scope and additionally implemented TextPreprocessor, Model, TextPostProcessor, and MedicalCodingPipeline, which are concrete implementations of Component and Pipeline. It helps to see what I want the downstream usage to look like.

It's probably best to introduce new concepts through examples, so here's a code snippet:

from healthchain.io.containers import Document
from healthchain.pipeline import Pipeline
from healthchain.pipeline.components import Model
from healthchain.pipeline.components import TextPostProcessor
from healthchain.pipeline.components import TextPreprocessor
from healthchain.pipeline import MedicalCodingPipeline

################################################################################
# 1. Build your own pipeline, using inline functions
################################################################################

# initialise the pipeline with the data type you want to process
nlp_pipeline = Pipeline[Document]()

@nlp_pipeline.add(stage="preprocessing")
def tokenize(doc: Document) -> Document:
    doc.tokens = doc.text.split()
    return doc

@nlp_pipeline.add(stage="preprocessing", dependencies=["tokenize"])
def pos_tag(doc: Document) -> Document:
    # Dummy POS tagging
    doc.pos_tags = ["NOUN" if token[0].isupper() else "VERB" for token in doc.tokens]
    return doc

@nlp_pipeline.add(dependencies=["tokenize", "pos_tag"])
def ner(doc: Document) -> Document:
    # Dummy NER
    doc.entities = [
        token for token, pos in zip(doc.tokens, doc.pos_tags) if pos == "NOUN"
    ]
    return doc

print("Initial pipeline:")
print(nlp_pipeline)
print(nlp_pipeline.stages)

@nlp_pipeline.add(position="after", reference="tokenize")
def remove_stopwords(doc: Document) -> Document:
    stopwords = {"the", "a", "an", "in", "on", "at"}
    doc.tokens = [token for token in doc.tokens if token not in stopwords]
    return doc

print("After adding remove_stopwords:")
print(nlp_pipeline)
print(nlp_pipeline.stages)

# Remove method
def new_tokenizer(doc: Document) -> Document:
    doc.tokens = doc.text.split() + ["<EOS>"]  # Add end-of-sentence token
    return doc

nlp_pipeline.remove("tokenize")
nlp_pipeline.add(new_tokenizer, name="tokenize", position="first")

# Replace method
def advanced_ner(doc: Document) -> Document:
    # More sophisticated NER logic
    doc.entities = [
        token for token in doc.tokens if token[0].isupper() and len(token) > 1
    ]
    return doc

nlp_pipeline.replace("ner", advanced_ner)

print("After replacing ner:")
print(nlp_pipeline)
print(nlp_pipeline.stages)

# Usage
# NLP pipeline
nlp = nlp_pipeline.build()

doc = Document("OpenAI released GPT-4 in 2023.")

result = nlp(doc)
print(f"Char count: {doc.char_count()}")
print(f"Word count: {doc.word_count()}")
print(f"Tokens: {result.tokens}")
print(f"POS Tags: {result.pos_tags}")
print(f"NER: {result.get_entities()}")

preprocessing_components = nlp_pipeline._stages.get("preprocessing", [])
print(f"Preprocessing components: {[c.__name__ for c in preprocessing_components]}")

################################################################################
# 2. Build your own pipeline, using Component classes (or mix and match)
################################################################################
component_pipeline = Pipeline[Document]()

component_pipeline.add(TextPreprocessor())
component_pipeline.add(Model(model_path="path/to/model"))
component_pipeline.add(TextPostProcessor())
component_pipeline.add(remove_stopwords, position="last")

# Alternatively, this is how you would configure it. Not sure about adding an
# extra config object - seems a bit clunky, might remove.
# postprocessor_config = TextPostProcessorConfig(
#     postcoordination_lookup={
#         "heart attack": "myocardial infarction",
#         "high blood pressure": "hypertension"
#     }
# )
# component_pipeline.add(TextPostProcessor(postprocessor_config))

components = component_pipeline.build()
result = components(doc)

print(component_pipeline)
print(component_pipeline.stages)
print(f"Tokens: {result.tokens}")
print(f"POS Tags: {result.pos_tags}")
print(f"NER: {result.entities}")

################################################################################
# 3. Use prebuilt pipelines e.g. MedicalCodingPipeline
################################################################################
pipeline = MedicalCodingPipeline.load("./path/to/model")

coding_pipeline = pipeline.build()
result = coding_pipeline(doc)

print(pipeline)
print(pipeline.stages)
print(f"Processed Text: {result.text}")
print(f"Tokens: {result.tokens}")
print(f"Entities: {result.entities}")
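As an aside on the mechanics: the `dependencies=[...]` declarations above imply an ordering step when the pipeline is built. A standard way to resolve that is a topological sort; here's a sketch of the idea using the stdlib `graphlib` (illustration only, not the merged implementation):

```python
from graphlib import TopologicalSorter

# Each entry maps a component name to the names it depends on,
# mirroring the dependencies=[...] declarations in the snippet.
dependencies = {
    "tokenize": [],
    "remove_stopwords": ["tokenize"],
    "pos_tag": ["tokenize"],
    "ner": ["tokenize", "pos_tag"],
}

# static_order() yields each component only after all of its dependencies
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # e.g. ['tokenize', 'remove_stopwords', 'pos_tag', 'ner']
```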
jenniferjiangkells commented 1 month ago

@adamkells dw im going to explain everything ☝️

jenniferjiangkells commented 1 month ago

@adamkells addressed all your comments, don't think you need to go over the code but feel free to check the documentation as I made quite a lot of changes there

jenniferjiangkells commented 1 month ago

@adamkells my man can i just merge this

adamkells commented 1 month ago

Oh go on then