Implement MedicalCodingPipeline and SummarizationPipeline
Related Issue
55
Changes Made
I come, once again, bearing breaking changes.
💥 Changes to `Document` container class: ordered by sub-containers `nlp`, `concepts`, `hl7`, `cds`, `models` for better organisation. Each attribute is responsible for a specific kind of data, usually exposed via getter and setter functions.
Changed `.add_huggingface_output()` etc. to `.add_output(integration_name, task, output)`, which is easier to access and manage.
Added a `models.get_generated_text()` method.
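A minimal sketch of what this container layout might look like (the field types, the `ModelOutputs` helper name, and the `get_generated_text()` signature below are illustrative assumptions, not the library's actual definitions):

```python
# Illustrative sketch only - not the actual HealthChain classes.
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ModelOutputs:
    """Raw model outputs keyed by (integration_name, task)."""
    outputs: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def add_output(self, integration_name: str, task: str, output: Any) -> None:
        self.outputs.setdefault(integration_name, {})[task] = output

    def get_output(self, integration_name: str, task: str) -> Any:
        return self.outputs.get(integration_name, {}).get(task)

    def get_generated_text(self, integration_name: str, task: str) -> Optional[str]:
        # For LLM-style outputs, return just the generated text field.
        output = self.get_output(integration_name, task)
        if isinstance(output, dict):
            return output.get("generated_text")
        return output


@dataclass
class Document:
    text: str
    nlp: Any = None        # tokens, spaCy docs, embeddings, ...
    concepts: Any = None   # problems, medications, allergies
    hl7: Any = None        # CDA / FHIR data
    cds: Any = None        # CDS cards and actions
    models: ModelOutputs = field(default_factory=ModelOutputs)


doc = Document(text="Patient presents with hypertension.")
doc.models.add_output("huggingface", "summarization", {"generated_text": "HTN noted."})
print(doc.models.get_generated_text("huggingface", "summarization"))
```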
Changes to `CcdData`: now uses a `ConceptLists` dataclass to hold problems, medications, and allergies concepts, for a better interface with the `Document` class.
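Roughly, the grouping looks like the sketch below (field names and the `Concept` shape are assumptions for illustration, not the library's exact definitions):

```python
# Illustrative sketch of the ConceptLists grouping.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Concept:
    code: Optional[str] = None
    code_system: Optional[str] = None   # e.g. "SNOMED CT"
    name: Optional[str] = None


@dataclass
class ConceptLists:
    problems: List[Concept] = field(default_factory=list)
    medications: List[Concept] = field(default_factory=list)
    allergies: List[Concept] = field(default_factory=list)


@dataclass
class CcdData:
    concepts: ConceptLists = field(default_factory=ConceptLists)
    note: Optional[str] = None


ccd = CcdData()
ccd.concepts.problems.append(
    Concept(code="38341003", code_system="SNOMED CT", name="Hypertension")
)
```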
Changes to the `.load()` method of `BasePipeline`: this method now configures the pipeline with additional logic that parses a model and model source (either a string - the name of or path to a model - or a callable - a LangChain chain object) into a `ModelConfig` object.
Added `ModelRouter`, a helper which returns the appropriate integration component for a given `ModelConfig`.
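The routing idea, sketched with stand-in classes (the `ModelConfig` fields, the `parse_model_source()` helper, and the stub components below are assumptions, not the real implementation):

```python
# Illustrative sketch of ModelConfig / ModelRouter - names and fields are assumptions.
from dataclasses import dataclass, field
from typing import Callable, Optional, Union


# Stub integration components standing in for the real SpacyNLP / HFTransformer /
# LangChainLLM classes, just so the routing below runs.
class SpacyNLP:
    def __init__(self, *args, **kwargs): self.args, self.kwargs = args, kwargs

class HFTransformer:
    def __init__(self, *args, **kwargs): self.args, self.kwargs = args, kwargs

class LangChainLLM:
    def __init__(self, *args, **kwargs): self.args, self.kwargs = args, kwargs


@dataclass
class ModelConfig:
    source: str                                  # "spacy", "huggingface", or "langchain"
    model: Optional[str] = None                  # model name or path
    pipeline_object: Optional[Callable] = None   # e.g. a LangChain chain
    task: Optional[str] = None
    kwargs: dict = field(default_factory=dict)


def parse_model_source(model: Union[str, Callable], source: str, task: str, **kwargs) -> ModelConfig:
    """What .load() conceptually does: turn the user's input into a ModelConfig."""
    if callable(model):
        return ModelConfig(source=source, pipeline_object=model, task=task, kwargs=kwargs)
    return ModelConfig(source=source, model=str(model), task=task, kwargs=kwargs)


class ModelRouter:
    """Returns the appropriate integration component for a ModelConfig."""

    def get_component(self, config: ModelConfig):
        if config.source == "spacy":
            return SpacyNLP(config.model, **config.kwargs)
        if config.source == "huggingface":
            return HFTransformer(task=config.task, model=config.model, **config.kwargs)
        if config.source == "langchain":
            return LangChainLLM(config.pipeline_object, **config.kwargs)
        raise ValueError(f"Unknown model source: {config.source}")


config = parse_model_source("facebook/bart-large-cnn", source="huggingface", task="summarization")
component = ModelRouter().get_component(config)
```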
Templates: Users can pass in a Jinja template for custom CDS cards (this will extend to CDAs too, but that's a matter for a different issue).
Added `CdsCardCreator`: this component either extracts generated text from model outputs in the pipeline or takes in specified static content, and parses this into a CDS `Card` object using Jinja templates (a default is used if not provided).
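A sketch of the template-to-card step (the default template string and the `Card` fields shown here are assumptions; the real component's defaults will differ):

```python
# Illustrative sketch of rendering generated text into a CDS Hooks-style card via Jinja2.
import json
from dataclasses import dataclass

from jinja2 import Template

# Assumed default template - the library's actual default will differ.
DEFAULT_TEMPLATE = Template(
    '{"summary": "{{ summary }}", "indicator": "{{ indicator }}", '
    '"source": {"label": "{{ source }}"}}'
)


@dataclass
class Card:
    summary: str
    indicator: str
    source: dict


def create_card(generated_text: str, template: Template = DEFAULT_TEMPLATE) -> Card:
    # Render the template, then load the resulting JSON into a Card object.
    rendered = template.render(
        summary=generated_text[:140], indicator="info", source="healthchain"
    )
    data = json.loads(rendered)
    return Card(**data)


card = create_card("Consider reviewing the antihypertensive regimen at the next visit.")
print(card.summary)
```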
Renamed integration components to be more descriptive: `SpacyComponent` -> `SpacyNLP`, `HuggingFaceComponent` -> `HFTransformer`, `LangchainComponent` -> `LangChainLLM`.
Also pass `kwargs` through to the integration components.
Added a `._add_concepts_to_hc_doc()` helper method to `SpacyNLP`, which takes the entities from the spaCy doc, parses them into `Concept` objects, and adds them to the `.concepts` attribute of `Document`. This is hard-coded to always add new concepts as SNOMED Problems for now, but will be made configurable in future.
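Conceptually it does something like the following (the `ProblemConcept` shape and function name are assumptions; a clinical model such as scispaCy's `en_core_sci_sm` would be needed to actually extract clinical entities, the general-purpose model below just keeps the sketch runnable):

```python
# Rough sketch of turning spaCy entities into Document concepts - illustrative only.
from dataclasses import dataclass, field
from typing import List

import spacy


@dataclass
class ProblemConcept:
    name: str
    code_system: str = "SNOMED CT"  # hard-coded as SNOMED Problems for now


@dataclass
class Concepts:
    problems: List[ProblemConcept] = field(default_factory=list)


def add_concepts_to_doc(spacy_doc, concepts: Concepts) -> None:
    # Every detected entity becomes a problem concept, per the current behaviour.
    for ent in spacy_doc.ents:
        concepts.problems.append(ProblemConcept(name=ent.text))


nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
concepts = Concepts()
add_concepts_to_doc(nlp("Patient reports chest pain and nausea."), concepts)
print([p.name for p in concepts.problems])
```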
Removed the default spaCy tokenizer in `TextPreprocessor`: this is redundant as you can just use `SpacyNLP`. For better separation of concerns, this component now only does very simple text preprocessing - the default is `.split()`, but users can also pass in a tokenizer object (`Callable`) to use with the component.
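For example, a minimal version of that behaviour might look like this (the class interface is an assumption):

```python
# Minimal sketch of the simplified TextPreprocessor behaviour - interface is assumed.
import re
from typing import Callable, List, Optional


class TextPreprocessor:
    def __init__(self, tokenizer: Optional[Callable[[str], List[str]]] = None):
        # Fall back to simple whitespace splitting when no tokenizer is given.
        self.tokenizer = tokenizer or (lambda text: text.split())

    def __call__(self, text: str) -> List[str]:
        return self.tokenizer(text)


print(TextPreprocessor()("Patient denies fever, chills, or cough."))

# With a custom tokenizer callable:
print(TextPreprocessor(tokenizer=lambda t: re.findall(r"\w+", t))("Patient denies fever."))
```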
And finally, added the `MedicalCodingPipeline` and `SummarizationPipeline` implementations.
The pipeline does some internal coercion to make the task either `ner` or `summarization`, but there is no strict validation yet.
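Hypothetical usage, assuming `.load()` accepts a model name/path or a callable as described above (the import path, model identifiers, and exact call pattern are illustrative, not a guaranteed API):

```python
# Hypothetical usage sketch - the import path, model names, and exact .load()
# signature are assumptions based on the description above, not a guaranteed API.
from healthchain.pipeline import MedicalCodingPipeline, SummarizationPipeline

coder = MedicalCodingPipeline.load("en_core_sci_sm")                 # spaCy-style model name
summarizer = SummarizationPipeline.load("facebook/bart-large-cnn")   # Hugging Face model id
```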
Testing
Added tests for:
`CdsCardCreator`: `test_card_creator.py`
`ModelRouter`: `test_modelrouter.py`
`.load()` method: `test_pipeline_load.py`
`test_medicalcoding.py`, `test_summarization.py`
`test_integrations.py`
`TextPreprocessor` initializes tokenizer object - `test_preprocessor.py`
`Document` methods - `test_containers.py`
Documentation