Oufattole / meds-torch

MIT License

Code-to-Text Integration #119

Open Oufattole opened 3 weeks ago

Oufattole commented 3 weeks ago

Implement a custom TextCodeEncoder that processes clinical codes together with their associated text descriptions.

We need an approach that handles text descriptions for codes efficiently, keeps them temporally aligned with the event sequence, and avoids re-encoding common codes. Concretely, we need an input encoder that does the following:

  1. Data Preparation (at class initialization):
    • Load code descriptions from the metadata Parquet file
    • Load a text tokenizer via Hugging Face AutoTokenizer
    • Create a lookup dictionary: code/vocab_index -> tokenized_text
  2. Batch Processing:
    • Extract unique codes from the batch to avoid redundant processing
    • Pass the tokenized text of those unique codes through the ClinicalBERT encoder
    • Cache encodings for frequently used codes
  3. Temporal Integration:
    • Map encoded text representations back to the original code positions in the sequence
    • Combine with existing triplet embeddings (code + value + time delta)
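
To make the steps above concrete, here is a rough, framework-free sketch of the deduplicate, encode, cache, and scatter-back flow. Everything in it is an assumption for illustration and not existing meds-torch code: the class name `TextCodeEncoderSketch`, the `encode_fn` callable (standing in for the tokenizer plus ClinicalBERT forward pass), the `padding_index` of 0, and the choice to combine by element-wise sum. A real version would presumably work on tensors instead of nested lists.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


class TextCodeEncoderSketch:
    """Hypothetical sketch: encode each unique code's description once, then reuse it."""

    def __init__(
        self,
        code_to_text: Dict[int, str],
        encode_fn: Callable[[List[str]], List[Vector]],
        padding_index: int = 0,  # assumption: index 0 marks padding
    ) -> None:
        # Step 1: lookup built once at initialization (here: code index -> description).
        self.code_to_text = code_to_text
        # Stand-in for "tokenize + run the text encoder"; returns one vector per text.
        self.encode_fn = encode_fn
        self.padding_index = padding_index
        # Cache of already-encoded codes, shared across batches.
        self.cache: Dict[int, Vector] = {}

    def encode_batch(self, codes: Sequence[Sequence[int]]) -> List[List[Vector]]:
        # Step 2: collect the unique non-padding codes in this batch.
        unique = sorted(
            {c for seq in codes for c in seq if c != self.padding_index}
        )
        # Only codes missing from the cache are sent to the encoder.
        # A code without a description raises KeyError in this sketch.
        missing = [c for c in unique if c not in self.cache]
        if missing:
            vectors = self.encode_fn([self.code_to_text[c] for c in missing])
            self.cache.update(zip(missing, vectors))

        # Step 3: map vectors back to the original positions; padding gets zeros.
        dim = len(next(iter(self.cache.values()))) if self.cache else 0
        return [
            [
                list(self.cache[c]) if c != self.padding_index else [0.0] * dim
                for c in seq
            ]
            for seq in codes
        ]


def combine_with_triplet(
    text_emb: List[List[Vector]], triplet_emb: List[List[Vector]]
) -> List[List[Vector]]:
    """Assumption: combine by element-wise sum; concatenation would also be possible."""
    return [
        [[t + e for t, e in zip(tv, ev)] for tv, ev in zip(t_seq, e_seq)]
        for t_seq, e_seq in zip(text_emb, triplet_emb)
    ]
```

With this shape, the expensive encoder call scales with the number of distinct uncached codes in a batch, not with batch size times sequence length. If the text encoder is fine-tuned during training, the cache would need to be invalidated or limited to inference, since cached vectors would otherwise go stale.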