Implement a custom TextCodeEncoder that processes clinical codes with their associated text descriptions:
The approach must handle text descriptions for codes efficiently, preserve temporal alignment, and avoid redundant processing of common codes. This calls for an _inputencoder that does the following:
Data Preparation (at class initialization):
Load code descriptions from the metadata parquet file
Load a text tokenizer via Hugging Face AutoTokenizer
Create a lookup dictionary: code/vocab_index -> tokenized_text
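
The initialization steps above could look roughly like the sketch below. It assumes the metadata parquet file has columns named `code/vocab_index` and `description` (the second column name is a guess), and it treats the tokenizer as any callable from a string to a list of token ids. The helper names `load_descriptions`, `load_tokenizer`, and `build_token_lookup` are hypothetical, as is the `max_length` default; the checkpoint name is left to the caller.

```python
from typing import Callable, Dict, List


def load_descriptions(parquet_path: str) -> Dict[int, str]:
    """Read {vocab_index: description} from the metadata parquet file.

    Assumes columns named "code/vocab_index" and "description".
    """
    import pandas as pd  # lazy import: only needed when reading real metadata

    df = pd.read_parquet(parquet_path, columns=["code/vocab_index", "description"])
    df = df.dropna(subset=["description"])  # codes without text are skipped
    return dict(zip(df["code/vocab_index"].tolist(), df["description"].tolist()))


def load_tokenizer(model_name: str, max_length: int = 64) -> Callable[[str], List[int]]:
    """Wrap a Hugging Face tokenizer as a str -> token-id-list callable."""
    from transformers import AutoTokenizer  # lazy import, same reason as above

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return lambda text: tokenizer(text, truncation=True, max_length=max_length)["input_ids"]


def build_token_lookup(
    descriptions: Dict[int, str],
    tokenize: Callable[[str], List[int]],
) -> Dict[int, List[int]]:
    """Tokenize every description once: code/vocab_index -> token ids."""
    return {code: tokenize(text) for code, text in descriptions.items()}
```

Tokenizing once at initialization keeps the per-batch work down to dictionary lookups.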
Batch Processing:
Extract unique codes from the batch to avoid redundant processing
Pass tokenized text through the ClinicalBERT encoder
Cache encodings for frequently used codes
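
A minimal sketch of the batch-processing steps, under some assumptions: the text encoder is injected as a plain callable standing in for a ClinicalBERT forward pass, the cache is an unbounded dict, and the class name `CachedCodeEncoder` and its methods are hypothetical. Note that cached embeddings only stay valid while the text encoder's weights are frozen.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


class CachedCodeEncoder:
    """Encode each distinct code once per batch and reuse cached results."""

    def __init__(
        self,
        token_lookup: Dict[int, List[int]],
        encode_tokens: Callable[[List[List[int]]], List[Vector]],
    ) -> None:
        self.token_lookup = token_lookup    # code -> token ids, built at init
        self.encode_tokens = encode_tokens  # stand-in for the ClinicalBERT call
        self.cache: Dict[int, Vector] = {}  # code -> embedding

    def encode_batch(self, batch_codes: Sequence[Sequence[int]]) -> Dict[int, Vector]:
        # 1. Collect the distinct codes that appear anywhere in the batch.
        unique_codes = sorted({code for seq in batch_codes for code in seq})
        # 2. Only codes missing from the cache need a forward pass.
        missing = [code for code in unique_codes if code not in self.cache]
        if missing:
            vectors = self.encode_tokens([self.token_lookup[code] for code in missing])
            self.cache.update(zip(missing, vectors))
        # 3. Return one embedding per distinct code in this batch.
        return {code: self.cache[code] for code in unique_codes}
```

With this layout the encoder sees at most one row per distinct code, however often that code repeats across sequences in the batch.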
Temporal Integration:
Map encoded text representations back to original code positions in sequence
Combine with existing triplet embeddings (code + value + time delta)
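
The temporal-integration steps might be sketched as follows, using nested lists in place of tensors so the example stays dependency-free. Combining by element-wise sum is an assumption (concatenation or a learned projection would also fit the description), and the function names `scatter_to_positions` and `combine` are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def scatter_to_positions(
    batch_codes: Sequence[Sequence[int]],
    code_embeddings: Dict[int, Vector],
) -> List[List[Vector]]:
    """Place each code's text embedding at every position where the code occurs."""
    return [[code_embeddings[code] for code in seq] for seq in batch_codes]


def combine(
    triplet_emb: List[List[Vector]],
    text_emb: List[List[Vector]],
) -> List[List[Vector]]:
    """Element-wise sum of triplet and text embeddings (shapes must match)."""
    return [
        [[a + b for a, b in zip(t_vec, x_vec)] for t_vec, x_vec in zip(t_seq, x_seq)]
        for t_seq, x_seq in zip(triplet_emb, text_emb)
    ]
```

Because the lookup is done per position, the output keeps the original sequence order and therefore the temporal alignment. With PyTorch tensors the same mapping is usually expressed with `torch.unique(codes, return_inverse=True)`, then indexing the matrix of unique-code embeddings with the inverse indices.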