Oufattole / meds-torch

MIT License

all_text: Convert the entire patient history into text and use a language model to get an embedding #17

Closed by Oufattole 2 months ago

Oufattole commented 2 months ago

Pipeline:

  1. Convert the history to one big string (use the same approach as for observation_text, then concatenate all of those strings)
  2. Load a tokenizer (could be from a pretrained Hugging Face model) and tokenize the string into integers
  3. Allow loading pretrained language models and generating a representation for downstream tasks (maybe we can do Mamba: mamba-130m-hf, masked imputation model: bert, and autoregressive transformer: microsoft/phi-1_5). We should add some caching support later, maybe just using safetensors with a dictionary from event ID to the tensor.
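
A minimal sketch of the three steps above, plus the proposed safetensors cache. It is not the meds-torch implementation: `observation_to_text`, `history_to_text`, `embed_histories`, and `cache_embeddings` are hypothetical names, the `(code, value)` observation layout is an assumption, and mean pooling over the final hidden states is just one possible way to get a single representation. The Hugging Face and safetensors imports are deferred so the text-conversion helpers can be used on their own.

```python
from typing import Dict, List, Sequence


def observation_to_text(code: str, value=None) -> str:
    """Hypothetical stand-in for the observation_text formatting."""
    return code if value is None else f"{code} {value}"


def history_to_text(observations: Sequence[tuple], sep: str = " | ") -> str:
    """Step 1: concatenate the per-observation strings into one history string."""
    return sep.join(observation_to_text(code, value) for code, value in observations)


def embed_histories(texts: List[str], model_name: str):
    """Steps 2-3: tokenize, run a pretrained model, mean-pool over real tokens."""
    # Deferred imports: only needed when embeddings are actually computed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        # Some causal-LM tokenizers ship without a pad token (assumption: reuse EOS).
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModel.from_pretrained(model_name).eval()

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
        ).last_hidden_state
    # Zero out padding positions, then average over the remaining tokens.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)


def cache_embeddings(embeddings_by_event_id: Dict[str, "torch.Tensor"], path: str) -> None:
    """Write an {event ID (as string): tensor} dictionary to a safetensors file."""
    from safetensors.torch import save_file

    save_file({k: v.contiguous() for k, v in embeddings_by_event_id.items()}, path)
```

For example, `history_to_text([("HR", 80), ("DX//I10", None)])` yields one string that can be passed to `embed_histories` together with a model identifier of your choice. Whether a given architecture accepts an `attention_mask` argument depends on the model class and library version, so treat the forward call as an assumption to verify per model.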

Oufattole commented 2 months ago

Pipeline

  1. process_triplet: process a single event stream into a text string
  2. Tokenize this string into integers
  3. Use a collate function to get a batch of these token sequences
  4. Feed the batch to the language model
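
The collate step is the least obvious part of this pipeline, so here is a rough sketch of steps 1 and 3 in plain Python. The `process_triplet` signature, the `(time, code, numeric_value)` triplet layout, and the helper names `collate` and `to_model_inputs` are assumptions for illustration, not the repository's actual API. Token IDs for step 2 are assumed to come from whatever tokenizer is loaded.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Assumed triplet layout: (time, code, numeric_value).
Triplet = Tuple[str, str, Optional[float]]


def process_triplet(triplets: Sequence[Triplet]) -> str:
    """Step 1 (hypothetical signature): render one event stream as a text string."""
    parts = []
    for time, code, value in triplets:
        parts.append(f"{time} {code}" if value is None else f"{time} {code} {value}")
    return "; ".join(parts)


def collate(token_ids: Sequence[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Step 3: right-pad every sequence to the longest one and build a mask."""
    max_len = max(len(ids) for ids in token_ids)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in token_ids]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in token_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def to_model_inputs(batch: Dict[str, List[List[int]]]):
    """Step 4 helper: convert the padded lists to tensors for the language model."""
    import torch  # deferred so the padding logic has no hard dependency

    return {name: torch.tensor(rows, dtype=torch.long) for name, rows in batch.items()}
```

In a PyTorch setup, a function like `collate` (composed with `to_model_inputs`) could be passed as the `collate_fn` of a `DataLoader`. The `pad_id` should match the tokenizer's pad token ID rather than the default of 0 used here.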