databio / gtars

Performance-critical tools to manipulate, analyze, and process genomic interval data, primarily in support of geniml, our genomic machine learning Python package.

Tokenize all the things - genomic omni-model #19

Open · nleroy917 opened 2 months ago

nleroy917 commented 2 months ago

OpenAI's GPT-4o is not open source. I was recently reading a Reddit thread speculating on the architecture that lets it handle text, vision, and voice modalities in one model.

One user speculates:

> I wonder if it's something closer to the original DALL-E where the image was decomposed into image tokens ... The embeddings of the image tokens and text tokens could share the same latent space, so that model was "natively" multimodal.

Another replies:

> Yes, I think that's exactly it ... 'Just' train an encoder tokenizer for each modality, maybe define some of the extra 100k BPEs as modality-specific delimiters similar to delimiting prompts/end-of-text tokens - and then it's just 'tokenize all the things'
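
That recipe is easy to sketch. Below is a minimal, hypothetical illustration (the `ModalityTokenizer` class, the `<txt>`/`<img>` delimiters, and the ID layout are my own assumptions, not anything from gtars or geniml): each modality's tokenizer maps into a disjoint slice of one shared vocabulary, and a few reserved IDs act as the modality-specific delimiters the commenter describes.

```python
# Minimal sketch of "tokenize all the things" - hypothetical, not a gtars/geniml API.
# Each modality gets its own tokenizer mapping into a disjoint slice of one
# shared vocabulary; reserved delimiter tokens mark modality boundaries.

# Reserved special tokens, analogous to repurposing spare BPE slots.
SPECIALS = {"<txt>": 0, "</txt>": 1, "<img>": 2, "</img>": 3, "<pad>": 4}

class ModalityTokenizer:
    def __init__(self, name: str, vocab: list[str], offset: int):
        self.name = name
        self.open, self.close = SPECIALS[f"<{name}>"], SPECIALS[f"</{name}>"]
        # Map this modality's symbols into its own slice of the shared ID space.
        self.stoi = {s: offset + i for i, s in enumerate(vocab)}

    def encode(self, symbols: list[str]) -> list[int]:
        # Delimit the span so the model knows which modality it is reading.
        return [self.open, *(self.stoi[s] for s in symbols), self.close]

offset = len(SPECIALS)
txt = ModalityTokenizer("txt", ["the", "cat", "sat"], offset)
img = ModalityTokenizer("img", [f"patch_{i}" for i in range(256)], offset + 3)

# One flat token stream the model consumes "natively":
stream = txt.encode(["the", "cat"]) + img.encode(["patch_0", "patch_1"])
print(stream)  # [0, 5, 6, 1, 2, 8, 9, 3]
```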

All of which got me thinking: what would it look like to "tokenize all the things" in a genomic context? We have modalities like scATAC-seq, scRNA-seq, and methylation, plus the textual metadata associated with these datasets. I've proposed two multi-modal architectures in the past: the scRNA-seq tokenizer (https://github.com/databio/geniml_dev/issues/123) and a CLIP-like model. But could we come up with a scheme to "tokenize all the things" such that a single model could take anything as input (scATAC-seq, scRNA-seq, methylation, or text) and then either 1) output an embedding, or 2) generate another modality? A sketch of what that might look like follows.
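
To make that concrete, here is a hedged sketch of a genomic omni-tokenizer. Everything in it is an assumption chosen for illustration, not an existing geniml/gtars interface: a fixed region universe stands in for scATAC-seq tokens, a gene vocabulary for scRNA-seq, binned beta values for methylation, and word-level tokens as a stand-in for text BPE.

```python
# Hypothetical genomic "tokenize all the things" sketch - not a geniml/gtars API.
# scATAC peaks map to a fixed region universe, scRNA to a gene vocabulary,
# methylation to binned beta-value tokens, and metadata text to word tokens.
# All four share one ID space so a single model can consume any mix of them.

SPECIALS = ["<pad>", "<atac>", "</atac>", "<rna>", "</rna>",
            "<meth>", "</meth>", "<txt>", "</txt>"]

# Assumed per-modality symbol inventories (toy sizes for illustration).
UNIVERSE = [f"chr1:{s}-{s + 500}" for s in range(0, 5000, 500)]  # region universe
GENES = ["SOX2", "POU5F1", "NANOG"]
METH_BINS = [f"beta_{i}/10" for i in range(11)]  # binned methylation levels
WORDS = ["embryonic", "stem", "cell"]

# Lay the modalities out end-to-end in one shared vocabulary.
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + UNIVERSE + GENES + METH_BINS + WORDS)}

def encode(modality: str, symbols: list[str]) -> list[int]:
    """Wrap one modality's symbols in its delimiter tokens."""
    return [VOCAB[f"<{modality}>"], *(VOCAB[s] for s in symbols), VOCAB[f"</{modality}>"]]

# One cell's accessible regions, expressed genes, methylation bin, and metadata
# become a single token stream; a transformer over this stream could pool it
# into an embedding, or autoregressively generate the tokens of another modality.
stream = (
    encode("atac", ["chr1:0-500", "chr1:1000-1500"])
    + encode("rna", ["SOX2", "NANOG"])
    + encode("meth", ["beta_3/10"])
    + encode("txt", ["embryonic", "stem", "cell"])
)
print(len(VOCAB), stream)
```

Whether the per-modality vocabularies come from fixed inventories like this (much as universe-based interval tokenization already works for regions) or from learned codebooks is the open design question; but once everything lives in one ID space, both the embedding objective and cross-modal generation reduce to standard sequence modeling.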

Of course, at the end of the day... we need the datasets :/