Basic object encoding - Githubissues

JacobFV commented 2 years ago

I think a first step in the tensorcode project is being able to encode any Python object, like so,

import tensorcode as tc

x = 'the dog'
x_enc = tc.encode(x)

y = [4, {3:2, 'key':'value'}, None, np.ones((1,2,3))]
y_enc = tc.encode(y)

z = MyCompositeDataType(1, 2, 'three', Four(), 5*[None,], (y_enc, y, x))
z_enc = tc.encode(z)

Concepts

There are three main classes of types: atomic, complex, and composite.
- By "atomic" types, I mean int's, str's, float's, etc.
- By "complex" types, I mean list's, dict's, set's, and other built-in 'container' types.
- By "composite" types, I mean custom classes, eg, MyClass, myObj.__type__. Composite types might be simplified to dict's.
Encodings are vector embeddings that are representative of their corresponding Python object. Complex and composite encodings should capture both 1) the values of their constituients and 2) their internal structural organization.
No single neural network architecture is best suited to encode all possible types. (In most cases, it's physically impossible, eg, try to fit a str inside a float or convert a matrix into a scalar without loosing information.) I think it's better if we have a unique encoder for each of the fundamental "atomic" types, a few encoders to cover all classes of "complex" types, and a single encoder to handle arbitrary "composite" types. Here's just one sketch,
- encode str's with bert
- encode int's, float's, and bool's just by converting them directly to tensors
- encode set's, list's, and other iterables by recursively encoding each element and then aggregating the encodings through a transformer or RNN
- encode composite types by a) recursively encoding each of their children down to atomic types up until a depth limit is reached, or b) treat the composite object like a vert of interest on a larger graph of connected attributes and perform message passing / graph attention for N rounds and return the final hidden state on that vert. (The Python scope can be viewed as a heterogenous directed graph: object's are vertices, attributes are edges, and each object's __dict__ is its adjacency list. google-research / python-graphs may be useful.)

Details

Whatever algorithms/models we use, they should be (mostly) end-to-end differentiable. I write "mostly" because there might be useful algorithms that involve hard-attention or other 0th order techniques.

It should be possible to specify the output dimensionality, like so:

# tensorcode.py
def encode(obj, dimensionality=DEFAULT_DIMENSIONALITY):
  ...

It should be possible to add special cases at runtime. Something along the lines of:

# tensorcode.add_encoder(<type_to_match>, <encoder_function>)
tensorcode.add_encoder(MyCustomType, lambda my_custom_type: ...)

String encoding seem like a very special "special case", since strings can contain all sorts of content, eg, natural language, formal language, object serializations, arbitrary bytes, etc. There should be some convenient way to set the language model / tokenizer employed for string encoding. Maybe like this:

# before tensorcode
tokens = tokenizer.encode(['input sentence'])
outputs = model(tokens)
encoding = outputs['hidden state'][0, -1, :]  # unbatch and get the last hidden state

# after tensorcode
tensorcode.config.set_tokenizer(tokenizer) 
tensorcode.config.set_lm(model)

encoding = tensorcode.encode('input sentence')

Many rich semantic types have entire zoo's of models that can encode them, eg, 10,000's of LM's on huggingface for classifying (and in the process of classifying, also encoding) text, about 1000 for images, and about a 100 for audio. I think it'd be convenience if tensorcode provided direct integration with huggingface / transformers, huggingface / diffusers, Stability-AI/diffusers, and other model zoo's.

tensorcode.encode should support tangled cyclic composition relationships such as:

a = [None, None]; b = [None, None]
a[0] = a; a[1] = b; b[0] = a; b[1] = b;
tc.encode([a, b])

I suggested recursive encoding for complex and composite data types, but since the objects may have cyclic relationships, I don't know if this is a good strategy. On the other hand, the GNN-based approach is ameable to all kinds of bells and whistles, such as typed edges, hard/soft attention, and parallelization. However, it might still be advantageous to implement some kind of an edge-case for verts that represent an iterable type, since the iterable itself is a chain graph which usually takes a long time to propagate down.
There are other relationships that are not directly captured in the __dict__ attribute such as the signature (via inspect) and AST of a function (via parser) or a type's __mro__ and __subclasses__ (additional hidden attrs on every python object). Don't worry about them for now.
We'll want to be able to swap out the underlying ML algorithm, so we can quickly test lots of them, eg, GCN vs. GAT vs. GIN vs. Perciever vs. .... This means we should break up the algorithm into separate interchangable modules (which will also be convenient for later functions)
I apolagize for making a mess in the source code. Feel free to delete or modify anything that gets in the way.
The jax framework seems advantageous for its speed. However, many of the model zoos' models are written for pytorch or tensorflow. So I recommend implementing as much of the code as possible using the unifyai / ivy framework. This will make it easier for people who use the library to fit it into various use cases. However, it's fine if we have to use a specific ML backend.

Benchmark

Let's aim to make this function tensorcode.encode able to encode the following on a gpu-accelerated gcp deep learning vm / aws deep learning ami (or smaller). Later tasks will focus on contrastive learning, clustering, and decoding these encodings, but for this task, let's focus on just making the encoding:

Unit tests:

objects = [
  False, 1, 1.0,  'String',  # atomic types
  [1, 2, 3], (1, 2, 3), {1, 2, 3}, {1: 'one', 2: 'two', 3: 'three'},  # complex types
  object()  # composite types
]

for obj in objects:
  tensorcode.encode(obj)

Contrived examples

a = [None, None]; b = [None, None]
a[0] = a; a[1] = b; b[0] = a; b[1] = b;
tc.encode([a, b])

Simplify existing ML dev work

tensorcode.config.set_tokenizer(tokenizer) 
tensorcode.config.set_lm(model)
encoding = tensorcode.encode('input sentence')

AGI use-cases

import ast
import numpy as np

Command = ast.parse('''
  def add(a, b):
     return a + b
  ''') 
class Orientation:
  pitch: float
  yaw: float
  roll: float
class Location:
  x: float
  y: float
  z: float
class Observation:
  image: np.array
  command: Command
  orientation: Orientation
  location: Location
# initialized ommited for brevity

obs = # todo: initialize random observation
tensorcode.encode(obs)

(Fyi, I didn't write any of this code in an IDE so it may be error-ridden.)

JacobFV commented 2 years ago

Notes from today's meeting:

It's hard to compare "four" with 4
- maybe overcome this using a weighted mixture, eg, $w{text} x{text} + w{num} x{num}$ where
  - $w{text}, w{num} = softmax({ w1 \cdot x{text}, w2 \cdot x{num} })$
  - $x{text}, x{num}$ are semantic and positional encodings respectively
We might encode composite objects like so: code → AST → graph
- code → AST is implemented in ast and tree_sitter
- We need to find an AST → graph library.
  - Moved to new issue #2
  - The library we find may not do everything we want, but it'll most likely be simpler than implementing and maintaining an AST → graph tool ourselves

JacobFV commented 2 years ago

Actually, this may be the most concise and compelling benchmark for basic object encoding: tensorcode.encode should be able to extract meaningful information from arbitrary REST API JSON responses without needing to know in-advance what the shape of the JSON will be.

For example, suppose I query conceptnet as demonstrated here:

>>> import requests
>>> obj = requests.get('http://api.conceptnet.io/c/en/example').json()
>>> obj.keys()
dict_keys(['view', '@context', '@id', 'edges'])

>>> len(obj['edges'])
20

>>> obj['edges'][2]
{'@id': '/a/[/r/IsA/,/c/en/example/n/,/c/en/information/n/]',
 'dataset': '/d/wordnet/3.1',
 'end': {'@id': '/c/en/information/n',
  'label': 'information',
  'language': 'en',
  'sense_label': 'n',
  'term': '/c/en/information'},
 'license': 'cc:by/4.0',
 'rel': {'@id': '/r/IsA', 'label': 'IsA'},
 'sources': [{'@id': '/s/resource/wordnet/rdf/3.1',
   'contributor': '/s/resource/wordnet/rdf/3.1'}],
 'start': {'@id': '/c/en/example/n',
  'label': 'example',
  'language': 'en',
  'sense_label': 'n',
  'term': '/c/en/example'},
 'surfaceText': [[example]] is a type of [[information]]',
 'weight': 2.0}

The node example contains 20 links, and judging by the size of link 2, obj likely contains a truckload of information, of which I will probabbly only use a small percentage.

tensorcode.encode should be able to extract meaningful information from obj like so:

def get_obj(name):
   # the same code from above, now in a function
   return obj

objs = [get_obj(name) for name in ('yen', 'gold', 'au-dollar', 'TSLA', 'bitcoin')]
enc_objs = [tc.encode(obj) for obj in objs]
enc_objs_arr = np.array(enc_objs)  # shape: [5, d_emb]
enc_objs_arr = enc_objs_arr / np.norm(enc_objs_arr, axis=-1)
similarities = einsum('id, jd -> ij', enc_objs_arr, enc_objs_arr)  # shape: [5, 5]

Of course, computing similarity is only one use case for semantic embeddings.

TensaCo / tensacode

Basic object encoding #1

Concepts

Details

Benchmark