TensaCo / tensacode

Basic object encoding #1

Open JacobFV opened 2 years ago

JacobFV commented 2 years ago

I think a first step in the tensorcode project is being able to encode any Python object, like so:

import numpy as np
import tensorcode as tc

x = 'the dog'
x_enc = tc.encode(x)

y = [4, {3:2, 'key':'value'}, None, np.ones((1,2,3))]
y_enc = tc.encode(y)

z = MyCompositeDataType(1, 2, 'three', Four(), 5*[None,], (y_enc, y, x))
z_enc = tc.encode(z)

Benchmark

Let's aim to make tensorcode.encode able to encode the following on a GPU-accelerated GCP Deep Learning VM or AWS Deep Learning AMI (or smaller). Later tasks will focus on contrastive learning, clustering, and decoding these encodings, but for this task, let's focus on just producing the encoding:

  1. Unit tests:

    objects = [
      False, 1, 1.0,  'String',  # atomic types
      [1, 2, 3], (1, 2, 3), {1, 2, 3}, {1: 'one', 2: 'two', 3: 'three'},  # complex types
      object()  # composite types
    ]
    
    for obj in objects:
      tensorcode.encode(obj)
  2. Contrived examples

    a = [None, None]; b = [None, None]
    a[0] = a; a[1] = b; b[0] = a; b[1] = b
    tc.encode([a, b])
  3. Simplify existing ML dev work

    tensorcode.config.set_tokenizer(tokenizer)
    tensorcode.config.set_lm(model)
    encoding = tensorcode.encode('input sentence')
  4. AGI use-cases

    import ast
    import textwrap
    import numpy as np
    
    # ast.parse rejects indented source, so dedent the snippet first
    Command = ast.parse(textwrap.dedent('''
      def add(a, b):
         return a + b
      '''))
    class Orientation:
      pitch: float
      yaw: float
      roll: float
    class Location:
      x: float
      y: float
      z: float
    class Observation:
      image: np.ndarray  # np.array is a function, not a type
      command: ast.Module  # the type of Command above
      orientation: Orientation
      location: Location
    # initializers omitted for brevity
    
    obs = ...  # TODO: initialize a random observation
    tensorcode.encode(obs)

(FYI, I didn't write any of this code in an IDE, so it may be error-ridden.)
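
The self-referential lists in benchmark item 2 would send a naive recursive walk into infinite recursion, so any encoder needs some way of noticing that it has already visited a container. Below is a minimal sketch of one way to do that, assuming a memo keyed on id(). The function to_tree and its tuple format are hypothetical names used for illustration only; they are not part of tensorcode.

```python
def to_tree(obj, seen=None):
    """Return a finite tree for a possibly cyclic object graph.

    Containers that were already visited are replaced by a ('ref', index)
    marker instead of being walked again. (Hypothetical helper.)
    """
    if seen is None:
        seen = {}  # id(container) -> visit index
    if isinstance(obj, (list, tuple, set, frozenset, dict)):
        if id(obj) in seen:
            return ('ref', seen[id(obj)])
        index = len(seen)
        seen[id(obj)] = index  # record before recursing so cycles terminate
        if isinstance(obj, dict):
            children = [(to_tree(k, seen), to_tree(v, seen))
                        for k, v in obj.items()]
        else:
            children = [to_tree(item, seen) for item in obj]
        return ('node', index, type(obj).__name__, children)
    return ('leaf', obj)


# The contrived example from the benchmark list
a = [None, None]; b = [None, None]
a[0] = a; a[1] = b; b[0] = a; b[1] = b

tree = to_tree([a, b])
print(tree)
```

Shared but acyclic references are collapsed into ref markers as well, which may or may not be what a real encoder wants.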

JacobFV commented 2 years ago

Notes from today's meeting:

JacobFV commented 2 years ago

Actually, this may be the most concise and compelling benchmark for basic object encoding: tensorcode.encode should be able to extract meaningful information from arbitrary REST API JSON responses without needing to know in advance what the shape of the JSON will be.

For example, suppose I query ConceptNet as demonstrated here:

>>> import requests
>>> obj = requests.get('http://api.conceptnet.io/c/en/example').json()
>>> obj.keys()
dict_keys(['view', '@context', '@id', 'edges'])

>>> len(obj['edges'])
20

>>> obj['edges'][2]
{'@id': '/a/[/r/IsA/,/c/en/example/n/,/c/en/information/n/]',
 'dataset': '/d/wordnet/3.1',
 'end': {'@id': '/c/en/information/n',
  'label': 'information',
  'language': 'en',
  'sense_label': 'n',
  'term': '/c/en/information'},
 'license': 'cc:by/4.0',
 'rel': {'@id': '/r/IsA', 'label': 'IsA'},
 'sources': [{'@id': '/s/resource/wordnet/rdf/3.1',
   'contributor': '/s/resource/wordnet/rdf/3.1'}],
 'start': {'@id': '/c/en/example/n',
  'label': 'example',
  'language': 'en',
  'sense_label': 'n',
  'term': '/c/en/example'},
 'surfaceText': '[[example]] is a type of [[information]]',
 'weight': 2.0}

The node example contains 20 links, and judging by the size of link 2, obj likely contains a truckload of information, of which I will probably use only a small percentage.
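
One way to see how much is packed into such a response without knowing its shape is to flatten it into (path, value) pairs. The sketch below is only an illustration of walking JSON of unknown structure, not how tensorcode.encode is meant to work; flatten is a hypothetical helper, and edge is a trimmed copy of the link shown above so the example runs offline.

```python
def flatten(obj, prefix=()):
    """Yield (path, leaf_value) pairs from nested dicts and lists.

    Empty dicts and lists yield nothing. (Hypothetical helper.)
    """
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, prefix + (key,))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            yield from flatten(value, prefix + (i,))
    else:
        yield prefix, obj


# Trimmed copy of obj['edges'][2] from the session above
edge = {
    '@id': '/a/[/r/IsA/,/c/en/example/n/,/c/en/information/n/]',
    'rel': {'@id': '/r/IsA', 'label': 'IsA'},
    'sources': [{'@id': '/s/resource/wordnet/rdf/3.1'}],
    'weight': 2.0,
}

pairs = list(flatten(edge))
for path, value in pairs:
    print('.'.join(map(str, path)), '=', value)
```

Every leaf ends up with a path such as sources.0.@id, so the same walk applies to any response regardless of nesting.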

tensorcode.encode should be able to extract meaningful information from obj like so:

def get_obj(name):
   # the same query as above, now parameterized by concept name
   return requests.get(f'http://api.conceptnet.io/c/en/{name}').json()

objs = [get_obj(name) for name in ('yen', 'gold', 'au-dollar', 'TSLA', 'bitcoin')]
enc_objs = [tc.encode(obj) for obj in objs]
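enc_objs_arr = np.array(enc_objs)  # shape: [5, d_emb]
# keepdims=True keeps the norms at shape [5, 1] so the division broadcasts per row
enc_objs_arr = enc_objs_arr / np.linalg.norm(enc_objs_arr, axis=-1, keepdims=True)
similarities = np.einsum('id,jd->ij', enc_objs_arr, enc_objs_arr)  # cosine similarities, shape: [5, 5]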

Of course, computing similarity is only one use case for semantic embeddings.