JacobFV opened 2 years ago
Notes from today's meeting:
- `ast` and `tree_sitter`
Actually, this may be the most concise and compelling benchmark for basic object encoding: `tensorcode.encode` should be able to extract meaningful information from arbitrary REST API JSON responses without needing to know in advance what the shape of the JSON will be.
For example, suppose I query ConceptNet as demonstrated here:
```python
>>> import requests
>>> obj = requests.get('http://api.conceptnet.io/c/en/example').json()
>>> obj.keys()
dict_keys(['view', '@context', '@id', 'edges'])
>>> len(obj['edges'])
20
>>> obj['edges'][2]
{'@id': '/a/[/r/IsA/,/c/en/example/n/,/c/en/information/n/]',
 'dataset': '/d/wordnet/3.1',
 'end': {'@id': '/c/en/information/n',
         'label': 'information',
         'language': 'en',
         'sense_label': 'n',
         'term': '/c/en/information'},
 'license': 'cc:by/4.0',
 'rel': {'@id': '/r/IsA', 'label': 'IsA'},
 'sources': [{'@id': '/s/resource/wordnet/rdf/3.1',
              'contributor': '/s/resource/wordnet/rdf/3.1'}],
 'start': {'@id': '/c/en/example/n',
           'label': 'example',
           'language': 'en',
           'sense_label': 'n',
           'term': '/c/en/example'},
 'surfaceText': '[[example]] is a type of [[information]]',
 'weight': 2.0}
```
The node `example` contains 20 links, and judging by the size of link 2, `obj` likely contains a truckload of information, of which I will probably only use a small percentage.
`tensorcode.encode` should be able to extract meaningful information from `obj` like so:
```python
import numpy as np
import requests
import tensorcode as tc

def get_obj(name):
    # the same code from above, now in a function
    return requests.get(f'http://api.conceptnet.io/c/en/{name}').json()

objs = [get_obj(name) for name in ('yen', 'gold', 'au-dollar', 'TSLA', 'bitcoin')]
enc_objs = [tc.encode(obj) for obj in objs]
enc_objs_arr = np.array(enc_objs)  # shape: [5, d_emb]
enc_objs_arr = enc_objs_arr / np.linalg.norm(enc_objs_arr, axis=-1, keepdims=True)
similarities = np.einsum('id,jd->ij', enc_objs_arr, enc_objs_arr)  # shape: [5, 5]
```
Of course, computing similarity is only one use case for semantic embeddings.
I think a first step in the tensorcode project is being able to encode any Python object, like so:
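(Whatever example originally followed here seems to have been lost in extraction; as a stand-in, the intended call shape is presumably a single function from an arbitrary object to a fixed-length vector. The exact signature below is a guess, not an existing API.)

```python
import tensorcode as tc

# Hypothetical usage sketch: any Python object in, one fixed-length embedding out.
emb = tc.encode({"ticker": "TSLA", "price": 251.3, "tags": ["ev", "auto"]})
# emb is expected to be array-like with shape [d_emb], matching the snippet above.
```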
Concepts
Roughly three kinds of types need to be handled:

- atomic types: `int`'s, `str`'s, `float`'s, etc.
- complex types: `list`'s, `dict`'s, `set`'s, and other built-in 'container' types.
- composite types: `MyClass`, `myObj.__type__`, etc. Composite types might be simplified to `dict`'s.

(You can't fit a `str` inside a `float` or convert a matrix into a scalar without losing information.) I think it's better if we have a unique encoder for each of the fundamental "atomic" types, a few encoders to cover all classes of "complex" types, and a single encoder to handle arbitrary "composite" types. Here's just one sketch (see the code after this list):

- encode `str`'s with bert
- encode `int`'s, `float`'s, and `bool`'s just by converting them directly to tensors
- encode `set`'s, `list`'s, and other iterables by recursively encoding each element and then aggregating the encodings through a transformer or RNN
- encode composite `object`'s as graphs: `object`'s are vertices, attributes are edges, and each object's `__dict__` is its adjacency list. (google-research / python-graphs may be useful.)
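A minimal sketch of that dispatch structure, purely illustrative: the real encoders would be learned models (bert for strings, a transformer/RNN for aggregation, a graph network for composites), but the type-dispatch + recursion skeleton would look roughly like this:

```python
import numpy as np

D_EMB = 64  # illustrative embedding width

def encode_str(s: str) -> np.ndarray:
    # stand-in for a real language model: hash bytes into a bag-of-bytes vector
    vec = np.zeros(D_EMB)
    for b in s.encode("utf-8"):
        vec[b % D_EMB] += 1.0
    return vec

def encode_number(x) -> np.ndarray:
    # int / float / bool: convert the scalar directly to tensor form
    return np.full(D_EMB, float(x))

def encode(obj, _seen=None) -> np.ndarray:
    # single entry point: dispatch on type, recurse through containers and __dict__
    _seen = set() if _seen is None else _seen
    if id(obj) in _seen:               # cycle guard for tangled object graphs
        return np.zeros(D_EMB)
    _seen.add(id(obj))

    if isinstance(obj, str):
        return encode_str(obj)
    if isinstance(obj, (bool, int, float)):
        return encode_number(obj)
    if isinstance(obj, dict):
        items = [encode(k, _seen) + encode(v, _seen) for k, v in obj.items()]
        return np.mean(items, axis=0) if items else np.zeros(D_EMB)
    if isinstance(obj, (list, tuple, set, frozenset)):
        items = [encode(x, _seen) for x in obj]
        return np.mean(items, axis=0) if items else np.zeros(D_EMB)
    # composite objects: fall back to their attribute graph (here just the __dict__)
    return encode(getattr(obj, "__dict__", repr(obj)), _seen)
```

Here the aggregation is a plain mean only to keep the sketch runnable; swapping in learned per-type encoders is the actual point of the project.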
Details

String encoding seems like a very special "special case", since strings can contain all sorts of content, e.g., natural language, formal language, object serializations, arbitrary bytes, etc. There should be some convenient way to set the language model / tokenizer employed for string encoding. Maybe like this:
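(The snippet that originally followed appears to have been lost; the block below is only a guess at the kind of configuration hook meant. `tc.set_text_encoder` and the context-manager form are hypothetical names, not an existing tensorcode API.)

```python
import tensorcode as tc

# Global default: choose the language model / tokenizer used whenever a str is encoded.
tc.set_text_encoder(model="bert-base-uncased", tokenizer="bert-base-uncased")

# Or scoped, so different strings can use different models:
with tc.text_encoder(model="sentence-transformers/all-MiniLM-L6-v2"):
    emb = tc.encode(obj)  # str's inside obj are embedded with the model chosen above
```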
`tensorcode.encode` should also support tangled cyclic composition relationships, i.e. objects whose attributes refer back to each other (illustrated in the short sketch at the end of this section). And there is information that lives outside an object's `__dict__` attribute, such as the signature (via `inspect`) and AST of a function (via `parser`), or a type's `__mro__` and `__subclasses__` (additional hidden attrs on every python object). Don't worry about them for now.
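To make "tangled cyclic composition" concrete (a hypothetical illustration, not from the original issue): two objects that contain each other, which a naive recursive encoder would loop on forever unless it tracks what it has already visited:

```python
import tensorcode as tc  # hypothetical usage, as above

class Person:
    def __init__(self, name):
        self.name = name
        self.friend = None

a, b = Person("a"), Person("b")
a.friend, b.friend = b, a          # a -> b -> a: a cycle in the object graph

emb = tc.encode(a)  # should terminate, e.g. by memoizing id(obj) during recursion
```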
Benchmark

Let's aim to make this function, `tensorcode.encode`, able to encode the following on a GPU-accelerated GCP deep learning VM / AWS deep learning AMI (or smaller). Later tasks will focus on contrastive learning, clustering, and decoding these encodings, but for this task, let's focus on just making the encodings.

Unit tests:
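(The list of unit tests itself appears to have been cut off here. Purely as an illustration of the kind of tests that would fit — the objects and assertions below are hypothetical, not the original list:)

```python
import numpy as np
import tensorcode as tc

def test_encode_returns_fixed_size_vector():
    # atomic, container, and composite inputs should all map to vectors of the same width
    samples = [42, 3.14, "an example string", [1, 2, 3], {"k": "v"}, object()]
    embs = [np.asarray(tc.encode(s)) for s in samples]
    assert len({e.shape for e in embs}) == 1  # one common [d_emb] shape

def test_encode_handles_rest_api_json():
    import requests
    obj = requests.get('http://api.conceptnet.io/c/en/example').json()
    emb = np.asarray(tc.encode(obj))
    assert emb.ndim == 1 and np.isfinite(emb).all()
```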
AGI use-cases
(FYI, I didn't write any of this code in an IDE, so it may be error-ridden.)