Recursive-Embedding (rec-emb)

Train embeddings for hierarchically structured data. rec-emb is a research project.

Idea

Many real-world phenomena are structured hierarchically, so building a semantic model that exploits tree structure is a natural need.

TL;DR, the docker folder provides several starting points:

The Embedding Model

The rec-emb embedding model

The Data Model

The rec-emb data model

Preprocessing

The rec-emb data model includes the following:

Currently, the project provides three data sources:

See the preprocessing docker setup for how to parse and convert these datasets into the rec-emb format.

The rec-emb Embedding Model

The embedding model creates a single embedding for any tree generated from the graph structure. It utilizes a reduce function to combine the embeddings of a node's children and a map function that is applied along edges. Together, they form a Headed Tree Unit (HTU). Both functions can depend on the node data (at least one of them should).
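The HTU recursion can be sketched as follows. This is a minimal illustration, not the actual rec-emb implementation: the dimensionality, the linear-plus-tanh map function, and the sum-based reduce function are all assumptions chosen for brevity.

```python
import numpy as np

DIM = 8  # assumed embedding dimensionality
rng = np.random.default_rng(0)
W_map = rng.standard_normal((DIM, DIM)) * 0.1  # assumed map parameters

def map_fn(child_emb):
    """Map function applied along the edge from a child to its parent
    (assumed here: a linear transformation followed by tanh)."""
    return np.tanh(W_map @ child_emb)

def reduce_fn(node_data, mapped_children):
    """Reduce function combining the node's own data embedding with the
    mapped embeddings of its children (assumed here: elementwise sum)."""
    return node_data + sum(mapped_children)

def embed_tree(node):
    """Recursively compute a single embedding for the whole tree.

    `node` is a dict: {"data": np.ndarray of shape (DIM,), "children": [...]}.
    """
    mapped = [map_fn(embed_tree(child)) for child in node["children"]]
    return reduce_fn(node["data"], mapped)

leaf = {"data": rng.standard_normal(DIM), "children": []}
root = {"data": rng.standard_normal(DIM), "children": [leaf, leaf]}
print(embed_tree(root).shape)  # a single fixed-size embedding per tree
```

Note that both functions receive node-derived input here; in rec-emb either one may be made independent of the node data, as long as at least one depends on it.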

Implemented reduce functions:

Implemented map functions:

HTU implementations:

Similarity Scoring

Given two embeddings, calculate a floating point value that indicates their similarity. Here, one embedding may consist of one or multiple concatenated tree embeddings, optionally further transformed with one or multiple fully connected layers (FCs). Furthermore, one embedding may be a class label embedding.
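As one concrete example of such a scoring function, cosine similarity maps a pair of embeddings to a value in [-1, 1]. This is an illustrative choice, not necessarily one of the functions rec-emb implements:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: the dot product of
    the L2-normalized vectors. A common similarity function for embeddings,
    shown here as an illustration."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```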

Implemented similarity functions:

Training

See the training docker setup for how to train and test models for individual tasks.

TASK: Predict Relatedness Scores

Predict how strongly two trees are related.

"Related" can be interpreted in several ways; e.g., in the case of SICK, several annotators intuitively scored pairs of sentences, and the resulting scores were averaged.

The similarity score between the two tree embeddings is used as a measure of the strength of the relatedness. In general, one FC is applied to each tree embedding before scoring.
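The two-step pipeline described above (one FC per tree embedding, then a similarity score) can be sketched as below. The shapes, the shared FC weights, and the use of cosine similarity are assumptions for illustration, not the project's actual configuration:

```python
import numpy as np

DIM = 8  # assumed embedding dimensionality
rng = np.random.default_rng(1)
W_fc = rng.standard_normal((DIM, DIM)) * 0.1  # assumed FC parameters
b_fc = np.zeros(DIM)

def fc(x):
    """A single fully connected layer (assumed: linear + tanh)."""
    return np.tanh(W_fc @ x + b_fc)

def relatedness(tree_emb_a, tree_emb_b):
    """Apply one FC to each tree embedding, then score the pair with a
    similarity function (assumed here: cosine similarity)."""
    a, b = fc(tree_emb_a), fc(tree_emb_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_a = rng.standard_normal(DIM)
emb_b = rng.standard_normal(DIM)
print(relatedness(emb_a, emb_b))  # a scalar relatedness score
```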

Task instances:

TASK: Multiclass Prediction

Predict whether a tree matches one (or multiple) labels.

The similarity score between one (or multiple concatenated) tree embeddings and a class label embedding is used as the probability that the instance belongs to that class. In general, one FC is applied to the (in the case of RTE, concatenated) tree embedding(s).
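Scoring a tree embedding against a set of class label embeddings can be sketched as follows. Using dot-product similarity and a softmax to turn the per-class scores into probabilities is an assumption made for this illustration:

```python
import numpy as np

def class_probabilities(tree_emb, label_embs):
    """Score a (possibly FC-transformed) tree embedding against each class
    label embedding and convert the scores to class probabilities.

    tree_emb:   shape (dim,)
    label_embs: shape (n_classes, dim), one learned embedding per class
    """
    scores = label_embs @ tree_emb        # assumed: dot-product similarity
    scores = scores - scores.max()        # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # assumed: softmax normalization

labels = np.eye(4)                        # 4 hypothetical one-hot label embeddings
probs = class_probabilities(np.array([1.0, 0.0, 0.0, 0.0]), labels)
print(probs)  # highest probability for the matching class
```

For multi-label settings, a per-class sigmoid over the raw scores would replace the softmax, so that each class probability is independent.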

Task instances:

Evaluation

See the eval notebook for an evaluation of the model prediction results. This notebook was used to examine overall prediction quality and resource consumption. Furthermore, it contains a record-wise evaluation of the results.

License

Copyright 2019 Arne Binder

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.