Train embeddings for hierarchically structured data. rec-emb is a research project.
Many real-world phenomena are structured hierarchically. Creating a semantic model that exploits tree structure is therefore a natural goal.
TL;DR: the docker folder provides several starting points:
- The rec-emb embedding model
- The rec-emb data model
The rec-emb data model includes the following:
Currently, the project provides three data sources:

- sentence pairs labeled with an inference relation (`neutral`, `entailment`, or `contradiction`)
- sentence pairs annotated with a semantic relation such as `cause-effect` or `message-topic`

See the preprocessing docker setup for how to parse and convert these datasets into the rec-emb format.
The embedding model creates a single embedding for any tree generated from the graph structure. It utilizes a `reduce` function to combine children and a `map` function that is applied along edges. Together, they form a Headed Tree Unit (HTU). Both functions can depend on the node data (at least one of them should).
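As a minimal sketch of how a `reduce` and a `map` function could form an HTU, the following recursive function embeds a tree bottom-up. The function names, the linear `map`, and the summation `reduce` are illustrative assumptions, not the project's actual API.

```python
# Hypothetical HTU sketch: embed a tree by mapping each child's embedding
# along its edge, reducing the mapped children, and incorporating node data.
import numpy as np

DIM = 4
rng = np.random.default_rng(0)
W_map = rng.standard_normal((DIM, DIM)) * 0.1  # assumed linear `map` weights

def map_fn(child_emb):
    # `map` function applied along the edge to a child's embedding
    return np.tanh(W_map @ child_emb)

def reduce_fn(embs):
    # `reduce` function combining the mapped children; summation is one option
    return np.sum(embs, axis=0)

def htu(node_data, children):
    # children: list of (node_data, grandchildren) pairs
    if not children:
        return node_data
    mapped = [map_fn(htu(d, c)) for d, c in children]
    return node_data + reduce_fn(mapped)

# Tiny example tree: a root with two leaf children
leaf = lambda: rng.standard_normal(DIM)
root_emb = htu(leaf(), [(leaf(), []), (leaf(), [])])
print(root_emb.shape)  # (4,)
```

Here the node data enters additively after the reduction; the HTU variants listed below differ in exactly where `map` and the node data are applied.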
Implemented `reduce` functions:

Implemented `map` functions:

HTU implementations:

- `reduce` the children first and incorporate the node data via one `map` execution.
- `map` each child individually and `reduce` afterwards.
- No `map` step at all; just execute one FC to allow for different dimensionality of the (word) embeddings and the internal state.

Given two embeddings, calculate a floating-point value that indicates their similarity. Here, one embedding may consist of one or multiple concatenated tree embeddings, optionally further transformed with one or multiple FCs. Furthermore, one embedding may be a class label embedding.
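For illustration, a similarity function of this kind could look as follows; cosine similarity is an assumed example, not necessarily one of the project's implemented functions.

```python
# Illustrative similarity scoring: given two embeddings, return a single
# floating-point similarity value (cosine similarity assumed here).
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
e1, e2 = rng.standard_normal(8), rng.standard_normal(8)
score = cosine_sim(e1, e2)  # value in [-1, 1]
```

Either input could equally be a concatenation of several tree embeddings, or a class label embedding.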
Implemented similarity functions:
See the training docker setup for how to train and test models for individual tasks.
Predict how strongly two trees are related.
"Related" can be interpreted in several ways, e.g., in the case of SICK, several annotators scored pairs of sentences intuitively and the resulting scores are averaged.
The similarity score between two tree embeddings is used as a measure of the strength of the relatedness. In general, one FC is applied to each tree embedding before scoring.
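The scoring pipeline described above can be sketched as follows; the shared FC weights and the cosine scorer are assumptions for illustration.

```python
# Hedged sketch of relatedness prediction: one FC per tree embedding,
# then a similarity score between the transformed embeddings.
import numpy as np

rng = np.random.default_rng(2)
DIM = 8
W_fc = rng.standard_normal((DIM, DIM)) * 0.1  # assumed shared FC weights

def fc(x):
    # fully connected layer applied to a tree embedding before scoring
    return np.tanh(W_fc @ x)

def relatedness(tree_emb_a, tree_emb_b):
    a, b = fc(tree_emb_a), fc(tree_emb_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

score = relatedness(rng.standard_normal(DIM), rng.standard_normal(DIM))
```

For SICK-style data, this score would be regressed against the averaged annotator relatedness score.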
Task instances:
Predict whether a tree matches one (or multiple) labels.
The similarity score between one (or multiple concatenated) tree embeddings and the class label embedding is used as the probability that the instance belongs to that class. In general, one FC is applied to the (concatenated, in the case of RTE) tree embedding(s).
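A minimal sketch of this classification setup, assuming a dot-product similarity against trainable class label embeddings and a softmax to turn the per-class similarities into probabilities (the actual probability mapping in rec-emb may differ):

```python
# Illustrative label matching: similarity between a (FC-transformed) tree
# embedding and each class label embedding, normalized into probabilities.
import numpy as np

rng = np.random.default_rng(3)
DIM, N_CLASSES = 8, 3
class_embs = rng.standard_normal((N_CLASSES, DIM))  # assumed label embeddings
W_fc = rng.standard_normal((DIM, DIM)) * 0.1        # assumed FC weights

def fc(x):
    return np.tanh(W_fc @ x)

def class_probs(tree_emb):
    sims = class_embs @ fc(tree_emb)   # one similarity score per class
    e = np.exp(sims - sims.max())      # numerically stable softmax
    return e / e.sum()

probs = class_probs(rng.standard_normal(DIM))
```

For multiple tree embeddings (e.g., a premise/hypothesis pair in RTE), they would be concatenated before the FC.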
Task instances:
See the eval notebook for an evaluation of model prediction results. This notebook was used to examine overall prediction quality and resource consumption. Furthermore, it contains record-wise result evaluation.
Copyright 2019 Arne Binder
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.