This is quite a lot of work. I will summarize it in TODOs here based on the paper:
Baseline features (section 2.4 in the paper)
[ ] Query features (4/7)
[ ] Get data from Wikipedia (inlinks, outlinks, tables, page views)
[ ] Table features (3/9)
[ ] Query-table features (0/7)
Semantic features (section 3 in the paper)
Content extraction (section 3.1 in the paper)
[x] Words extraction from queries
[x] Words extraction from tables
This uses a messy regex for now; it might be improved once we figure out how to do entity extraction from tables.
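The current regex-based extraction could look something like the sketch below. The pattern and the `extract_words` helper are hypothetical, not the actual code in the repo:

```python
import re

# Minimal sketch of regex-based word extraction from table cells.
# WORD_RE and the tokenization policy are assumptions; the real code
# may tokenize differently.
WORD_RE = re.compile(r"[A-Za-z]+")

def extract_words(cells):
    """Return lowercase alphabetic tokens from a list of cell strings."""
    words = []
    for cell in cells:
        words.extend(m.group(0).lower() for m in WORD_RE.finditer(cell))
    return words
```

This drops numbers and punctuation entirely, which is part of why it is "messy"; entity-aware extraction would keep multi-word names together.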
[x] Entity extraction from queries
This currently uses an API call to DBpedia. It might be implemented differently once we figure out how to do entity retrieval. EDIT: we could possibly change this to a lookup with the Wikipedia Python API tool we may also use for entity retrieval, OR check whether we have an RDF2vec vector for the entity.
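For the DBpedia route, the network call aside, the response handling could be sketched as below. The `"docs"`/`"resource"` shape follows the JSON returned by the current lookup.dbpedia.org service, but treat that shape as an assumption to verify:

```python
# Sketch: pull entity URIs out of a parsed DBpedia Lookup JSON response.
# Assumes the response has a "docs" list whose entries carry a "resource"
# list of URIs (the shape of the current DBpedia Lookup API, as far as
# I can tell -- verify before relying on it).
def entities_from_lookup(response_json, max_entities=5):
    uris = []
    for doc in response_json.get("docs", [])[:max_entities]:
        resource = doc.get("resource")
        if resource:  # "resource" is a (possibly empty) list of URI strings
            uris.append(resource[0])
    return uris
```

Keeping the parsing separate from the HTTP call makes it easy to swap DBpedia Lookup for the Wikipedia API later.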
[ ] Entity extraction from tables
This could use the [[entity_name|text in table]] link syntax from Wikipedia, which they state in the paper they use (I think). This also includes:
[ ] Core column detection (section 3.1.3 in the paper)
[ ] Entity retrieval (section 3.1.4 in the paper)
I am not sure how to do the entity retrieval; it seems quite complex and a lot of work. EDIT: maybe we can use the Wikipedia Python API for this.
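If the tables do keep the [[entity|anchor text]] wiki markup, extracting entities per cell and picking a core column could be sketched like this. The link-density heuristic for core column detection is one possible reading of section 3.1.3, not necessarily the paper's exact method:

```python
import re

# Matches [[Entity]] and [[Entity|anchor text]] wiki links and captures
# the entity name. Assumes cells still contain raw wiki markup.
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def cell_entities(cell):
    """Entity names linked from one cell string."""
    return LINK_RE.findall(cell)

def core_column(columns):
    """Heuristic core-column detection: return the index of the column
    whose cells contain the most entity links. This is an assumption
    about section 3.1.3, not the paper's confirmed algorithm."""
    def link_count(col):
        return sum(len(cell_entities(c)) for c in col)
    return max(range(len(columns)), key=lambda i: link_count(columns[i]))
```

A real implementation would likely combine link density with other signals (leftmost position, uniqueness of values).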
Semantic Representations (section 3.2 in the paper)
[ ] Bag of entities
[ ] Bag of categories
[x] Word embeddings
[ ] Graph embeddings
This also does not seem that easy to implement. I found this, which might be helpful. We could also maybe use this, which seems to contain the vectors directly; I am not sure whether it covers all entities, though.
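The bag-of-entities and bag-of-categories representations should be simpler than the embeddings: something like a binary vector over a fixed vocabulary, as in the sketch below. The helper name and the binary (rather than weighted) scheme are assumptions about section 3.2:

```python
# Sketch: bag-of-entities as a binary vector over a fixed entity
# vocabulary. Bag-of-categories would work identically over a
# category vocabulary. Binary weighting is an assumption; the paper
# may weight entries differently.
def bag_vector(items, vocabulary):
    vocab_index = {v: i for i, v in enumerate(vocabulary)}
    vec = [0] * len(vocabulary)
    for item in items:
        if item in vocab_index:
            vec[vocab_index[item]] = 1
    return vec
```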
Similarity measures (section 3.3 in the paper)
[x] Early fusion
[ ] Early fusion weighted by TF-IDF
Right now the word vectors use the wrong function to calculate the early fusion average. This should be changed so that each vector is weighted by its term's TF-IDF score, as described in the paper.
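The fix could look like the sketch below: a TF-IDF-weighted centroid instead of a plain average. The `embeddings` and `tfidf` dicts are hypothetical stand-ins for whatever the existing pipeline provides:

```python
# Sketch of TF-IDF-weighted early fusion. Each word vector is scaled by
# its TF-IDF weight, summed, and normalized by the total weight, giving
# a weighted centroid instead of a plain average.
# `embeddings`: word -> vector (list of floats); `tfidf`: word -> weight.
# Both are assumed inputs from the existing pipeline.
def early_fusion(words, embeddings, tfidf):
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for w in words:
        if w in embeddings and w in tfidf:
            weight = tfidf[w]
            total = [t + weight * v for t, v in zip(total, embeddings[w])]
            weight_sum += weight
    if weight_sum == 0.0:
        return total  # no known words: zero vector
    return [t / weight_sum for t in total]
```

With uniform weights this reduces to the plain average, so it can replace the current function directly.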