This is quite a lot of work. I will summarize it in TODOs here based on the paper:
Baseline features (section 2.4 in the paper)
[ ] Query features (4/7)
[ ] Get data from Wikipedia (inlinks, outlinks, tables, page views)
[ ] Table features (3/9)
[ ] Query-table features (0/7)
Semantic features (section 3 in the paper)
Content extraction (section 3.1 in the paper)
[x] Words extraction from queries
[x] Words extraction from tables
This uses a messy regex for now; it might be improved once we figure out how to do entity extraction from tables.
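The current regex-based extraction could look something like the sketch below. The pattern and the `extract_words` helper are hypothetical, not the actual code in the repo:

```python
import re

# Minimal sketch of regex-based word extraction from table cells.
# WORD_RE and the tokenization policy are assumptions; the real code
# may tokenize differently.
WORD_RE = re.compile(r"[A-Za-z]+")

def extract_words(cells):
    """Return lowercase alphabetic tokens from a list of cell strings."""
    words = []
    for cell in cells:
        words.extend(m.group(0).lower() for m in WORD_RE.finditer(cell))
    return words
```

This drops numbers and punctuation entirely, which is part of why it is "messy"; entity-aware extraction would keep multi-word names together.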
[x] Entity extraction from queries
This currently uses an API call to DBpedia. It might be implemented differently once we figure out how to do entity retrieval. EDIT: we could possibly change this to a lookup with the Wikipedia Python API tool we may also use for entity retrieval, OR check whether we have an RDF2vec vector for the entity.
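For the DBpedia route, the network call aside, the response handling could be sketched as below. The `"docs"`/`"resource"` shape follows the JSON returned by the current lookup.dbpedia.org service, but treat that shape as an assumption to verify:

```python
# Sketch: pull entity URIs out of a parsed DBpedia Lookup JSON response.
# Assumes the response has a "docs" list whose entries carry a "resource"
# list of URIs (the shape of the current DBpedia Lookup API, as far as
# I can tell -- verify before relying on it).
def entities_from_lookup(response_json, max_entities=5):
    uris = []
    for doc in response_json.get("docs", [])[:max_entities]:
        resource = doc.get("resource")
        if resource:  # "resource" is a (possibly empty) list of URI strings
            uris.append(resource[0])
    return uris
```

Keeping the parsing separate from the HTTP call makes it easy to swap DBpedia Lookup for the Wikipedia API later.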
[ ] Entity extraction from tables
This could use the [[entity_name|text in table]] link syntax from Wikipedia, which they state in the paper they use (I think). This also includes:
[ ] Core column detection (section 3.1.3 in the paper)
[ ] Entity retrieval (section 3.1.4 in the paper)
I am not sure how to do the entity retrieval; it seems quite complex and a lot of work. EDIT: maybe we can use the Wikipedia Python API for this.
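If the tables do keep the [[entity|anchor text]] wiki markup, extracting entities per cell and picking a core column could be sketched like this. The link-density heuristic for core column detection is one possible reading of section 3.1.3, not necessarily the paper's exact method:

```python
import re

# Matches [[Entity]] and [[Entity|anchor text]] wiki links and captures
# the entity name. Assumes cells still contain raw wiki markup.
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def cell_entities(cell):
    """Entity names linked from one cell string."""
    return LINK_RE.findall(cell)

def core_column(columns):
    """Heuristic core-column detection: return the index of the column
    whose cells contain the most entity links. This is an assumption
    about section 3.1.3, not the paper's confirmed algorithm."""
    def link_count(col):
        return sum(len(cell_entities(c)) for c in col)
    return max(range(len(columns)), key=lambda i: link_count(columns[i]))
```

A real implementation would likely combine link density with other signals (leftmost position, uniqueness of values).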
Semantic Representations (section 3.2 in the paper)
[ ] Bag of entities
[ ] Bag of categories
[x] Word embeddings
[ ] Graph embeddings
This also does not seem that easy to implement. I found this, which might be helpful. We could also maybe use this, which seems to contain the vectors directly; I am not sure whether it covers all entities, though.
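The bag-of-entities and bag-of-categories representations should be simpler than the embeddings: something like a binary vector over a fixed vocabulary, as in the sketch below. The helper name and the binary (rather than weighted) scheme are assumptions about section 3.2:

```python
# Sketch: bag-of-entities as a binary vector over a fixed entity
# vocabulary. Bag-of-categories would work identically over a
# category vocabulary. Binary weighting is an assumption; the paper
# may weight entries differently.
def bag_vector(items, vocabulary):
    vocab_index = {v: i for i, v in enumerate(vocabulary)}
    vec = [0] * len(vocabulary)
    for item in items:
        if item in vocab_index:
            vec[vocab_index[item]] = 1
    return vec
```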
Similarity measures (section 3.3 in the paper)
[x] Early fusion
[ ] Early fusion weighted by TF-IDF
Right now the word vectors use the wrong function to calculate the early fusion average. This should be changed so that each vector is weighted by its term's TF-IDF score, as described in the paper.
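The fix could look like the sketch below: a TF-IDF-weighted centroid instead of a plain average. The `embeddings` and `tfidf` dicts are hypothetical stand-ins for whatever the existing pipeline provides:

```python
# Sketch of TF-IDF-weighted early fusion. Each word vector is scaled by
# its TF-IDF weight, summed, and normalized by the total weight, giving
# a weighted centroid instead of a plain average.
# `embeddings`: word -> vector (list of floats); `tfidf`: word -> weight.
# Both are assumed inputs from the existing pipeline.
def early_fusion(words, embeddings, tfidf):
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for w in words:
        if w in embeddings and w in tfidf:
            weight = tfidf[w]
            total = [t + weight * v for t, v in zip(total, embeddings[w])]
            weight_sum += weight
    if weight_sum == 0.0:
        return total  # no known words: zero vector
    return [t / weight_sum for t in total]
```

With uniform weights this reduces to the plain average, so it can replace the current function directly.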