epfl-ada / ada-2023-project-amonavis

ada-2023-project-amonavis created by GitHub Classroom
0 stars 0 forks source link

The Man and the Machines

Data story (Update for P3)

We documented our journey in a data story available at this link

Abstract

When AlphaGo defeated Lee Sedol in 2016, this caused a shake up in the AI world 1. For a long time, having an AI beat a human in Go was seen as impossible, but thanks to recent advancements, this was suddenly achieved. Now, we will go onto putting AI to the test in a different game: Wikispeedia (also known as The Wikipedia Game).

When humans play this game, they tend to focus on finding a common hub that contains various topics, and then narrowing down the search as needed 2. How would this differ from an AI that works with semantic distance? Are humans going to reign in this search game?

The Man and The Machine focuses on this, comparing an AI that has similar constraints to humans, and trying to find out how they compare. The project starts from human navigation paths, defined by the users playing the Wikispeedia game, to then dwell into the navigation ability of the machine, through semantic-distance-based paths: how differently will the two entities approach this same problem?

Research questions (Update for P3)

Project Repository Structure

This repository contains various components essential for the project. Below is a detailed description of the organization and content of this repository.

Final Deliverable

final_deliverable.ipynb

Helper Modules

data_readers.py

machine_searchers.py

plot_functions.py

Data Storage

Datasets Folder

Figures Folder

Machine Outputs Folder

Supplementary Notebooks

other_notebooks_used Folder

Methods

First of all, we will use NetworkX to create the Wikispeedia graph to work on. For the initial exploratory analysis, along with the degrees of the nodes, we are using PageRank to measure the centrality of the nodes (Fig. 1; Fig. 2). After this initial exploration, where we did not encounter any missing values, we will clean the dataset from outliers we do not need. Finally, to identify trends and differences in the patterns, we will group the human paths by the category of the target node, see their how correlated the categories are with the length of the shortest paths (ANOVA test) and test the differences between the distributions. Other tests will be done, but we first want to get a proper understanding of how humans explore the path.

To compute the semantic distances between each page, we will embed each title into a vector, through the Bert transformers, and we will use the cosine similarity as our main similarity metric. After computing our similarity matrix (whose values will only be between -1 and +1, by definition), we will define the semantic distance as $dist(x, y) = 1 - cos \textunderscore similarity (x, y) + 1$: the first two terms are essential to represent the semantic distance as the opposite of the semantic similarity (the more similar two titles are, the closer they should be in the graph), the second $+1$ is a design choice to work only with values greater than 1 and effectively implement the A* search algorithm.

As mentioned above, we will implement the A* search algorithm which uses an approximation for the node distance, $f(n) = g(n) + h(n)$ assuming it's worse than the actual distance, to explore the graph, where $g(n)$ is the cost from the start node to n, and $h(n)$ is the estimated cost from n to the goal, balancing exact path costs with heuristic estimates to find the shortest paths. We will then find the shortest paths using the semantic distance between the nodes as the weights of the edges constructed between the linked articles, and compare them to the shortest paths provided in the dataset (given that the research of the shortest path is based on heuristics, we can expect significant differences based on the different design choices).

Furthermore, given the multitude of navigation paths that are provided along with the Wikispeedia network, we will be able to compare the human navigation paths with the "semantic-paths": to do so, we will group the paths with the same starting node and target node by "human path is shorter", "semantic path is shorter", "both approaches have the same length". This will be useful to draw our conclusions and ultimately look for patterns within the different performances measured, with targets belonging to the same Wikipedia category sharing the same performance.

More detail on our plan and mathematical details can be found on the GitHub issues page.

Virtual environment

For a smooth collaboration and code execution, we created a virtual environment (included in this GitHub) called as our team name, amonavis. To activate it, run source ./amonavis/bin/activate

Proposed timeline and organisation within the team

For an effective collaboration, we are using GitHub issues to share our to-dos, label each entry by priority and scope and assign the tasks between un. This will help us sharing our progress and mutually support each other's tasks.

Our timeline is defined in 5 weeks, from P2 to P3:

Questions for TAs:

Figures

hello
Fig. 1: Nodes with above-average PageRank values (importance of the incoming links defines article ranking)

hello
Fig. 2: Nodes with higher degrees (importance in terms of number of connections within the graph)

References

1) “AlphaGo Movie”. AlphaGo Movie. Archived from the original on 3 January 2018. Retrieved 14 October 2017. 2) Erly, A. (2021) The Original Chess Engine: Alan Turing’s Turochamp, Chess.com. Available at: https://www.chess.com/blog/the_real_greco/the-original-chess-engine-alan-turings-turochamp (Accessed: 21 December 2023). 3) Deep Blue (no date) IBM. Available at: https://www.ibm.com/history/deep-blue (Accessed: 21 December 2023). 4) “Wikispeedia: An Online Game for Inferring Semantic Distances between Concepts”. Robert West, Joelle Pineau, and Doina Precup 21st International Joint Conference on Artificial Intelligence (IJCAI), 2009. 5) Potamias, Michalis, et al. “Fast shortest path distance estimation in large networks.” Proceedings of the 18th ACM conference on Information and knowledge management. 2009. 6) https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2