The Man and the Machines

Data story (Update for P3)

We documented our journey in a data story available at this link

Abstract

When AlphaGo defeated Lee Sedol in 2016, this caused a shake up in the AI world ¹. For a long time, having an AI beat a human in Go was seen as impossible, but thanks to recent advancements, this was suddenly achieved. Now, we will go onto putting AI to the test in a different game: Wikispeedia (also known as The Wikipedia Game).

When humans play this game, they tend to focus on finding a common hub that contains various topics, and then narrowing down the search as needed ². How would this differ from an AI that works with semantic distance? Are humans going to reign in this search game?

The Man and The Machine focuses on this, comparing an AI that has similar constraints to humans, and trying to find out how they compare. The project starts from human navigation paths, defined by the users playing the Wikispeedia game, to then dwell into the navigation ability of the machine, through semantic-distance-based paths: how differently will the two entities approach this same problem?

Research questions (Update for P3)

Which heuristics are best at predicting the shortest path in the Wikispeedia graph?
Do any of these heuristics outperform humans?
Can we identify any patterns or recurring structures in the human paths (ie. going for a central hub)?
What insights can we draw from the cases where human paths outperform the machine ones? And vice versa?
Is there any way to characterize these differences in performance? Categories, semantic distance, path length, or something else?
How do these compare to the optimal path?

Project Repository Structure

This repository contains various components essential for the project. Below is a detailed description of the organization and content of this repository.

Final Deliverable

final_deliverable.ipynb

Description: This Jupyter notebook contains the main logic and algorithms.

Helper Modules

`data_readers.py`

Purpose: Functions for importing and processing data.

`machine_searchers.py`

Purpose: AI algorithm implementations to find the path between two articles.

`plot_functions.py`

Purpose: Utilities for creating and managing plots and visualizations.

Data Storage

Datasets Folder

Content: Original, unmodified input data from Wikispeedia.

Figures Folder

Content: Figures in file format for web implementation or documentation.

Machine Outputs Folder

Content: Paths and results generated by AI algorithms.

Supplementary Notebooks

`other_notebooks_used` Folder

Content: Additional Jupyter notebooks that provide insights into the thought process behind the data analysis and development of algorithms.

Methods

First of all, we will use NetworkX to create the Wikispeedia graph to work on. For the initial exploratory analysis, along with the degrees of the nodes, we are using PageRank to measure the centrality of the nodes (Fig. 1; Fig. 2). After this initial exploration, where we did not encounter any missing values, we will clean the dataset from outliers we do not need. Finally, to identify trends and differences in the patterns, we will group the human paths by the category of the target node, see their how correlated the categories are with the length of the shortest paths (ANOVA test) and test the differences between the distributions. Other tests will be done, but we first want to get a proper understanding of how humans explore the path.

To compute the semantic distances between each page, we will embed each title into a vector, through the Bert transformers, and we will use the cosine similarity as our main similarity metric. After computing our similarity matrix (whose values will only be between -1 and +1, by definition), we will define the semantic distance as $dist(x, y) = 1 - cos \textunderscore similarity (x, y) + 1$: the first two terms are essential to represent the semantic distance as the opposite of the semantic similarity (the more similar two titles are, the closer they should be in the graph), the second $+1$ is a design choice to work only with values greater than 1 and effectively implement the A* search algorithm.

As mentioned above, we will implement the A* search algorithm which uses an approximation for the node distance, $f(n) = g(n) + h(n)$ assuming it's worse than the actual distance, to explore the graph, where $g(n)$ is the cost from the start node to n, and $h(n)$ is the estimated cost from n to the goal, balancing exact path costs with heuristic estimates to find the shortest paths. We will then find the shortest paths using the semantic distance between the nodes as the weights of the edges constructed between the linked articles, and compare them to the shortest paths provided in the dataset (given that the research of the shortest path is based on heuristics, we can expect significant differences based on the different design choices).

Furthermore, given the multitude of navigation paths that are provided along with the Wikispeedia network, we will be able to compare the human navigation paths with the "semantic-paths": to do so, we will group the paths with the same starting node and target node by "human path is shorter", "semantic path is shorter", "both approaches have the same length". This will be useful to draw our conclusions and ultimately look for patterns within the different performances measured, with targets belonging to the same Wikipedia category sharing the same performance.

More detail on our plan and mathematical details can be found on the GitHub issues page.

Virtual environment

For a smooth collaboration and code execution, we created a virtual environment (included in this GitHub) called as our team name, amonavis. To activate it, run source ./amonavis/bin/activate

Proposed timeline and organisation within the team

For an effective collaboration, we are using GitHub issues to share our to-dos, label each entry by priority and scope and assign the tasks between un. This will help us sharing our progress and mutually support each other's tasks.

Our timeline is defined in 5 weeks, from P2 to P3:

By the end of week 1, we will finish the data cleaning divide us in two groups: one for the main descriptive statistics and testing part (Carlos, Carolina, Daniele) and one for the graph algorithms design and implementation (Nicolas, Sophea);
by the end of week 2, we will answer the first 3 research questions, meaning we will have mastered our way through the dataset provided;
by the end of week 3, we will answer the last 3 research questions, meaning we will have mastered the graphs algorithms and the new data generated and we will be done with our comparisons;
by the end of week 4, we will choose the final plots and elements of the notebook to include in the blog page, for a comprehensive and exhaustive storytelling;
by the end of week 5, we will do our final reviews of the blog and polish the design for an effective publication and presentation.

Questions for TAs:

We incurred in extremely slow computations when evaluating the similarities: any tips to optimize the process?

Figures

hello
Fig. 1: Nodes with above-average PageRank values (importance of the incoming links defines article ranking)

hello
Fig. 2: Nodes with higher degrees (importance in terms of number of connections within the graph)

References

1) “AlphaGo Movie”. AlphaGo Movie. Archived from the original on 3 January 2018. Retrieved 14 October 2017. 2) Erly, A. (2021) The Original Chess Engine: Alan Turing’s Turochamp, Chess.com. Available at: https://www.chess.com/blog/the_real_greco/the-original-chess-engine-alan-turings-turochamp (Accessed: 21 December 2023). 3) Deep Blue (no date) IBM. Available at: https://www.ibm.com/history/deep-blue (Accessed: 21 December 2023). 4) “Wikispeedia: An Online Game for Inferring Semantic Distances between Concepts”. Robert West, Joelle Pineau, and Doina Precup 21st International Joint Conference on Artificial Intelligence (IJCAI), 2009. 5) Potamias, Michalis, et al. “Fast shortest path distance estimation in large networks.” Proceedings of the 18th ACM conference on Information and knowledge management. 2009. 6) https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

epfl-ada / ada-2023-project-amonavis

readme

The Man and the Machines

Data story (Update for P3)

Abstract

Research questions (Update for P3)

Project Repository Structure

Final Deliverable

Helper Modules

`data_readers.py`

`machine_searchers.py`

`plot_functions.py`

Data Storage

Datasets Folder

Figures Folder

Machine Outputs Folder

Supplementary Notebooks

`other_notebooks_used` Folder

Methods

Virtual environment

Proposed timeline and organisation within the team

Questions for TAs:

Figures

References

epfl-ada / ada-2023-project-amonavis

readme

The Man and the Machines

Data story (Update for P3)

Abstract

Research questions (Update for P3)

Project Repository Structure

Final Deliverable

Helper Modules

data_readers.py

machine_searchers.py

plot_functions.py

Data Storage

Datasets Folder

Figures Folder

Machine Outputs Folder

Supplementary Notebooks

other_notebooks_used Folder

Methods

Virtual environment

Proposed timeline and organisation within the team

Questions for TAs:

Figures

References

`data_readers.py`

`machine_searchers.py`

`plot_functions.py`

`other_notebooks_used` Folder