The aim of this PDL project is to extract tables in CSV format from Wikipedia pages. Those pages can be analyzed in two different ways:
- by searching for the corresponding Wikitext code
- by exploiting the HTML rendering of the Wikipedia page
Both approaches are compared and tested, in order to obtain the same CSV output from each.
Wikipedia tables are difficult to exploit with statistical or visualization tools, or with any tool able to process tables (e.g., Excel, OpenOffice, RStudio, Jupyter). These tables are written in a syntax (Wikitext) that is hard to analyze and not necessarily designed for specifying tables. In addition, there is a strong heterogeneity in the way tables are written, further complicating the processing of Wikipedia's tabular data. The same can be said for the HTML format.
CSV, by contrast, is very simple and, above all, supported by many tools.
This project is about implementing a solution, specifying a ground truth, and thus evaluating different extractors by confronting them with that ground truth. The solution must also be able to extract several tables from the same Wikipedia page.
Last but not least, this project will propose a set of tools to analyze the results of the extractors and thus specify a set of expected results (which will then be used during the automatic test phase). One of these tools will make it possible to visualize a matrix (resulting from an automatic extractor), optionally correct it, and then export it in CSV format.
Finally, a more comprehensive test suite will demonstrate the quality of our tool.
There will be three concrete results:
- extractors of much better quality (with source code, documentation, test suite, continuous integration, etc.)
- a suite of tools to more easily specify a ground truth and thus help the evaluation of extractors
- a dataset reusable by anyone wanting to test a table extractor
This is a Master of Business Informatics project that improves on its previous version, called "Wikipedia Matrix" (the current project was forked from it)
(cf. https://github.com/mathieulehan/PDL_2018-2019_GR4).
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See "Parsing wikitables" to start the parsing on a live system.
After cloning the project onto your computer, either open it in your IDE or simply run the Maven tests. This runs the test classes and then parses tables from more than 300 Wikipedia URLs.
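For example, from the project root (assuming Maven and JDK 8 are installed and on your PATH):

mvn test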
You can find a demo right here.
An IDE and Maven are needed, plus JDK 8 to run the Maven tests.
IntelliJ - https://www.jetbrains.com/idea/
Maven - https://maven.apache.org
How to install it?
Clone the repository onto your computer by running the following command in your terminal:
git clone https://github.com/Qt-tracker/PDL_2019-2020_GR5.git
Don't forget to convert the project to a Maven project.
You are done!
You can find more details in INSTALL.md.
Folders:
- the root contains some files
- /output contains two folders, /HTML and /wikitext, which will contain the parsed Wikipedia tables, plus one file, url_file.txt, containing the 336 URLs to be parsed
- the /src folder contains three folders
In IntelliJ: right-click on your project, then choose Run 'All Tests'. To run the test with the 336 URLs, choose the file BenchTest.java.
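From a terminal, the same class can usually be selected through Maven's Surefire test filter (assuming the default Surefire configuration):

mvn -Dtest=BenchTest test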
To extract from HTML, WIKITEXT, or both:
In IntelliJ: double-click the green play button on the right.
You can choose which extraction to run: run the class WikiExtractMain, then type:
- W to parse from WIKITEXT to CSV
- H to parse from HTML to CSV
- X to parse from both WIKITEXT and HTML
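As a rough illustration of that W/H/X choice, here is a minimal sketch of how such a dispatch could look; the class and method names below are illustrative only and are not the project's actual WikiExtractMain code.

import java.util.Scanner;

public class ExtractLauncherSketch {
    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        System.out.print("Type W, H or X: ");
        String choice = in.nextLine().trim().toUpperCase();
        switch (choice) {
            case "W":
                extractWikitext();   // Wikitext -> CSV
                break;
            case "H":
                extractHtml();       // HTML -> CSV
                break;
            case "X":
                extractWikitext();   // both extractions
                extractHtml();
                break;
            default:
                System.out.println("Unknown option: " + choice);
        }
    }

    // Placeholders: the real extractors write CSV files under /output.
    private static void extractWikitext() { /* ... */ }
    private static void extractHtml()     { /* ... */ }
}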
Extraction via Wikitext does not work very well, especially table checking; the JSON format causes trouble when extracting a table. Via HTML, however, we do not encounter any problems.
If a table is nested inside another table, the resulting CSV is not valid.
Moreover, extraction via Wikitext and extraction via HTML often do not give the same result.
A few smaller problems have also been found: when an invalid URL is given, an error pops up without specifying which URL/title is causing trouble, and when a page does not contain any table this is not clearly reported. So far there is no method to check whether a CSV is correct, so such a method is being considered; a method that compares two CSVs is also under consideration. Finally, the user cannot choose the URL: a default URL is set in the main class for extraction.
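A minimal sketch of such a CSV comparison, assuming a strict line-by-line equality check is enough as a first step; the class name and file paths below are illustrative only.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CsvComparatorSketch {

    // Returns true when both CSV files contain exactly the same lines in the same order.
    public static boolean sameCsv(Path a, Path b) throws IOException {
        return Files.readAllLines(a).equals(Files.readAllLines(b));
    }

    public static void main(String[] args) throws IOException {
        Path htmlCsv = Paths.get("output/HTML/example.csv");          // illustrative paths
        Path wikitextCsv = Paths.get("output/wikitext/example.csv");
        System.out.println(sameCsv(htmlCsv, wikitextCsv) ? "CSV files match" : "CSV files differ");
    }
}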
- choice of the URL by the user
- another, more performant extractor in another language
IntelliJ - The IDE used
Maven - Dependency Management
JUnit - Used to test
Mockito - Mocking framework
jsoup - Java HTML parser (see the sketch after this list)
Apache Commons - Reusable Java components
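To give an idea of the HTML side, here is a minimal jsoup sketch that turns Wikipedia tables into CSV rows. The URL is just an example, and rowspan/colspan and nested tables (the cases mentioned above) are deliberately ignored, so this is a sketch rather than the project's actual extractor.

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupTableSketch {
    public static void main(String[] args) throws IOException {
        // Fetch the rendered page (example URL).
        Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Comparison_of_web_browsers").get();
        for (Element table : doc.select("table.wikitable")) {
            for (Element row : table.select("tr")) {
                StringBuilder csvLine = new StringBuilder();
                for (Element cell : row.select("th, td")) {
                    if (csvLine.length() > 0) csvLine.append(',');
                    // Quote each cell so embedded commas and quotes do not break the CSV.
                    csvLine.append('"').append(cell.text().replace("\"", "\"\"")).append('"');
                }
                System.out.println(csvLine);
            }
            System.out.println(); // blank line between tables
        }
    }
}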
prototype: the latest prototype built to test the concept
V1: in this version, produced by last year's group, most HTML tables are parsed successfully. The project structure is not a Maven one, so we could not run the Maven tests. Also, in this version, URLs were parsed one at a time.
V2: this version supports the Maven test command and has a simple UI, made by the earlier group, allowing interaction with the user.
master: the latest, stable version of the project.
develop: our branch, used to test changes before committing them to master.
Koitrin KOFFI - Whole project - Koitrin Koffi
William ZOUNON - Whole project - William Zounon
Laeba TALAT - Whole project - Laeba Talat
Yves KOUASSI - Whole project - Yves Kouassi
Nguyen-Anh CU - Whole project - Nguyen-Anh CU
Lassana MAKADJI - Whole project - Lassana_Makadji
Rahima KONE - Whole project - Rahima_Koné
Mariem ROUISSI - Whole project - Mariem_Rouissi
Rebecca EHUA - Whole project - Rebecca_Ehua
This module takes place at the University of Rennes 1, ISTIC, in Master 1 (MIAGE). The objective of PDL is to carry out a software project with open technologies and data. There are many challenges to overcome, requiring skills in project management, modeling, and programming. This scenario should make it possible to better understand the difficulty of developing software in an extremely concrete context. Software development techniques and tools (git, GitHub, Maven, JUnit, etc.) well known to the industry will be used. Technological choices will also have to be made.
This project is licensed under the MIT License - see the LICENSE.md file for details