
Software development project ("Projet de Développement Logiciel") "Wikipédia Matrix", whose goal is to extract tables in CSV format from Wikipedia pages.

Wikipedia Matrix: The Truth

Quick illustration of the project

The aim of this PDL project is to extract tables in CSV format from Wikipedia pages. These pages can be analyzed in two different ways:

- By searching for the corresponding Wikitext code
- By exploiting the HTML rendering of the Wikipedia page

Both approaches are compared and tested, in order to check that they produce the same CSV output.
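To give a rough idea of the second approach, here is a minimal sketch of HTML-based extraction using jsoup; the library choice, the URL, and the class name are illustrative assumptions, not necessarily what this project uses.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HtmlTableSketch {
    public static void main(String[] args) throws Exception {
        // Fetch the rendered HTML of a Wikipedia page (example URL).
        Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Comparison_of_programming_languages").get();
        // Each "wikitable" element corresponds to one table on the page.
        for (Element table : doc.select("table.wikitable")) {
            StringBuilder csv = new StringBuilder();
            for (Element row : table.select("tr")) {
                StringBuilder line = new StringBuilder();
                for (Element cell : row.select("th, td")) {
                    if (line.length() > 0) line.append(',');
                    // Quote each cell so commas inside the text do not break the CSV.
                    line.append('"').append(cell.text().replace("\"", "\"\"")).append('"');
                }
                csv.append(line).append('\n');
            }
            System.out.println(csv);
        }
    }
}

Note that this sketch ignores rowspan and colspan attributes; handling such cases is exactly the kind of difficulty that motivates this project.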

But why extract tables from Wikipedia?

Wikipedia tables are difficult to exploit with statistical tools, visualization software, or any other tool that processes tables (e.g., Excel, OpenOffice, RStudio, Jupyter). These tables are written in a syntax (Wikitext) that is difficult to analyze and was not necessarily designed for specifying tables. In addition, there is strong heterogeneity in the way tables are written, which further complicates the processing of Wikipedia's tabular data. The same can be said for the HTML format.
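For illustration, even a simple two-column table already uses Wikitext's dedicated table markup, which bears little resemblance to the data it encodes:

{| class="wikitable"
! Header 1 !! Header 2
|-
| Cell 1 || Cell 2
|}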

Why CSV (comma-separated values)?

It is very simple and, above all, supported by many tools.

This project is about implementing a solution, specifying a ground truth, and thus evaluating different extractors by confronting them with that ground truth. The solution must also be able to extract several tables from the same Wikipedia page.

Last but not least, this project will propose a set of tools able to analyze the results of the extractors and thus specify a set of expected results (which will then be used during the automatic test phase). Among these tools, one will allow the user to visualize a matrix (produced by an automatic extractor), possibly correct it, and then export it in CSV format.

Finally, a more global test suite will demonstrate the quality of our tool.

Final results

There will be three concrete results:

- Extractors of much better quality (with source code, documentation, test suite, continuous integration, etc.)
- A suite of tools to more easily specify a ground truth and thus help with the evaluation of extractors
- A dataset reusable by anyone wanting to test a table extractor

This is a Master of Business Informatics project that improves on the latest version of "Wikipedia Matrix" (the current project was forked from it; cf. https://github.com/mathieulehan/PDL_2018-2019_GR4).

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See "Parsing wikitables" to start parsing on a live system.

After cloning the project to your computer, either open it in your IDE or simply run the Maven tests. This will run the test classes and then parse tables from more than 300 Wikipedia URLs.
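For example, from the project root (assuming Maven is installed and on your PATH):

mvn test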

You can find a demo right here.

Prerequisites

An IDE and Maven. JDK 8 is required to run the Maven tests.

IntelliJ - https://www.jetbrains.com/idea/
Maven - https://maven.apache.org

Installing

How to install it?

Clone it from Git to your computer by running the following command in your terminal:

git clone https://github.com/Qt-tracker/PDL_2019-2020_GR5.git

Don't forget to convert the project to a Maven project in your IDE.

You are done!

You can find more details in INSTALL.md.

Folders' structure

Folders:

Running the tests

In IntelliJ: right-click on the project, then choose Run 'All Tests'. To run the test with the 336 URLs, run the file BenchTest.java.
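Alternatively, a single test class can be selected from the command line through Maven Surefire (assuming the default Surefire configuration):

mvn -Dtest=BenchTest test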

Parsing wikitables

To extract tables from HTML, Wikitext, or both:

In IntelliJ: click the green Run button next to the main class.

You can choose which kind of extraction to run.

Run the class WikiExtractMain, then type:
- W to parse files from Wikitext to CSV
- H to parse files from HTML to CSV
- X to parse files from both Wikitext and HTML
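For reference, the dispatch could look roughly like the following sketch; the actual WikiExtractMain may be organized differently, and the two helper methods are hypothetical placeholders.

import java.util.Scanner;

public class ExtractDispatchSketch {
    public static void main(String[] args) {
        System.out.println("Type W (Wikitext), H (HTML) or X (both):");
        String choice = new Scanner(System.in).nextLine().trim().toUpperCase();
        switch (choice) {
            case "W":
                extractFromWikitext();
                break;
            case "H":
                extractFromHtml();
                break;
            case "X":
                extractFromWikitext();
                extractFromHtml();
                break;
            default:
                System.err.println("Unknown option: " + choice);
        }
    }

    // Hypothetical helpers: the real extractor classes and methods may be named differently.
    private static void extractFromWikitext() { /* parse Wikitext, write CSV */ }
    private static void extractFromHtml() { /* parse HTML, write CSV */ }
}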

Supported and unsupported features (current state)

Extraction via Wikitext does not work very well, especially table validation. The JSON format causes trouble when extracting a table. Via HTML, however, we do not encounter these problems.
If there is a table nested inside another table, the resulting CSV is not valid. Moreover, extraction via Wikitext often does not give the same result as extraction via HTML.

Some smaller problems have also been found:

- When an invalid URL is given, an error pops up without specifying which URL/title caused the trouble.
- When a page does not contain any table, this is not clearly reported.
- There is currently no method to check whether the generated CSV is valid, so such a method is being considered. A method that compares two CSVs is also under consideration (see the sketch below).
- The user cannot choose the URL; a default URL is set in the main class for extraction.
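A minimal sketch of such a CSV comparison might look like the following; the file paths are hypothetical, and a real implementation would likely need to normalize whitespace and cell quoting before comparing.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class CsvCompareSketch {
    // Returns true when the two CSV files contain exactly the same lines in the same order.
    public static boolean sameCsv(Path a, Path b) throws IOException {
        List<String> linesA = Files.readAllLines(a);
        List<String> linesB = Files.readAllLines(b);
        return linesA.equals(linesB);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical output locations for the two extractors.
        boolean same = sameCsv(Paths.get("output/wikitext/table1.csv"),
                               Paths.get("output/html/table1.csv"));
        System.out.println(same ? "CSV outputs match" : "CSV outputs differ");
    }
}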

Future features

- Choice of the URL by the user
- Another, more performant extractor in another language

Built With

Authors

Project Context

This module takes place at the University of Rennes 1, ISTIC, in Master 1 (MIAGE). The objective of PDL is to carry out a software project with open technologies and data. There are many challenges to overcome, requiring skills in project management, modeling, and programming. This scenario should make it possible to better understand the difficulty of developing software in an extremely concrete context. Software development techniques and tools well known in industry (Git, GitHub, Maven, JUnit, etc.) will be used. Technological choices will also have to be made.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.