disheng / DEXTER

Dataset collected by DEXTER: Large-Scale Discovery and Extraction of Product Specifications on the Web
GNU General Public License v2.0

DEXTER

DEXTER is a research project designed to discover and extract product specifications on the Web.

This repository provides information on how to access the DEXTER dataset described in the VLDB 2015 research paper:

DEXTER: Large-Scale Discovery and Extraction of Product Specifications on the Web

In this repository you can find the output dataset generated by DEXTER. The dataset is organized as follows:

Specification Output XML:

Under output-xml we provide a dump of the specifications of the discovered products. Each file is a compressed (.7z) archive that contains an XML dump of all the products discovered for a specific category.

The XML dump follows this structure:

<products>
    <product>
        <site>www.amazon.com</site>
        <category>camera</category>
        <url>http://www.amazon.com/...</url>
        <attribute_1>value_attribute_1</attribute_1>
        ...
    </product>
    <product>
        ...
    </product>
    ...
</products>

We have added three attributes to each product: the URL from which the specification was extracted, the category associated with the page, and the website.
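
As an illustration of how such a dump could be read, here is a minimal Python sketch that streams the XML and separates the three added attributes (site, category, url) from the extracted ones. It assumes the .7z archive has already been extracted with an external tool, since the Python standard library does not read .7z files. The sample document, the example.com URL and the helper name iter_products are hypothetical and only mirror the structure shown above.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the dump structure; real dumps are much larger.
SAMPLE = b"""<products>
    <product>
        <site>www.example.com</site>
        <category>camera</category>
        <url>http://www.example.com/p1</url>
        <attribute_1>value_attribute_1</attribute_1>
    </product>
</products>"""

# The three attributes added to every product.
METADATA_TAGS = {"site", "category", "url"}


def iter_products(stream):
    """Yield one (metadata, attributes) pair of dicts per <product> element."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "product":
            continue
        metadata, attributes = {}, {}
        for child in elem:
            target = metadata if child.tag in METADATA_TAGS else attributes
            target[child.tag] = (child.text or "").strip()
        yield metadata, attributes
        elem.clear()  # drop the processed children to limit memory use


products = list(iter_products(io.BytesIO(SAMPLE)))
```

For a real file, the same function could be called with a file object opened in binary mode instead of the in-memory sample.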

Dataset

The dataset contains the HTML pages collected by our focused crawler. It is organised under the bucket dexter-pages as follows:

  1. data
  2. dexter_sources
  3. dataset_local_categories.json

Data

Under /data/*

The folder is organised into subfolders, one for each crawled website. Pages of a given website are stored as .gz files named with an incremental file number (<number>.txt.gz), and the mapping between each dumped file and its original URL is stored in an index.txt file.

Each line of the index.txt file stores a tab-separated pair: the dumped file name (<number>.txt) and the original URL.

An example is:

1.txt   http://www.sample_website.com/productAAAA
2.txt   http://www.sample_website.com/productBBBB
3.txt   http://www.sample_website.com/productCCCC
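
The lookup from a dumped file back to its URL can be sketched in Python as follows. This is a minimal sketch, not part of DEXTER: it assumes that an index entry such as 1.txt corresponds to a file 1.txt.gz in the same subfolder and that pages can be decoded as UTF-8 (undecodable bytes are replaced). The helper names load_index and read_page are hypothetical, and the demo builds a throwaway directory instead of touching the real bucket.

```python
import gzip
import tempfile
from pathlib import Path


def load_index(site_dir):
    """Map each dumped file name (e.g. '1.txt') to its original URL."""
    mapping = {}
    with open(Path(site_dir) / "index.txt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            name, url = line.split("\t", 1)  # tab-separated pair
            mapping[name] = url
    return mapping


def read_page(site_dir, name):
    """Return the HTML of one page, assumed to be stored as '<name>.gz'."""
    path = Path(site_dir) / (name + ".gz")
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return handle.read()


# Self-contained demo on a temporary website subfolder.
with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp)
    (site / "index.txt").write_text(
        "1.txt\thttp://www.sample_website.com/productAAAA\n", encoding="utf-8"
    )
    with gzip.open(site / "1.txt.gz", "wt", encoding="utf-8") as handle:
        handle.write("<html>demo</html>")
    index = load_index(site)
    html = read_page(site, "1.txt")
```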


Dexter Sources

Under /dexter_sources/*

We also provide the output of the DEXTER classification. Page URLs are grouped into sources (<category, website> pairs); the folder contains a single JSON file for each source classified by DEXTER.

Each file name is formed from two parts joined by an underscore, followed by the .json extension.

Each file contains, for each website, a map with the following information:

  1. "\": list of page URLs
  2. "entry_page": list of category entry pages
  3. "pages_number": number of pages
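
A source file could be summarised in Python along these lines. This is only a sketch under stated assumptions: it assumes the top level of the JSON is keyed by website, and it reads only the "entry_page" and "pages_number" fields named above. The sample content, including the stand-in key "page_urls" for the URL list, and the helper name summarize_source are hypothetical.

```python
import json

# Hypothetical file content. "page_urls" is a stand-in key, not the real one.
SAMPLE = json.dumps({
    "www.sample_website.com": {
        "page_urls": ["http://www.sample_website.com/productAAAA"],
        "entry_page": ["http://www.sample_website.com/cameras"],
        "pages_number": 1,
    }
})


def summarize_source(text):
    """Return {website: (pages_number, entry_pages)} for one source file."""
    summary = {}
    # Assumption: the top-level object maps each website to its info map.
    for website, info in json.loads(text).items():
        summary[website] = (info["pages_number"], list(info["entry_page"]))
    return summary


summary = summarize_source(SAMPLE)
```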


Dataset Local Categories

In dataset_local_categories.json

We present the local categories crawled directly from the discovered websites. The file is a nested JSON object organised as follows:

{
    "site1": {
        "category_1": [
            url1,
            url2,
            ...
        ]
        ...
    }, 
    "site2": {
    ...
    }
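
A nested layout of this kind (site, then category, then list of URLs) can be flattened into rows with a short Python sketch. The sample data reuses the placeholder names from the structure above, and the helper name flatten_categories is hypothetical. For the real file, the text would come from reading dataset_local_categories.json.

```python
import json

# Placeholder data following the site -> category -> URL list nesting.
SAMPLE = json.dumps({
    "site1": {"category_1": ["url1", "url2"]},
    "site2": {"category_2": ["url3"]},
})


def flatten_categories(text):
    """Return a list of (site, category, url) triples from the nested JSON."""
    return [
        (site, category, url)
        for site, categories in json.loads(text).items()
        for category, urls in categories.items()
        for url in urls
    ]


rows = flatten_categories(SAMPLE)
```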