MegaScenes Dataset v1.0

Paper | Arxiv | Project Page

The MegaScenes Dataset is an extensive collection of around 430K scenes and 9M images and epipolar geometries, featuring over 100K structure-from-motion reconstructions from 2M of these images. The images of these scenes are captured under varying conditions, including different times of day, various weather and illumination, and from different devices with distinct camera intrinsics.

To view reconstructions in the browser, see our Web Viewer!

Data Access

The MegaScenes Dataset is hosted on Amazon S3 thanks to the AWS Open Data Sponsorship Program.

Specifically, MegaScenes uses the AWS S3 bucket URL s3://megascenes/ in the US-West-2 AWS Region.

Dataset Access via Command-line Interface (recommended)

Users can access the dataset using s5cmd or AWS CLI. These are locally installed command-line interfaces that can access datasets on AWS. The two CLIs have very similar commands, so an s5cmd command can typically be converted to an AWS CLI command by replacing the prefix s5cmd with aws s3.

In this section, we will share some s5cmd commands.

Downloading MegaScenes

Copy a file or a directory locally: s5cmd --no-sign-request cp <source_bucket_url> <local_dest>

Alternatively, sync can be used instead of cp. sync additionally compares the bucket with the local copy and transfers only the files that differ.

Example 1: Downloading a specific directory locally

If the source URL is a directory, then it must have a wildcard (*).

This command recursively downloads the contents of the images folder from AWS into the local folder MegaScenes/images/:

s5cmd --no-sign-request cp s3://megascenes/images/* ./MegaScenes/images/

Example 2: Downloading a single file locally

This command downloads a specific database.db file from AWS into its respective local folder:

s5cmd --no-sign-request cp s3://megascenes/databases/main/000/000/database.db ./MegaScenes/databases/main/000/000/database.db
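To fetch one scene at a time, the cp command above can be assembled programmatically. The following is a minimal sketch, assuming the two-level scene folder layout described in Scene Folders; the helper names scene_prefix and s5cmd_download_command are hypothetical and not part of the dataset tooling.

```python
import shlex


def scene_prefix(scene_id):
    """Return the two-level scene folder, e.g. 1234 -> '001/234'."""
    if not 0 <= scene_id <= 999_999:
        raise ValueError("scene IDs are zero-padded to six digits")
    padded = f"{scene_id:06d}"
    return f"{padded[:3]}/{padded[3:]}"


def s5cmd_download_command(subdir, scene_id, local_root="./MegaScenes"):
    """Build the argument list for downloading one scene folder.

    `subdir` is a top-level directory organized by scene ID, such as "images".
    The trailing wildcard is required because the source is a directory.
    """
    prefix = scene_prefix(scene_id)
    source = f"s3://megascenes/{subdir}/{prefix}/*"
    dest = f"{local_root}/{subdir}/{prefix}/"
    return ["s5cmd", "--no-sign-request", "cp", source, dest]


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(cmd, check=True) to perform the download.
    print(shlex.join(s5cmd_download_command("images", 1234)))
```

Returning an argument list rather than a single string keeps the wildcard from being expanded by the local shell when the command is run through subprocess.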

Listing MegaScenes directory contents

List directory contents: s5cmd --no-sign-request ls <bucket_url>

Example

Input

s5cmd --no-sign-request ls s3://megascenes/databases/

Output

                                  DIR  descriptors/
                                  DIR  main/

Other Notes

The --no-sign-request flag lets users access the AWS bucket without creating or supplying AWS credentials.

Dataset Access via HTTP

Individual files can be downloaded over HTTP (via wget or curl) using the base URL https://megascenes.s3.us-west-2.amazonaws.com/.

For instance, https://megascenes.s3.us-west-2.amazonaws.com/metadata/subcat/000/007/subcats.json is a direct download for the subcategory information for scene-ID 7.
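The same download can be scripted. Below is a minimal sketch using only the Python standard library; the function names http_url and download and the local folder name are illustrative assumptions, while the base URL is the one given above.

```python
import urllib.parse
import urllib.request
from pathlib import Path

BASE_URL = "https://megascenes.s3.us-west-2.amazonaws.com/"


def http_url(key):
    """Map a bucket key such as 'metadata/categories.json' to its HTTP URL."""
    # quote() keeps '/' as-is and percent-encodes characters unsafe in URLs.
    return BASE_URL + urllib.parse.quote(key.lstrip("/"))


def download(key, local_root="MegaScenes"):
    """Download one file, mirroring the bucket layout under local_root."""
    dest = Path(local_root) / key.lstrip("/")
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(http_url(key), str(dest))
    return dest


if __name__ == "__main__":
    print(http_url("metadata/subcat/000/007/subcats.json"))
```

Calling download("metadata/subcat/000/007/subcats.json") would save the file for scene-ID 7 under MegaScenes/metadata/subcat/000/007/. For whole directories, the command-line tools remain the better fit, since HTTP access works one file at a time.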

Dataset Layout

The bucket's directory tree is as follows:

A scene is represented by its zero-padded six-digit scene-ID number as described in Scene Folders in applicable subdirectories. A directory that links scene name to scene-ID can be found at: s3://megascenes/metadata/categories.json. For details on subfolder contents, see the respective sections below.

databases/ Directory

This directory houses COLMAP databases for each scene. COLMAP databases contain tabulated information on images, keypoints, descriptors, matches, and estimated two-view geometries. COLMAP databases use the SQLite format.
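Because the databases are plain SQLite files, they can be inspected without COLMAP. The sketch below uses Python's built-in sqlite3 module to list the tables in a downloaded file and count rows in one of them. It makes no assumption about which tables a given file contains; the function names are illustrative.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def _connect_read_only(db_path):
    """Open the database read-only so a downloaded file is never modified."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def list_tables(db_path):
    """Return the names of all tables in the SQLite file, sorted by name."""
    with closing(_connect_read_only(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


def count_rows(db_path, table):
    """Return the number of rows in `table`, which must exist in the file."""
    if table not in list_tables(db_path):
        raise ValueError(f"no such table: {table}")
    with closing(_connect_read_only(db_path)) as conn:
        (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    return count
```

For example, list_tables("MegaScenes/databases/main/000/000/database.db") would show which tables that file carries once it has been downloaded.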

The databases/ directory is split into two subdirectories:

In the two above subdirectories, a scene is represented by its scene-ID number as described in Scene Folders.

Partitioned Databases

For each scene, the COLMAP database is partitioned into two files:

We separate the Descriptors table because it takes up most of the space in a COLMAP database and is not needed by some applications.

Example

For a scene with ID 1234, the database files are as follows:

images/ Directory

This directory houses images and image metadata for each scene. A scene is represented by its scene-ID number as described in Scene Folders.

The images/ directory is 3.2 TB.

JSON Contents

A scene can have any number of subcategories. Each subcategory contains images, a raw_metadata.json, a category.json, and a 0/category.json.

Image metadata is stored in raw_metadata.json. This JSON file has one key per image name, and each entry holds data extracted from Wikimedia Commons, including EXIF data and licensing information.

The scene subcategory name resides in subcategory_name/category.json.

A list of image names resides in subcategory_name/0/category.json.
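As a small illustration of the raw_metadata.json structure, the sketch below loads the file and returns the image names it covers. It assumes only what is stated above, namely that the top level is a JSON object keyed by image name; the fields inside each entry are not assumed, and the function name is hypothetical.

```python
import json


def image_names(raw_metadata_path):
    """Return the sorted image names found as keys in a raw_metadata.json."""
    with open(raw_metadata_path, encoding="utf-8") as f:
        metadata = json.load(f)
    if not isinstance(metadata, dict):
        raise ValueError("expected a JSON object keyed by image name")
    return sorted(metadata)
```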

Example

For a scene with ID 1234, the image files are as follows:

metadata/ Directory

This directory houses metadata for the dataset.

The metadata/ directory has the following contents:

Subcategory Information

Subcategory information resides in the metadata/subcat/ directory. This directory is organized by scene-ID number as described in Scene Folders.

A scene is present in metadata/subcat/ only if it has at least one category besides the main category. Such a scene will have a subcats.json to represent the subcategory data.

A subcats.json file is a dictionary that contains the following fields:

Example

The category Arco degli Argentari has a scene-ID of 7. The subcategory information for this scene is at s3://megascenes/metadata/subcat/000/007/subcats.json, and has the following contents:

{
    "main_category": "Arco_degli_Argentari",
    "graph": {
        "Arco_degli_Argentari": [
            "Arco_degli_Argentari_in_art",
            "Historical_images_of_the_Arco_degli_Argentari"
        ],
        "Arco_degli_Argentari_in_art": [],
        "Historical_images_of_the_Arco_degli_Argentari": [
            "Arco_degli_Argentari_in_art"
        ]
    },
    "frontier": []
}

Here, the graph shows that the main category Arco degli Argentari has two subcategories: Arco degli Argentari in art and Historical images of the Arco degli Argentari. The category Arco degli Argentari in art has no subcategories, hence the empty list. In contrast, the category Historical images of the Arco degli Argentari has the subcategory Arco degli Argentari in art.

The frontier list is empty, meaning that this subcategory graph is expanded in its entirety.
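The graph can be walked to collect every category reachable from the main category. The following sketch relies only on the three fields visible in the example (main_category, graph, frontier); the function names are illustrative. A visited set guards against categories that are reachable along more than one path, as Arco_degli_Argentari_in_art is here.

```python
def all_subcategories(subcats):
    """Return every category reachable from the main category, excluding it."""
    graph = subcats["graph"]
    start = subcats["main_category"]
    seen = set()
    stack = list(graph.get(start, []))
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # already visited via another parent
        seen.add(name)
        stack.extend(graph.get(name, []))
    seen.discard(start)  # in case the graph loops back to the main category
    return seen


def is_fully_expanded(subcats):
    """True when the frontier is empty, i.e. the graph is fully expanded."""
    return len(subcats["frontier"]) == 0


example = {
    "main_category": "Arco_degli_Argentari",
    "graph": {
        "Arco_degli_Argentari": [
            "Arco_degli_Argentari_in_art",
            "Historical_images_of_the_Arco_degli_Argentari",
        ],
        "Arco_degli_Argentari_in_art": [],
        "Historical_images_of_the_Arco_degli_Argentari": [
            "Arco_degli_Argentari_in_art"
        ],
    },
    "frontier": [],
}
```

With the example above, all_subcategories(example) yields the two subcategories and is_fully_expanded(example) is true.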

Wikidata Entries

The wikidata/ subdirectory is organized by Wikidata Q-ID. The first three digits of the Q-ID name three nested subfolders, one digit per level, that hold the Wikidata JSON file. If the Q-ID has fewer than three digits, its JSON resides in the other/ folder. Unlike the scene IDs, this number is NOT zero-padded.

Examples

The JSON for a Wikidata item with Q-ID Q1234 is located at metadata/wikidata/1/2/3/Q1234.json.

The JSON for a Wikidata item with Q-ID Q12 is located at metadata/wikidata/other/Q12.json.
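The rule can be expressed as a short function. This is a sketch of the mapping as described above, with a hypothetical function name; it returns the key relative to the bucket root.

```python
def wikidata_json_key(qid):
    """Map a Q-ID such as 'Q1234' to its JSON location under metadata/wikidata/."""
    digits = qid[1:]
    if not (qid.startswith("Q") and digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a Wikidata Q-ID: {qid!r}")
    if len(digits) < 3:
        # Fewer than three digits: no zero-padding, the file goes in other/.
        return f"metadata/wikidata/other/{qid}.json"
    first, second, third = digits[:3]
    return f"metadata/wikidata/{first}/{second}/{third}/{qid}.json"
```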

Resources

For JSON documentation, see this page on Wikibase JSON.

For additional tools to parse this JSON, see this Wikidata page on Data access.

reconstruct/ Directory

This directory contains the COLMAP sparse point cloud reconstructions for each scene. The reconstruct/ directory is organized by scenes, according to a scene-ID number as described in Scene Folders. Each reconstruction consists of an images.bin, cameras.bin, and points3D.bin as described here. A scene may have zero or more reconstructions; the reconstruct/ folder only contains scenes with one or more.

The reconstruct/ folder is 429 GB.

Example

Suppose a scene with ID 1234 has three reconstructions. In this scene's sparses/ folder, there will be three folders numbered from 0 to 2.

Specifically, the format is as follows:

Scene Folders

The dataset uses a system of two subfolders to divide scenes; each scene has a scene-ID number. The first subfolder uses the first three digits of the 6-digit zero-padded scene ID. The second subfolder uses the last three digits. The data associated with the scene resides in the latter subfolder.

For example:
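The mapping between a scene ID and its folder can be sketched in a few lines. The function names below are illustrative; the logic follows the rule above and agrees with paths used elsewhere in this document, such as metadata/subcat/000/007/ for scene-ID 7.

```python
def scene_folder(scene_id):
    """Return the two-level folder for a scene ID, e.g. 7 -> '000/007'."""
    if not 0 <= scene_id <= 999_999:
        raise ValueError("scene IDs must fit in six digits")
    padded = f"{scene_id:06d}"  # zero-pad to six digits
    return f"{padded[:3]}/{padded[3:]}"


def scene_id_from_folder(folder):
    """Inverse of scene_folder, e.g. '001/234' -> 1234."""
    first, last = folder.strip("/").split("/")
    if len(first) != 3 or len(last) != 3:
        raise ValueError(f"expected 'XXX/YYY', got {folder!r}")
    return int(first + last)
```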

Each scene is based on a category from Wikimedia Commons. For instance, the scene "Arc_de_Triomphe_de_l'Étoile" uses images from Category:Arc de Triomphe de l'Étoile and its subcategories. MegaScenes uses underscores instead of spaces in scene names, but the two are interchangeable in Wikimedia Commons URLs.

The file s3://megascenes/metadata/categories.json (HTTP Link) links the category name to the scene-ID.
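A lookup by scene name might look like the sketch below. It assumes that categories.json is a flat JSON object mapping each category name (with underscores) to its scene-ID; that structure is an assumption, not something this document specifies, so check the file before relying on it. The function names are hypothetical.

```python
import json


def normalize_name(name):
    """Convert a Commons-style name with spaces to the underscore form."""
    return name.strip().replace(" ", "_")


def load_categories(path):
    """Load a local copy of categories.json (assumed: name -> scene-ID)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def lookup_scene_id(categories, name):
    """Return the scene-ID for a category name, or None if it is absent."""
    return categories.get(normalize_name(name))
```

For instance, with a mapping that contains "Arco_degli_Argentari": 7, both lookup_scene_id(categories, "Arco degli Argentari") and the underscore form would return 7.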

Contributions, Issues, and Suggestions

We are continually looking for ways to improve the dataset. If you find any incorrect reconstructions, please create a GitHub issue here or discussion post here.

License

This dataset is licensed under the Creative Commons Attribution 4.0 International License. The photos in the images/ folder have their own licenses.