MegaScenes Dataset v1.0

Paper | Arxiv | NVS Code | Project Page

The MegaScenes Dataset is an extensive collection of around 430K scenes and 9M images and epipolar geometries, featuring over 100K structure-from-motion reconstructions from 2M of these images. The images of these scenes are captured under varying conditions, including different times of day, various weather and illumination, and from different devices with distinct camera intrinsics.

To view reconstructions in the browser, see our Web Viewer!

We provide a datasheet for MegaScenes here.

If you find our dataset or paper useful, please consider citing

@inproceedings{
      tung2024megascenes,
      title={MegaScenes: Scene-Level View Synthesis at Scale}, 
      author={Tung, Joseph and Chou, Gene and Cai, Ruojin and Yang, Guandao and Zhang, Kai and Wetzstein, Gordon and Hariharan, Bharath and Snavely, Noah},
      booktitle={ECCV},
      year={2024}
    }

Data Access

The MegaScenes Dataset is hosted on Amazon S3 thanks to the AWS Open Data Sponsorship Program.

Specifically, MegaScenes uses the AWS S3 bucket URL s3://megascenes/ in the US-West-2 AWS Region.

All files can be individually downloaded. They are not chunked into .tar or .zip files.

Dataset Access via Command-line Interface (recommended)

Users can access the dataset using s5cmd or AWS CLI. These are locally installed command-line interfaces that can access datasets on AWS. Both CLIs have very similar commands, so an s5cmd command can typically be converted to an AWS CLI command by replacing the prefix s5cmd with aws s3.

In this section, we will share some s5cmd commands.

How to Download MegaScenes to Local Disk

To copy a file or a directory from AWS to local disk, use this command: s5cmd --no-sign-request cp <source_bucket_url> <local_dest>

Alternatively, sync can be used instead of cp. sync additionally checks for differences between AWS and the locally downloaded dataset.

Important: If the source URL is a directory, then it must end with a wildcard (*).

Example 1: Download the entire MegaScenes dataset to local disk

This command will download the entire dataset to a local folder called MegaScenes/.

s5cmd --no-sign-request cp s3://megascenes/* ./MegaScenes/

Example 2: Download a specific directory to local disk

This command recursively downloads the contents of the images folder from AWS into the local folder MegaScenes/images/:

s5cmd --no-sign-request cp s3://megascenes/images/* ./MegaScenes/images/

Example 3: Download a single file to local disk

This command downloads a specific database.db file from AWS into its respective local folder:

s5cmd --no-sign-request cp s3://megascenes/databases/main/000/000/database.db ./MegaScenes/databases/main/000/000/database.db

Downloading subsets of MegaScenes

It is possible to use s5cmd to define subsets of MegaScenes to download; this is done with s5cmd run with a text file of s5cmd commands. For more information, see s5cmd's documentation on running multiple commands in parallel.
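As a minimal sketch of this workflow, the Python snippet below builds one cp command per scene, in the form that a command file for s5cmd run expects (one command per line, without the s5cmd prefix). The helper name build_commands, the example scene-IDs, and the file name subset.txt are illustrative assumptions; the folder layout follows the Scene Folders section.

```python
BUCKET = "s3://megascenes"


def build_commands(scene_ids, subdir="images", local_root="./MegaScenes"):
    """Build one s5cmd `cp` line per scene (hypothetical helper)."""
    lines = []
    for scene_id in scene_ids:
        padded = f"{scene_id:06d}"  # e.g. 1234 -> "001234"
        rel = f"{subdir}/{padded[:3]}/{padded[3:]}"
        # The wildcard is required because the source is a directory.
        lines.append(f"cp {BUCKET}/{rel}/* {local_root}/{rel}/")
    return lines


if __name__ == "__main__":
    # Example scene-IDs chosen only for illustration.
    for line in build_commands([7, 1234]):
        print(line)
```

Saving the printed lines to a file such as subset.txt would then allow running s5cmd --no-sign-request run subset.txt.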

List MegaScenes directory contents on AWS

To list directory contents on AWS, use this command: s5cmd --no-sign-request ls <bucket_url>

This command is helpful for seeing what items are in each directory before downloading them to the local machine.

Example

This command lists the contents of the databases/ subfolder on AWS.

Input:

s5cmd --no-sign-request ls s3://megascenes/databases/

Output:

                                  DIR  descriptors/
                                  DIR  main/

Other Notes

The --no-sign-request flag lets users access the AWS bucket without creating or supplying AWS credentials.

For other commands, please see the s5cmd or AWS CLI documentation.

Dataset Access via HTTP

Individual files can be downloaded over HTTP (via wget or curl) using the base URL https://megascenes.s3.us-west-2.amazonaws.com/.

For instance, https://megascenes.s3.us-west-2.amazonaws.com/metadata/subcat/000/007/subcats.json is a direct download for the subcategory information for scene-ID 7.
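The same download can be sketched in Python using only the standard library. The helper names http_url and download are hypothetical, and percent-encoding the key with urllib.parse.quote is an assumption made so that file names with special characters still form a valid URL.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "https://megascenes.s3.us-west-2.amazonaws.com/"


def http_url(key: str) -> str:
    """Map a path inside s3://megascenes/ to its HTTPS URL."""
    # quote() leaves "/" untouched and percent-encodes unsafe characters.
    return BASE_URL + quote(key.lstrip("/"))


def download(key: str, dest: str) -> None:
    """Fetch one file over HTTPS and save it to `dest`."""
    urllib.request.urlretrieve(http_url(key), dest)


if __name__ == "__main__":
    print(http_url("metadata/subcat/000/007/subcats.json"))
```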

Dataset Layout

The bucket's top-level directories documented below are databases/, images/, metadata/, and reconstruct/.

In the applicable subdirectories, a scene is represented by its zero-padded six-digit scene-ID number, as described in Scene Folders. The file that maps each scene name to its scene-ID is at s3://megascenes/metadata/categories.json. For details on subfolder contents, see the respective sections below.

databases/ Directory

This directory houses COLMAP databases for each scene. COLMAP databases contain tabulated information on images, keypoints, descriptors, matches, and estimated two-view geometries. COLMAP databases use the SQLite format.
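Because the files are plain SQLite, a downloaded database can be inspected with Python's built-in sqlite3 module. The sketch below lists every table with its row count; the helper name table_row_counts is hypothetical, and no particular table layout is assumed.

```python
import sqlite3


def table_row_counts(db_path: str) -> dict:
    """Return {table_name: row_count} for every table in a SQLite file."""
    con = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in con.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Table names come from the database itself, so quoting them is enough.
        return {
            name: con.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        con.close()
```

For example, table_row_counts("./MegaScenes/databases/main/000/000/database.db") would summarize the file downloaded in Example 3.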

The databases/ directory is split into two subdirectories: main/ and descriptors/.

In the two above subdirectories, a scene is represented by its scene-ID number as described in Scene Folders.

Partitioned Databases

For each scene, the COLMAP database is partitioned into two files:

We separate the Descriptors table because it accounts for most of the space in the COLMAP database and is not needed by every application.

Example

For a scene with ID 1234, the database files are as follows:

images/ Directory

This directory houses images and image metadata for each scene. A scene is represented by its scene-ID number as described in Scene Folders.

The images/ directory is 3.2 TB.

JSON Contents

A scene can have any number of subcategories. Each subcategory contains images, a raw_metadata.json, a category.json, and a 0/category.json.

Image metadata is stored in raw_metadata.json. This JSON file has one key per image name; each entry holds data extracted from Wikimedia Commons, including EXIF data and licensing information.

The scene subcategory name resides in subcategory_name/category.json.

A list of image names resides in subcategory_name/0/category.json.

Example

For a scene with ID 1234, the image files are as follows:

metadata/ Directory

This directory houses metadata for the dataset.

The metadata/ directory has the following contents:

Subcategory Information

Subcategory information resides in the metadata/subcat/ directory. This directory is organized by scene-ID number as described in Scene Folders.

A scene is present in metadata/subcat/ only if it has at least one category besides the main category. Such a scene will have a subcats.json to represent the subcategory data.

A subcats.json file is a dictionary that contains the fields main_category, graph, and frontier.

Example

The category Arco degli Argentari has a scene-ID of 7. The subcategory information for this scene is at s3://megascenes/metadata/subcat/000/007/subcats.json, and has the following contents:

{
    "main_category": "Arco_degli_Argentari",
    "graph": {
        "Arco_degli_Argentari": [
            "Arco_degli_Argentari_in_art",
            "Historical_images_of_the_Arco_degli_Argentari"
        ],
        "Arco_degli_Argentari_in_art": [],
        "Historical_images_of_the_Arco_degli_Argentari": [
            "Arco_degli_Argentari_in_art"
        ]
    },
    "frontier": []
}

Here, the graph shows that the main category Arco degli Argentari has two subcategories: Arco degli Argentari in art and Historical images of the Arco degli Argentari. The category Arco degli Argentari in art has no subcategories, hence the empty list. In contrast, the category Historical images of the Arco degli Argentari has the subcategory Arco degli Argentari in art.

The frontier list is empty, meaning that this subcategory graph is expanded in its entirety.
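As a small illustration of how the graph field can be consumed, the sketch below collects every subcategory reachable from the main category. The function name all_subcategories is hypothetical. Because a category can appear under more than one parent (as Arco_degli_Argentari_in_art does here), the traversal tracks visited nodes rather than assuming a tree.

```python
import json

# Contents of subcats.json for scene-ID 7, copied from the example above.
SUBCATS = json.loads("""
{
    "main_category": "Arco_degli_Argentari",
    "graph": {
        "Arco_degli_Argentari": [
            "Arco_degli_Argentari_in_art",
            "Historical_images_of_the_Arco_degli_Argentari"
        ],
        "Arco_degli_Argentari_in_art": [],
        "Historical_images_of_the_Arco_degli_Argentari": [
            "Arco_degli_Argentari_in_art"
        ]
    },
    "frontier": []
}
""")


def all_subcategories(subcats: dict) -> set:
    """Collect every category reachable from the main category."""
    graph = subcats["graph"]
    main = subcats["main_category"]
    seen = set()
    stack = [main]
    while stack:
        node = stack.pop()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    seen.discard(main)  # guard against a cycle that leads back to the root
    return seen


if __name__ == "__main__":
    print(sorted(all_subcategories(SUBCATS)))
```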

Index of Images

We provide a table that indexes the images in MegaScenes at s3://megascenes/metadata/images_index.parquet (HTTPS download) (~230 MB). Parquet files store tabular data like CSV files but are more compact and faster to read. They can be read with Python dataframe libraries such as Polars (recommended) or Pandas. The table contains over 8 million rows, each representing an image in the dataset. The columns are:

The respective Wikimedia Commons page for an image is at the URL https://commons.wikimedia.org/wiki/File:{image_name}. Likewise, the respective Wikimedia Commons page for a category is at the URL https://commons.wikimedia.org/wiki/Category:{cat or subcat}.
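These URL patterns can be sketched as two small helpers. The function names are hypothetical, and Example.jpg is a placeholder file name rather than an image from the dataset. Replacing spaces with underscores relies on the two being interchangeable in Wikimedia Commons URLs; percent-encoding the title is an added assumption for names with special characters.

```python
from urllib.parse import quote

COMMONS_WIKI = "https://commons.wikimedia.org/wiki/"


def _page_url(namespace: str, title: str) -> str:
    """Build a Commons page URL for a title in the given namespace."""
    return f"{COMMONS_WIKI}{namespace}:{quote(title.replace(' ', '_'))}"


def commons_file_url(image_name: str) -> str:
    """URL of the Wikimedia Commons page for an image."""
    return _page_url("File", image_name)


def commons_category_url(category: str) -> str:
    """URL of the Wikimedia Commons page for a category or subcategory."""
    return _page_url("Category", category)


if __name__ == "__main__":
    print(commons_category_url("Arco degli Argentari"))
    print(commons_file_url("Example.jpg"))  # placeholder file name
```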

While this table contains the licensing information parsed from Wikimedia Commons, we encourage users to verify the image licenses themselves.

Wikidata Entries

The wikidata/ subdirectory is organized by Wikidata Q-ID. Each of the first three digits of the Q-ID names one level of a three-level nested folder path, and the Wikidata JSON for the item resides in the innermost folder. If the Q-ID has fewer than three digits, then its JSON resides in the other/ folder. Unlike the scene-IDs, this number is NOT zero-padded.

Examples

The JSON for a Wikidata item with Q-ID Q1234 is located at metadata/wikidata/1/2/3/Q1234.json.

The JSON for a Wikidata item with Q-ID Q12 is located at metadata/wikidata/other/Q12.json.
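The rule behind these two examples can be sketched as a short function. The name wikidata_key is hypothetical, and the input validation is an added safeguard rather than something the dataset prescribes.

```python
def wikidata_key(qid: str) -> str:
    """Return the path under the bucket for a Wikidata item's JSON."""
    digits = qid[1:]
    if not (qid.startswith("Q") and digits.isdigit()):
        raise ValueError(f"not a Wikidata Q-ID: {qid!r}")
    if len(digits) < 3:
        # Q-IDs with fewer than three digits live in other/.
        return f"metadata/wikidata/other/{qid}.json"
    first, second, third = digits[:3]  # not zero-padded
    return f"metadata/wikidata/{first}/{second}/{third}/{qid}.json"


if __name__ == "__main__":
    print(wikidata_key("Q1234"))
    print(wikidata_key("Q12"))
```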

Resources

For JSON documentation, see this page on Wikibase JSON.

For additional tools to parse this JSON, see this Wikidata page on Data access.

reconstruct/ Directory

This directory contains the COLMAP sparse point cloud reconstructions for each scene. The reconstruct/ directory is organized by scenes, according to a scene-ID number as described in Scene Folders. Each reconstruction consists of an images.bin, cameras.bin, and points3D.bin as described here. A scene may have zero or more reconstructions; the reconstruct/ folder only contains scenes with one or more.

The reconstruct/ folder is 429 GB.

Example

Suppose a scene with ID 1234 has three reconstructions. In this scene's sparses/ folder, there will be three folders numbered from 0 to 2.

Specifically, the format is as follows:

Visualizing Reconstructions

The sparse reconstructions in MegaScenes can be viewed using our web viewer.

Alternatively, reconstructions can be viewed locally using the COLMAP GUI (requires a COLMAP installation).

Loading Reconstructions in Scripts

The reconstructions can be loaded in Python using the read_write_model.py script from the COLMAP repository. Specifically, the helpful functions are: read_model, read_points3D_binary, read_images_binary, read_cameras_binary
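For a quick sanity check that does not need the COLMAP script, the sketch below reads only the element count at the start of each binary file. It assumes, based on how read_write_model.py parses these files, that each .bin file begins with a little-endian unsigned 64-bit count (cameras, registered images, and 3D points respectively). The helper names are hypothetical, and the functions listed above remain the way to load the full model.

```python
import struct
from pathlib import Path


def read_count(bin_path) -> int:
    """Read the leading little-endian uint64 count of a COLMAP .bin file."""
    with open(bin_path, "rb") as f:
        header = f.read(8)
    if len(header) != 8:
        raise ValueError(f"{bin_path} is too short to hold a count")
    return struct.unpack("<Q", header)[0]


def model_summary(model_dir) -> dict:
    """Return the counts stored in cameras.bin, images.bin, and points3D.bin."""
    model_dir = Path(model_dir)
    return {
        name: read_count(model_dir / f"{name}.bin")
        for name in ("cameras", "images", "points3D")
    }
```

Here model_dir is the folder of a single reconstruction, that is, the one that directly contains the three .bin files.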

Scene Folders

The dataset uses a system of two subfolders to divide scenes, where each scene has a scene-ID number. The first subfolder uses the first three digits of the 6-digit zero-padded scene ID. The second subfolder uses the last three digits. The data associated with the scene resides in the latter subfolder.

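For example, scene-ID 7 is zero-padded to 000007 and resides in 000/007/, and scene-ID 1234 is zero-padded to 001234 and resides in 001/234/.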
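The mapping from scene-ID to folder can be sketched in a few lines of Python. The function name scene_folder and the range check are illustrative assumptions.

```python
def scene_folder(scene_id: int) -> str:
    """Return the two-level folder for a scene-ID, e.g. 1234 -> '001/234'."""
    if not 0 <= scene_id <= 999_999:
        raise ValueError("scene-ID must fit in six digits")
    padded = f"{scene_id:06d}"  # zero-pad to six digits
    return f"{padded[:3]}/{padded[3:]}"


if __name__ == "__main__":
    # Path of the subcategory file for scene-ID 7.
    print(f"s3://megascenes/metadata/subcat/{scene_folder(7)}/subcats.json")
```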

Each scene is based on a category from Wikimedia Commons. For instance, the scene "Arc_de_Triomphe_de_l'Étoile" uses images from Category:Arc de Triomphe de l'Étoile and its subcategories. MegaScenes uses underscores instead of spaces for scene names, but the two are interchangeable when used in Wikimedia Commons URLs.

The file s3://megascenes/metadata/categories.json (HTTP Link) links the category name to the scene-ID.

Contributions, Issues, and Suggestions

If you find any incorrect reconstructions or have improvements for the dataset, please create a GitHub issue or discussion post.

License

This dataset is licensed under the Creative Commons Attribution 4.0 International License. The photos in the images/ folder have their own licenses.