Paper | Arxiv | Project Page
The MegaScenes Dataset is an extensive collection of around 430K scenes and 9M images and epipolar geometries, featuring over 100K structure-from-motion reconstructions from 2M of these images. The images of these scenes are captured under varying conditions, including different times of day, various weather and illumination, and from different devices with distinct camera intrinsics.
To view reconstructions in the browser, see our Web Viewer!
The MegaScenes Dataset is hosted on Amazon S3 thanks to the AWS Open Data Sponsorship Program.
Specifically, MegaScenes uses the AWS S3 bucket URL s3://megascenes/ in the US-West-2 AWS Region.
Users can access the dataset using s5cmd or AWS CLI. These are locally installed command-line interfaces that can access datasets on AWS.
Both CLIs have very similar commands, so an s5cmd command can typically be converted to an AWS CLI command by replacing the prefix s5cmd with aws s3.
In this section, we will share some s5cmd commands.
Copy a file or a directory locally: s5cmd --no-sign-request cp <source_bucket_url> <local_dest>
Alternatively, sync can be used instead of cp. sync additionally checks for differences between AWS and the locally downloaded dataset, so files that are already up to date are not copied again.
If the source URL is a directory, then it must end with a wildcard (*).
This command recursively downloads the contents of the images folder from AWS into the local folder MegaScenes/images/:
s5cmd --no-sign-request cp s3://megascenes/images/* ./MegaScenes/images/
This command downloads a specific database.db file from AWS into its respective local folder:
s5cmd --no-sign-request cp s3://megascenes/databases/main/000/000/database.db ./MegaScenes/databases/main/000/000/database.db
List directory contents: s5cmd --no-sign-request ls <bucket_url>
Input
s5cmd --no-sign-request ls s3://megascenes/databases/
Output
DIR descriptors/
DIR main/
The --no-sign-request flag lets the user access the AWS bucket without creating and supplying AWS credentials.
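For scripted downloads, the copy command above can be assembled programmatically. The following is a minimal sketch, not part of the dataset tooling: the helper name s5cmd_cp and its is_directory parameter are hypothetical, and it only builds the argument list, so s5cmd must be installed separately before the command is run.

```python
import shlex

BUCKET = "s3://megascenes/"


def s5cmd_cp(key: str, local_dest: str, is_directory: bool = False) -> list[str]:
    """Build the argv for an unsigned s5cmd copy from the MegaScenes bucket.

    `key` is a path inside the bucket, e.g. "images/" or
    "databases/main/000/000/database.db". (Hypothetical helper.)
    """
    source = BUCKET + key.lstrip("/")
    if is_directory:
        # Directory sources need a trailing wildcard, as noted above.
        source = source.rstrip("/") + "/*"
    return ["s5cmd", "--no-sign-request", "cp", source, local_dest]


cmd = s5cmd_cp("images/", "./MegaScenes/images/", is_directory=True)
# shlex.join quotes the wildcard so a shell would pass it to s5cmd unexpanded.
print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run(cmd, check=True), which avoids shell quoting issues altogether.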
Individual files can be downloaded over HTTP (via wget or curl) using the base URL https://megascenes.s3.us-west-2.amazonaws.com/.
For instance, https://megascenes.s3.us-west-2.amazonaws.com/metadata/subcat/000/007/subcats.json is a direct download for the subcategory information for scene-ID 7.
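The same HTTP access can be scripted without wget or curl. This is a small sketch using only the Python standard library; the function names http_url and download are illustrative and not part of any MegaScenes tooling.

```python
from urllib.request import urlretrieve

BASE_URL = "https://megascenes.s3.us-west-2.amazonaws.com/"


def http_url(key: str) -> str:
    """Turn a path inside the bucket into its direct HTTP download URL."""
    return BASE_URL + key.lstrip("/")


def download(key: str, local_path: str) -> None:
    """Fetch a single file over HTTP and save it to local_path."""
    urlretrieve(http_url(key), local_path)


print(http_url("metadata/subcat/000/007/subcats.json"))
```

Calling download("metadata/subcat/000/007/subcats.json", "subcats.json") would then fetch the subcategory file for scene-ID 7.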
The bucket's directory tree is as follows:
s3://megascenes/ or https://megascenes.s3.us-west-2.amazonaws.com/
    databases/
        main/
            000/000/ . . . 458/152/
        descriptors/
            000/000/ . . . 458/152/
    images/
        000/000/ . . . 458/152/
    metadata/
        subcat/
            000/000/ . . . 458/148/
        wikidata/
            0/0/0/ . . . 9/9/9/, other/
    reconstruct/
        000/000/ . . . 458/150/
    README.md
In the applicable subdirectories, a scene is represented by its zero-padded six-digit scene-ID number, as described in Scene Folders. A dictionary that maps scene names to scene-IDs can be found at: s3://megascenes/metadata/categories.json. For details on subfolder contents, see the respective sections below.
databases/ Directory
This directory houses COLMAP databases for each scene. COLMAP databases contain tabulated information on images, keypoints, descriptors, matches, and estimated two-view geometries. COLMAP databases use the SQLite format.
The databases/ directory is broken into two subdirectories:
main/ (1.9 TB), which contains database.db files
descriptors/ (6.8 TB compressed, 8.3 TB uncompressed), which contains descriptors.db.gz files
In the two subdirectories above, a scene is represented by its scene-ID number as described in Scene Folders.
For each scene, the COLMAP database is partitioned into two files:
database.db, which is the COLMAP database without the Descriptors table.
descriptors.db.gz, which is the Descriptors table extracted from the COLMAP database as its own SQLite database. It is compressed with gzip.
We separate the Descriptors table since it takes up the majority of the space in the COLMAP database and may not contain relevant information for certain applications.
For a scene with ID 1234, the database files are as follows:
databases/main/001/234/database.db
databases/descriptors/001/234/descriptors.db.gz
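Since both files are SQLite databases, they can be inspected with standard tools once downloaded. The sketch below, using only the Python standard library, decompresses a descriptors.db.gz and lists the tables of a database; the function names are illustrative, and the actual table names and schemas are defined by COLMAP rather than assumed here.

```python
import gzip
import shutil
import sqlite3
from pathlib import Path


def gunzip(src: Path, dest: Path) -> Path:
    """Decompress a .gz file (e.g. descriptors.db.gz) to dest."""
    with gzip.open(src, "rb") as f_in, open(dest, "wb") as f_out:
        # Stream in chunks so large databases are never held fully in memory.
        shutil.copyfileobj(f_in, f_out)
    return dest


def list_tables(db_path: Path) -> list[str]:
    """Return the table names of a SQLite database, sorted by name."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]
```

For example, list_tables(Path("MegaScenes/databases/main/001/234/database.db")) would show which COLMAP tables are present in a downloaded scene database.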
images/ Directory
This directory houses images and image metadata for each scene. A scene is represented by its scene-ID number as described in Scene Folders.
The images/ directory is 3.2 TB.
A scene can have any number of subcategories. Each subcategory contains images, a raw_metadata.json, a category.json, and a 0/category.json.
Image metadata is stored in raw_metadata.json. This JSON file has one key per image name; each entry holds data extracted from Wikimedia Commons, including EXIF data and licensing information.
The scene subcategory name resides in subcategory_name/category.json.
A list of image names resides in subcategory_name/0/category.json.
For a scene with ID 1234, the image files are as follows:
images/
    001/234/
        commons/
            subcategory_name_1/
                category.json
                raw_metadata.json
                0/
                    category.json
                    pictures/
                        image1.jpg
                        image2.jpg
            subcategory_name_2/
                category.json
                raw_metadata.json
                0/
                    category.json
                    pictures/
                        image1.jpg
                        image2.jpg
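As an illustration of how this layout might be consumed, the sketch below collects the raw_metadata.json of every subcategory of a locally downloaded scene. It relies only on the facts stated above (subcategory folders live under commons/, and raw_metadata.json is keyed by image name); the function name is hypothetical, and the fields inside each metadata entry are not assumed.

```python
import json
from pathlib import Path


def load_subcategory_metadata(scene_dir: Path) -> dict[str, dict]:
    """Map each subcategory folder name to its parsed raw_metadata.json.

    `scene_dir` is a local scene folder such as MegaScenes/images/001/234.
    """
    metadata = {}
    for subcat_dir in sorted((scene_dir / "commons").iterdir()):
        meta_file = subcat_dir / "raw_metadata.json"
        # Skip anything that is not a subcategory folder with metadata.
        if meta_file.is_file():
            with open(meta_file, encoding="utf-8") as f:
                metadata[subcat_dir.name] = json.load(f)
    return metadata
```

The image names of a subcategory are then the keys of the corresponding dictionary, for example sorted(metadata["subcategory_name_1"]).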
metadata/ Directory
This directory houses metadata for the dataset.
The metadata/ directory has the following contents:
subcat/ (386 MB), a directory that contains JSON files of subcategory information for scenes with at least one subcategory
wikidata/ (4.5 GB), a directory that contains JSON files for all Wikidata entries related to a scene or their hierarchical classes
categories.json (19.2 MB), a dictionary that maps a Wikimedia Commons category name to a scene-ID
Subcategory information resides in the metadata/subcat/ directory. This directory is organized by scene-ID number as described in Scene Folders.
A scene is present in metadata/subcat/ only if it has at least one category besides the main category. Such a scene will have a subcats.json to represent the subcategory data.
A subcats.json file is a dictionary that contains the following fields:
main_category: a string with the name of the Wikimedia Commons top-level category.
graph: a dictionary mapping a Wikimedia Commons category to a list of its direct subcategories. A category is a key in graph if it has been visited. An empty list means that the category has no subcategories.
frontier: a list of subcategories present in graph that have not been expanded to have their own key in graph.
The category Arco degli Argentari has a scene-ID of 7. The subcategory information for this scene is at s3://megascenes/metadata/subcat/000/007/subcats.json, and has the following contents:
{
  "main_category": "Arco_degli_Argentari",
  "graph": {
    "Arco_degli_Argentari": [
      "Arco_degli_Argentari_in_art",
      "Historical_images_of_the_Arco_degli_Argentari"
    ],
    "Arco_degli_Argentari_in_art": [],
    "Historical_images_of_the_Arco_degli_Argentari": [
      "Arco_degli_Argentari_in_art"
    ]
  },
  "frontier": []
}
Here, the graph shows that the main category Arco degli Argentari has two subcategories: Arco degli Argentari in art and Historical images of the Arco degli Argentari. The category Arco degli Argentari in art has no subcategories, hence the empty list. In contrast, the category Historical images of the Arco degli Argentari has the subcategory Arco degli Argentari in art.
The frontier list is empty, meaning that this subcategory graph is expanded in its entirety.
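Because a subcategory can appear under more than one parent (as Arco degli Argentari in art does here), walking the graph requires tracking visited categories. The following sketch shows one way to list every subcategory reachable from the main category with a breadth-first traversal; the function name is illustrative, and the inline dictionary simply repeats the example above.

```python
from collections import deque


def reachable_subcategories(subcats: dict) -> list[str]:
    """List all subcategories reachable from main_category, in BFS order."""
    graph = subcats["graph"]
    root = subcats["main_category"]
    seen = {root}
    order = []
    queue = deque([root])
    while queue:
        category = queue.popleft()
        # Categories on the frontier have no key in graph, hence .get().
        for child in graph.get(category, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


example = {
    "main_category": "Arco_degli_Argentari",
    "graph": {
        "Arco_degli_Argentari": [
            "Arco_degli_Argentari_in_art",
            "Historical_images_of_the_Arco_degli_Argentari",
        ],
        "Arco_degli_Argentari_in_art": [],
        "Historical_images_of_the_Arco_degli_Argentari": [
            "Arco_degli_Argentari_in_art"
        ],
    },
    "frontier": [],
}
print(reachable_subcategories(example))
# ['Arco_degli_Argentari_in_art', 'Historical_images_of_the_Arco_degli_Argentari']
```

Each subcategory is reported once, even though Arco_degli_Argentari_in_art is listed under two parents.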
The wikidata/ subdirectory is organized by Wikidata Q-ID. The first three digits of the Q-ID define the three nested subfolders in which the Wikidata JSON file can be found. If the Q-ID has fewer than three digits, then its JSON resides in the other/ folder. Unlike the scene IDs, this number is NOT zero-padded.
The JSON for a Wikidata item with Q-ID Q1234 is located at metadata/wikidata/1/2/3/Q1234.json.
The JSON for a Wikidata item with Q-ID Q12 is located at metadata/wikidata/other/Q12.json.
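The Q-ID rule above can be expressed as a short helper. This is a sketch of the mapping as described in this section; the function name wikidata_json_key is hypothetical.

```python
def wikidata_json_key(qid: str) -> str:
    """Return the bucket path of the JSON file for a Wikidata Q-ID."""
    digits = qid.lstrip("Qq")
    if not digits.isdigit():
        raise ValueError(f"not a Wikidata Q-ID: {qid!r}")
    name = f"Q{digits}.json"
    if len(digits) < 3:
        # Q-IDs with fewer than three digits live in other/.
        return f"metadata/wikidata/other/{name}"
    # The first three digits, not zero-padded, name the nested subfolders.
    a, b, c = digits[:3]
    return f"metadata/wikidata/{a}/{b}/{c}/{name}"


print(wikidata_json_key("Q1234"))  # metadata/wikidata/1/2/3/Q1234.json
print(wikidata_json_key("Q12"))    # metadata/wikidata/other/Q12.json
```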
For JSON documentation, see this page on Wikibase JSON.
For additional tools to parse this JSON, see this Wikidata page on Data access.
reconstruct/ Directory
This directory contains the COLMAP sparse point cloud reconstructions for each scene. The reconstruct/ directory is organized by scene, according to the scene-ID number as described in Scene Folders. Each reconstruction consists of an images.bin, cameras.bin, and points3D.bin as described here. A scene may have zero or more reconstructions; the reconstruct/ folder only contains scenes with one or more.
The reconstruct/ folder is 429 GB.
Suppose a scene with ID 1234 has three reconstructions. In this scene's sparses/ folder, there will be three folders numbered from 0 to 2.
Specifically, the format is as follows:
reconstruct/
    001/234/
        sparses/
            0/
                images.bin
                cameras.bin
                points3D.bin
            1/
                images.bin
                cameras.bin
                points3D.bin
            2/
                images.bin
                cameras.bin
                points3D.bin
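To get a quick overview of a downloaded scene, the reconstructions can be enumerated and their sizes read from the file headers. The sketch below assumes that each COLMAP binary model file begins with a little-endian unsigned 64-bit element count, which matches COLMAP's binary model format as we understand it; for full parsing, COLMAP's own tools should be preferred. The function names are illustrative.

```python
import struct
from pathlib import Path


def read_leading_count(bin_path: Path) -> int:
    """Read the uint64 element count assumed to start a COLMAP .bin file."""
    with open(bin_path, "rb") as f:
        header = f.read(8)
    if len(header) != 8:
        raise ValueError(f"{bin_path} is too short to contain a count")
    (count,) = struct.unpack("<Q", header)
    return count


def summarize_scene(scene_dir: Path) -> dict[str, dict[str, int]]:
    """Map each reconstruction index under sparses/ to its element counts."""
    recon_dirs = [
        d for d in (scene_dir / "sparses").iterdir()
        if d.is_dir() and d.name.isdigit()
    ]
    summary = {}
    # Sort numerically so that reconstruction 10 comes after 2.
    for recon_dir in sorted(recon_dirs, key=lambda d: int(d.name)):
        summary[recon_dir.name] = {
            name: read_leading_count(recon_dir / f"{name}.bin")
            for name in ("cameras", "images", "points3D")
        }
    return summary
```

For a local copy of the scene above, summarize_scene(Path("MegaScenes/reconstruct/001/234")) would return one entry per reconstruction folder.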
Scene Folders
The dataset uses a system of two subfolders to divide scenes; each scene has a scene-ID number. The first subfolder uses the first three digits of the 6-digit zero-padded scene ID. The second subfolder uses the last three digits. The data associated with the scene resides in the latter subfolder.
For example:
If a scene has ID 533, it is zero-padded to 000533. This number translates to the directory 000/533/.
If a scene has ID 422678, it translates to the directory 422/678/.
Each scene is based on a category from Wikimedia Commons. For instance, the scene "Arc_de_Triomphe_de_l'Étoile" uses images from Category:Arc de Triomphe de l'Étoile and its subcategories. MegaScenes uses underscores instead of spaces for scene names, but the two are interchangeable in Wikimedia Commons URLs.
The file s3://megascenes/metadata/categories.json (HTTP Link) links the category name to the scene-ID.
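Putting the folder scheme and the category lookup together, a scene name can be resolved to its directory in two steps. The sketch below uses hypothetical helper names and a one-entry stand-in for categories.json (the real file would be loaded with json.load); it assumes the stored scene-ID can be converted with int().

```python
def scene_dir(scene_id: int) -> str:
    """Convert a scene-ID to its two-level folder, e.g. 533 -> '000/533/'."""
    if not 0 <= scene_id <= 999_999:
        raise ValueError(f"scene-ID out of range: {scene_id}")
    padded = f"{scene_id:06d}"
    return f"{padded[:3]}/{padded[3:]}/"


def lookup_scene_id(categories: dict, name: str) -> int:
    """Look up a scene-ID by category name, accepting spaces or underscores."""
    return int(categories[name.replace(" ", "_")])


# Stand-in for the parsed categories.json; the entry is the example from above.
categories = {"Arco_degli_Argentari": 7}

scene_id = lookup_scene_id(categories, "Arco degli Argentari")
print(scene_dir(scene_id))  # 000/007/
print(scene_dir(422678))    # 422/678/
```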
We are continually looking for ways to improve the dataset. If you find any incorrect reconstructions, please create a GitHub issue here or a discussion post here.
This dataset is licensed under the Creative Commons Attribution 4.0 International License. The photos in the images/ folder have their own licenses.