bigscience-workshop / lam

Libraries, Archives and Museums (LAM)
Apache License 2.0
82 stars 7 forks

Add dataset: odeuropa_smell_objects #71

Open davanstrien opened 2 years ago

davanstrien commented 2 years ago

A URL for this dataset

https://doi.org/10.5281/zenodo.6367776

Dataset description

From the Zenodo page:

This dataset is released as part of the Odeuropa project. The annotations are identical to the training set of the ICPR2022-ODOR Challenge. It contains bounding box annotations for smell-active objects in historical artworks gathered from various digital connections. The smell-active objects annotated in the dataset either carry smells themselves or hint at the presence of smells. The dataset provides 15484 bounding boxes on 2116 artworks in 87 object categories. An additional csv file contains further image-level metadata such as artist, collection, or year of creation.

Object detection datasets are time-consuming to collect, and there are relatively few object detection datasets that use LAM data. Those that do exist often rely on the output of one of the various YOLO models, which may be of some interest but often includes categories that are unlikely to be particularly useful for research on, or curation of, LAM collections. This dataset, in contrast, includes categories related to smell: a topic of interest to both art historians and social historians. As a result, it offers a much richer exploration of the possibilities of using object detection with historical paintings.

Dataset modality

Image

Dataset licence

Creative Commons Attribution 4.0 International

Other licence

No response

How can you access this data

Other

Confirm the dataset has an open licence

Contact details for data custodian

No response

davanstrien commented 2 years ago

Happy to help anyone who wants to work on this. I have a WIP loading script for another COCO formatted dataset: https://huggingface.co/datasets/biglam/nls_chapbook_illustrations
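For anyone picking this up, here is a minimal sketch of how COCO-style annotations could be regrouped into one record per image. It assumes the annotation file follows the standard COCO layout (`images`, `annotations`, `categories`, with `bbox` given as `[x, y, width, height]`); the function name `group_boxes_by_image` and the output keys are hypothetical and not taken from the existing loading script.

```python
from collections import defaultdict


def group_boxes_by_image(coco):
    """Turn a COCO-style dict into one record per image with its boxes."""
    # Map numeric category ids to their human-readable names.
    categories = {c["id"]: c["name"] for c in coco["categories"]}

    # Collect every annotation under the image it belongs to.
    boxes = defaultdict(list)
    for ann in coco["annotations"]:
        boxes[ann["image_id"]].append(
            {"bbox": ann["bbox"], "category": categories[ann["category_id"]]}
        )

    # Images without annotations get an empty list of objects.
    return [
        {"file_name": img["file_name"], "objects": boxes.get(img["id"], [])}
        for img in coco["images"]
    ]
```

The input would typically come from `json.load` on the annotation file, and each returned record could then be yielded from a loading script.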

davanstrien commented 2 years ago

Also, I really want to call this dataset smelly_objects...

shamikbose commented 2 years ago

I'd love to work on this! Will be a good change from the text datasets so far.

shamikbose commented 2 years ago

self-assign

davanstrien commented 2 years ago

Awesome, and don't worry if you can't finish this before you go away. It can wait until you're back too 🙂

shamikbose commented 2 years ago

Hopefully, I should be able to get it done. From the Zenodo page:

Due to licensing issues, we cannot provide the images directly, but instead provide a collection of links and a download script.

Should the dataset just contain the links to the images then?

davanstrien commented 2 years ago

Yes I think that would be best for this one. We can provide example code for downloading the images in the datacard.
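A rough sketch of what such download code might look like, using only the standard library. It is not the script shipped with the dataset; the function name `download_image`, its parameters, and the retry policy are assumptions made for illustration. A failed link returns `False` rather than raising, so a loop over many URLs can carry on and report the failures at the end.

```python
import time
import urllib.request
from pathlib import Path


def download_image(url, dest, timeout=10.0, retries=2, backoff=1.0,
                   opener=urllib.request.urlopen):
    """Try to save `url` to `dest`; return True on success, False otherwise."""
    for attempt in range(retries + 1):
        try:
            with opener(url, timeout=timeout) as response:
                Path(dest).write_bytes(response.read())
            return True
        except OSError:
            # URLError, HTTPError, timeouts and connection errors are all
            # subclasses of OSError, so a dead host ends up here.
            if attempt < retries:
                time.sleep(backoff * 2 ** attempt)  # wait longer each retry
    return False
```

The `opener` argument exists only so the function can be exercised without network access; in normal use it can be left at its default.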

shamikbose commented 2 years ago

@davanstrien This dataset has a lot of associated metadata

       ['File Name', 'Artist', 'Title', 'Query', 'Part', 'Earliest Date',
       'Latest Date', 'Margin Years', 'Genre', 'Material', 'Medium',
       'Height of Image Field', 'Width of Image Field', 'Type of Object',
       'Height of Object', 'Width of Object', 'Diameter of Object',
       'Position of Depiction on Object', 'Current Location',
       'Repository Number', 'Original Location', 'Original Place',
       'Original Position', 'Context', 'Place of Discovery',
       'Place of Manufacture', 'Associated Scenes', 'Object Categories',
       'Related Works of Art', 'Type of Similarity', 'Inscription',
       'Text Source', 'Bibliography', 'Photo Archive', 'Image URL',
       'Details URL', 'Additional Information']

Should they all be included in the dataset? From a cursory glance at the data, most of them are missing values. Current Location, Earliest Date, Latest Date, Genre, Material and Medium are populated for most of the images. I was thinking some of the fields, like Material and Medium, could maybe be used for classification.

davanstrien commented 2 years ago

My own feeling would be to include as much as possible. One option if things are often missing would be to put some of this metadata in an additional metadata column as a dictionary? This way it doesn't get lost but also is slightly less distracting than having a lot of columns with mostly missing data?
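As a sketch of that idea, assuming each CSV row has already been read into a plain dictionary: the well-populated fields stay as top-level columns and every remaining non-empty field is packed into a single `metadata` dictionary. The choice of `CORE_COLUMNS` and the helper names are hypothetical.

```python
# Hypothetical choice of columns that stay at the top level.
CORE_COLUMNS = ("File Name", "Image URL", "Genre", "Material", "Medium")


def is_missing(value):
    """Treat None, NaN and blank strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN is never equal to itself
        return True
    return isinstance(value, str) and not value.strip()


def pack_row(row, core_columns=CORE_COLUMNS):
    """Keep core columns as-is and fold the rest into a 'metadata' dict."""
    record = {
        name: None if is_missing(row.get(name)) else row.get(name)
        for name in core_columns
    }
    # Only fields that actually have a value are kept in the nested dict.
    record["metadata"] = {
        key: value
        for key, value in row.items()
        if key not in core_columns and not is_missing(value)
    }
    return record
```

If the column needs a fixed schema, the dictionary could instead be serialised to a string with `json.dumps`.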

shamikbose commented 2 years ago

Yeah, I was building out the features as follows:

import datasets

# Frequently populated fields sit at the top level;
# sparser fields are nested under "metadata".
features = datasets.Features(
    {
        "id": datasets.Value("string"),
        "url": datasets.Value("string"),
        "annotations": datasets.Value("string"),
        "date": datasets.Value("string"),
        "genre": datasets.Value("string"),
        "material": datasets.Value("string"),
        "metadata": {
            "artist": datasets.Value("string"),
            "query": datasets.Value("string"),
            "title": datasets.Value("string"),
            "height": datasets.Value("string"),
            "width": datasets.Value("string"),
        },
    }
)

I'll probably get back to this in about two weeks, after I come back from vacation.

davanstrien commented 2 years ago

Have a great vacation!

shamikbose commented 2 years ago

@davanstrien I'm back to working on this dataset, but it seems like the URLs aren't accessible. Even the download script provided in the dataset gives the following error:

TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

Example from the first image in the metadata document:

URL: http://www.sigecweb.beniculturali.it/images/fullsize/ICCD50007114/ICCD4644613_SBAS%20RM%20223305.jpg

davanstrien commented 2 years ago

@shamikbose hey, hope you had a good break!

I'll try and take a look at this too but also tagging @kiymetakdemir who works on this project and might be able to help with this.

shamikbose commented 2 years ago

@davanstrien I did! It was a much-needed break. Thanks for adding @kiymetakdemir. Hoping this data can still be accessed.

kiymetakdemir commented 2 years ago

Hi @shamikbose, can you check it again? I just tried downloading the images with the given script and didn't encounter any errors; they downloaded successfully.

shamikbose commented 2 years ago

@kiymetakdemir I was able to download them today. Thanks!

shamikbose commented 2 years ago

@kiymetakdemir I get an error for this URL:

http://134.76.24.240/download/07876601/flc0596164z_p?Expires=1610722060&Signature=SX15SE0B~KbZ7yvkTJtis1rsKysZddvhsxJzZSZ7oZoxqd~NNsKp22iYZGBQViGXMy7zwTDCYxu-Qan2O0aq2QxizENey~CF4WIV5-~bHwEZZjrmCoBdWDEeS0Y6XNajZ6DYzWQolxkiGWoqLs~Bw0j4GSrQef7QvgQciIWDlTE_&Key-Pair-Id=APKAJGHHKKX2FHRP63AQ

It's not accessible.

Update: The links from www.sigecweb.beniculturali.it are timing out again.
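One thing that may be worth checking for links of this shape is the `Expires` query parameter. The sketch below assumes it is a Unix timestamp, as in CloudFront-style signed URLs; whether this particular server enforces it is not confirmed here. The helper names and the example URL are hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit


def signed_url_expiry(url):
    """Return the 'Expires' timestamp of a URL as a UTC datetime, or None."""
    values = parse_qs(urlsplit(url).query).get("Expires")
    if not values or not values[0].isdigit():
        return None
    return datetime.fromtimestamp(int(values[0]), tz=timezone.utc)


def is_expired(url, now=None):
    """True if the URL carries an 'Expires' timestamp that has already passed."""
    expiry = signed_url_expiry(url)
    if expiry is None:
        return False  # no usable expiry information in the URL
    if now is None:
        now = datetime.now(timezone.utc)
    return expiry <= now


# Hypothetical URL that reuses only the Expires value from the link above.
example = "http://example.invalid/image?Expires=1610722060&Signature=abc&Key-Pair-Id=XYZ"
print(signed_url_expiry(example))  # 2021-01-15 14:47:40+00:00
```

Under that assumption, a metadata file could be scanned up front to flag links that are likely to fail before any download is attempted.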