bigscience-workshop / lam

Libraries, Archives and Museums (LAM)
Apache License 2.0

Add dataset: biodiversity_heritage_library #72

Open nabsiddiqui opened 1 year ago

nabsiddiqui commented 1 year ago

A URL for this dataset

https://www.flickr.com/photos/biodivlibrary/

Dataset description

Dataset of images uploaded from the Biodiversity Heritage Library hosted at the Smithsonian Institution. Useful for GANs, understanding the history of biodiversity, understanding changing aesthetic values over time, etc.

Dataset modality

Image

Dataset licence

Creative Commons Attribution Non Commercial Share Alike 4.0 International

Other licence

No response

How can you access this data

As a download from a repository/website

Confirm the dataset has an open licence

Contact details for data custodian

No response

nabsiddiqui commented 1 year ago

@MikeTrizna, @cceyda, and @shamikbose: this is the new repository for the Biodiversity Heritage Library Flickr.

@MikeTrizna, I think the easiest way is for me to download the images and extract the EXIF data using R. It shouldn't take me too long. I will then upload it to Hugging Face. After that, I think the thing to do is create a variety of metadata that we can just upload in a CSV. This is where you could add the embeddings information. My goal is to have something like the following:

Image File | Description | Keyword | Embeddings | Labels from VGNet 16 | Labels from VGNet 19 | Other Meta Data

I agree that ImageNet is not ideal, but we could use memespector to run through the images just in case. I haven't used memespector with so many images before, so let's see how it goes.

I will try to think of some ideas for the Journal of Open Humanities Data article and create a Google Doc after that and invite everyone to contribute.

davanstrien commented 1 year ago

Sorry for coming to this discussion a bit late. My preference would be to include the tags/descriptions generated by cataloguers/Flickr users rather than add labels output from a generic ML model.

For example, this image https://www.flickr.com/photos/biodivlibrary/7268619716/in/photolist-2kM8xSd-9vVsjU-2kM8xXD-XGMoPR-2kF7kHZ-wnaCYy-2jAiBTp-bu7ij8-c5izpA-bvSk4u-atGxWy-atDT9z-aUT4DB-atGxQS-akH6mL-d9s1hq-2m8ksrw-vGLxu7-wEgNjr-eV4dYa-wBsXJj-2j1mFUu-wEgJuH-wnhPxM-wEgKMH-ag15Gu-ag15CN-bK4bht contains the following EXIF data on Flickr

JFIFVersion - 1.02
X-Resolution - 96 dpi
Y-Resolution - 96 dpi
XMPToolkit - Image::ExifTool 12.30
About - uuid:faf5bdd5-ba3d-11da-ad31-d33d75182f1b
Creator - Craig, Hugh, ed.
Description - Johnson's household book of nature,. New York,H.J.Johnson,[1880]. http://biodiversitylibrary.org/page/39741740
Rights - Public Domain
Subject - Mammals
Credit - Image courtesy of BHL
Source - http://biodiversitylibrary.org 

and has the following tags:

Screenshot 2022-07-28 at 11 15 52

Although that information is noisy, I think it is more useful and could be used to evaluate noisy supervision and/or contrastive learning methods.

I'm not familiar with memespector but from looking at the GitHub for the tool, it seems like this offers a GUI to run images through one of the commercial vision platforms?

I just put an example through the Google API:

Screenshot 2022-07-28 at 11 18 35

Although this prediction isn't 'wrong', it's also arguably not that useful. I'm also quite wary of using commercial platforms like this because it's often difficult/impossible to get a complete list of possible labels they can assign to an image. Potentially something that could be evaluated using the existing metadata for the images is how closely the predictions of one of the commercial API vision services match the labels assigned on Flickr?
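The comparison suggested above could be sketched roughly as follows. This is a minimal illustration, not a worked-out evaluation: it assumes predicted labels and Flickr tags are already available as plain lists keyed by image id, and it uses exact string overlap (Jaccard) as the agreement measure, which is only one possible choice. The function names and the sample labels are hypothetical; only the image id comes from the Flickr URL quoted earlier.

```python
def normalise(labels):
    """Lower-case and strip labels so 'Mammals ' and 'mammals' compare equal."""
    return {label.strip().lower() for label in labels if label.strip()}


def jaccard(predicted, reference):
    """Return |A & B| / |A | B| for two label collections (0.0 if both are empty)."""
    a, b = normalise(predicted), normalise(reference)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def mean_agreement(predictions, flickr_tags):
    """Average Jaccard score over the image ids present in both mappings."""
    shared = predictions.keys() & flickr_tags.keys()
    if not shared:
        return 0.0
    return sum(jaccard(predictions[i], flickr_tags[i]) for i in shared) / len(shared)


# Hypothetical data: the labels and tags below are made up for illustration.
predictions = {"7268619716": ["Mammal", "Art", "Drawing"]}
flickr_tags = {"7268619716": ["mammal", "drawing", "illustration", "bhl"]}
print(mean_agreement(predictions, flickr_tags))
```

Exact matching will understate agreement when the API and the Flickr taggers use different vocabularies for the same thing, so a score like this is probably best read as a lower bound.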

Maybe it makes sense to see what @MikeTrizna already has prepared and take it from there?

nabsiddiqui commented 1 year ago

@davanstrien That makes sense. It turns out my ISP has a limit on downloading so I can't download all the images. @MikeTrizna will have to upload them.

I think what @MikeTrizna has is the metadata of the files, which include the EXIF data along with some Flickr API stuff. I will go ahead and clean that data up and make it a CSV file. I'll put the results here.

As for memespector: alongside the commercial APIs, it also provides free options like VGNet, etc. It can also do a lot of more interesting network stuff. See http://tallerdeletras.letras.uc.cl/index.php/Disena/article/view/27271/33509.

The open source options provide labels of the items, but for 300k images, it is probably too much. I will use that for a different dataset perhaps in the future. For this, I think the metadata should be sufficient. I will start working on that.

MikeTrizna commented 1 year ago

I don't think I ever mentioned that I would be uploading EXIF data, and as @davanstrien showed above, it is superfluous to the tags and information provided directly through Flickr. And Daniel, thanks for the clarification on the model-generated labels. I agree that it makes sense to leave those out, since they can always be re-generated separately as new and better models are released.

A quick question on the metadata: Is it possible to upload multiple CSVs to an image dataset? I'm specifically thinking of the tag data, which I would have to come up with some sort of concatenation scheme to squeeze into a single table.

MikeTrizna commented 1 year ago

Oh, and @nabsiddiqui, where did you get the "Creative Commons Attribution Non Commercial Share Alike 4.0 International" license information that you show above? I haven't checked through all of the images, but it was my understanding that they are all marked as https://creativecommons.org/publicdomain/mark/1.0/. But if you saw somewhere in the documentation that maybe the dataset as a whole is CC BY-NC-SA 4.0, we can defer to that.

shamikbose commented 1 year ago

For multiple metadata tags, you could put them in a list or other iterable in the schema and it should be fine.


nabsiddiqui commented 1 year ago

@MikeTrizna I looked at the script that you had. The keywords, tags, etc. that I believe you have are copies of the EXIF data, but they have additional metadata such as "ispublic" etc. from the Flickr API. I am using the same thing and am cleaning the data up a bit so that it is easier to work with. For instance, I've made the page information in the "keywords" into a separate column called "page". Hope that makes sense. I'll upload both the JSON and CSV following the suggestion of @shamikbose.
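A rough sketch of pulling the page information into its own column might look like this. It assumes the page reference appears as a biodiversitylibrary.org/page/ URL, as in the Description field quoted earlier in the thread; the field name `description` and the helper names are hypothetical, and the real Flickr records may store this differently.

```python
import re

# Assumption: page references look like the URL in the Description field quoted
# earlier, e.g. http://biodiversitylibrary.org/page/39741740 (http or https,
# with or without a www. prefix).
PAGE_URL = re.compile(r"https?://(?:www\.)?biodiversitylibrary\.org/page/(\d+)")


def extract_page(text):
    """Return the BHL page id found in text as a string, or None if there is none."""
    match = PAGE_URL.search(text or "")
    return match.group(1) if match else None


def add_page_column(record, source_field="description"):
    """Return a copy of record with a 'page' key derived from source_field."""
    updated = dict(record)
    updated["page"] = extract_page(record.get(source_field))
    return updated


record = {
    "description": "Johnson's household book of nature,. New York,H.J.Johnson,[1880]. "
                   "http://biodiversitylibrary.org/page/39741740",
}
print(add_page_column(record)["page"])
```

Records without a recognisable page URL get `None` in the new column rather than being dropped, so they can be reviewed separately.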

And yes, if you send me a CSV or JSON, I can merge it.

In terms of the copyright, I got it from https://about.biodiversitylibrary.org/help/copyright-and-reuse/#reuse. This may be incorrect.

davanstrien commented 1 year ago

> A quick question on the metadata: Is it possible to upload multiple CSVs to an image dataset? I'm specifically thinking of the tag data, which I would have to come up with some sort of concatenation scheme to squeeze into a single table.

In the actual dataset you can have nested data/sequences so it could look something like:

image | width | height | tags
image1.jpg | 200 | 100 | ["bird of paradise", "botany"]

This also makes it a bit easier to deal with tags of different lengths etc.

This is for example how the annotations are stored in this dataset: https://huggingface.co/datasets/biglam/nls_chapbook_illustrations.

Practically it's possible to upload CSV/JSON files etc that contain this metadata. The dataset script could then do a lookup to get relevant tags etc. @MikeTrizna I'm happy to have a look at how the data is structured at the moment and see how it might fit in a dataset loading script if useful.
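As a rough illustration of that lookup idea, the sketch below groups a long-format tag table (one row per image/tag pair) into a list of tags per image and attaches it to each image record. It only uses the standard library rather than a real dataset loading script, and the column names (`file_name`, `tag`) and sample rows are hypothetical.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical long-format tag table: one row per (image, tag) pair.
TAGS_CSV = """file_name,tag
image1.jpg,bird of paradise
image1.jpg,botany
image2.jpg,mammals
"""


def load_tag_lookup(csv_file):
    """Build {file_name: [tag, ...]} from a long-format CSV file object."""
    lookup = defaultdict(list)
    for row in csv.DictReader(csv_file):
        lookup[row["file_name"]].append(row["tag"])
    return dict(lookup)


def to_json_lines(images, lookup):
    """Emit one JSON object per image; images without tags get an empty list."""
    return "\n".join(
        json.dumps({**image, "tags": lookup.get(image["file_name"], [])})
        for image in images
    )


lookup = load_tag_lookup(io.StringIO(TAGS_CSV))
images = [
    {"file_name": "image1.jpg", "width": 200, "height": 100},
    {"file_name": "image3.jpg", "width": 50, "height": 50},
]
print(to_json_lines(images, lookup))
```

Keeping the tags as a list avoids inventing a concatenation scheme, and images with no tags simply carry an empty list.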

MikeTrizna commented 1 year ago

@nabsiddiqui, thanks for pointing out that copyright page. That's the exact info I was looking for. I don't think I pulled down the image-level copyright status with my download, but I can go back and get that. If a large majority are public domain, we might be better off filtering to those so that we can label the whole dataset as such.

nabsiddiqui commented 1 year ago

@davanstrien yes. What you are describing is what I am working on doing. It requires extracting the tags from the JSON and cleaning it up. There are non-tags in the tags section so there are some errors that I have to go through and it is about 300k files.

@MikeTrizna I'm fine with doing that but don't know how we would filter it.
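One possible way to filter, sketched under assumptions: if each image record ends up with a rights statement like the "Rights - Public Domain" EXIF field quoted earlier, the records could be counted by rights value and then restricted to the public domain ones. The field name `rights`, the accepted values, and the sample records are all hypothetical.

```python
from collections import Counter

# Assumption: each record carries a 'rights' string like the "Rights - Public Domain"
# EXIF field quoted earlier; the exact field name and values are hypothetical.
PUBLIC_DOMAIN_VALUES = {"public domain"}


def is_public_domain(record):
    """True if the record's rights statement matches a known public-domain value."""
    return (record.get("rights") or "").strip().lower() in PUBLIC_DOMAIN_VALUES


def rights_summary(records):
    """Count records per normalised rights value, to see how large the majority is."""
    return Counter((r.get("rights") or "unknown").strip().lower() for r in records)


records = [
    {"id": "1", "rights": "Public Domain"},
    {"id": "2", "rights": "CC BY-NC-SA 4.0"},
    {"id": "3", "rights": None},
]
public_domain = [r for r in records if is_public_domain(r)]
print(rights_summary(records))
print([r["id"] for r in public_domain])
```

Records with a missing rights value are counted as "unknown" and excluded from the filtered set, which errs on the side of caution.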

MikeTrizna commented 1 year ago

In checking out the example that @davanstrien linked, I noticed that the files are being pulled directly from NLS servers. I'm sure that the data custodians at BHL would be more amenable to directly serving these files from BHL servers (or the Smithsonian FigShare repo) to show a little bit more "ownership". I am in discussions to meet with the BHL data manager next week, and I can run it by them then.

So in the meantime, @nabsiddiqui, I would hold off on the JSON file you are building.

nabsiddiqui commented 1 year ago

Sounds good. I just have the script for the JSON file running in the background. I'll still upload it here just in case but we can delete it if it's not needed.

I wanted to practice using the Flickr API anyway.