[dataset request] Flickr

data-preservation-programs / slingshot

Official public repository for feedback and data collection in Filecoin Slingshot

https://slingshot.filecoin.io


[dataset request] Flickr #454

Closed xinaxu closed 3 years ago

xinaxu commented 3 years ago

Dataset being requested

Dataset Name: Flickr
Dataset Description: Public photos published on Flickr
Size: 50TiB
File Format: jpg
Link to Dataset: https://www.flickr.com/explore

Note that this dataset is different from the existing dataset "Flickr Commons". Flickr Commons is a set of public photo collections provided by participating institutions, whereas this proposed dataset covers all public photos published on Flickr. The 50TiB size is based on an Internet search and may not be accurate. The size listed for Flickr Commons, which is also 50TiB, is likely wrong, since the number of participating institutions is very limited.

xinaxu commented 3 years ago

@dkkapur request to take a look

dkkapur commented 3 years ago

Thanks @xinaxu - generally for Slingshot, we qualify datasets that are not only openly accessible, but also have public utility (i.e., the community benefits from these being widely available on Filecoin and can leverage the data in some specific way). As a result, a lot of the datasets tend to be scientific datasets, software/OS images/containers that can be run, or training data for AI models. For this dataset, do you have a particular use in mind? Given that Flickr images can easily be accessed and retrieved through Flickr's website today, what is the use case you'd propose for having these on Filecoin? Thanks!

xinaxu commented 3 years ago

That makes sense. The value I see in this dataset is that it will be a large collection of images, organized by categories or tags. One potential use is supervised machine learning on those images and their tags/categories; the trained model can then be used for image classification. A regular Flickr user can browse images on the website, but cannot download all pictures for a specific category, e.g. those tagged with "cat". So depending on how we index and store them in the Filecoin network, it can be useful for clients to download all pictures with a specific tag, or from a specific user or public group.
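As a rough illustration of the indexing idea above, here is a minimal sketch. It assumes hypothetical photo records with `id`, `owner`, and `tags` fields; the record layout and the `build_index` helper are made up for illustration and do not reflect any actual Flickr or Filecoin API.

```python
from collections import defaultdict


def build_index(photos):
    """Map each (lowercased) tag and each owner to the set of photo ids."""
    by_tag = defaultdict(set)
    by_owner = defaultdict(set)
    for photo in photos:
        by_owner[photo["owner"]].add(photo["id"])
        for tag in photo["tags"]:
            # Normalize case so "Cat" and "cat" land in the same bucket.
            by_tag[tag.lower()].add(photo["id"])
    return by_tag, by_owner


# Hypothetical sample records (not real Flickr data).
photos = [
    {"id": "p1", "owner": "alice", "tags": ["cat", "pet"]},
    {"id": "p2", "owner": "bob", "tags": ["Cat"]},
    {"id": "p3", "owner": "alice", "tags": ["dog"]},
]

by_tag, by_owner = build_index(photos)
print(sorted(by_tag["cat"]))      # ['p1', 'p2']
print(sorted(by_owner["alice"]))  # ['p1', 'p3']
```

With an index like this stored alongside the images, a client could look up the ids for a tag or user first and then retrieve only those photos.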

orvn commented 3 years ago

@xinaxu @dkkapur, FYI, flickr-commons already exists in the current datasets

xinaxu commented 3 years ago

@orvn Thanks for pointing that out. The Commons are photos published by a selected list of organizations. In contrast, what I am proposing here is the photos published by all public users. I was initially looking at flickr-commons, but then I figured that it is very easy for any user to download it directly from Flickr, and that it is a very small collection of photos (much less than 50TiB, probably around 500MiB). That led me to think about crawling the whole Flickr site for all public photos, which is where this proposal comes from.

dkkapur commented 3 years ago

@xinaxu - for specific datasets for training and tagging, I would suggest we use openly available Kaggle datasets instead where possible, since those are likelier to be relevant and standardized, and to comply with the standard policies in the Kaggle terms. I'm not sure what Flickr allows/disallows, and I want to be careful not to accidentally push miners to store anything that may violate their regional policies and laws. Using the same standard dataset may also help ensure that multiple replicas are stored, geographically distributed, and available across the network. What do you think of using specific Kaggle datasets instead - https://www.kaggle.com/datasets? An alternative could be to crawl Flickr, manually ensure that the images found are compliant for miners, and then standardize that dataset at an alternative download link (maybe even by hosting your own Kaggle dataset) before proposing it for Slingshot.

xinaxu commented 3 years ago

Sounds good. I'll close this one for now.