FelicePollano / WatermarkDataSetBuilder

Create a dataset of images with and without a random watermark, based on images downloaded from Pexels https://www.pexels.com/
MIT License

WatermarkDatasetBuilder

This is a .NET Core project that creates a dataset of watermarked and non-watermarked pictures, intended for training a classifier that tells the two apart. The project supports my own experiment while attending the Coursera course Convolutional Neural Networks in TensorFlow. The watermarks are generated synthetically from random meaningless words, with random colors, sizes, and positions.

The images are downloaded through the Pexels API.

Pexels provides high quality and completely free stock photos licensed under the Pexels license. All photos are nicely tagged, searchable and also easy to discover through our discover pages.

You can find more details about the license here

If you want to run the code to download a dataset, you first need to obtain an API key from here.

How to build

This program requires the .NET Core SDK 3.1 or later to compile. Clone the project and, once the SDK is installed, run this command in the root folder of the project:

dotnet build

How to run

Then you can launch the download with:

dotnet run <YOUR API KEY> <output folder> <searchitem1> optsearchitem2 optsearchitem3 ... optsearchitemN

Once the download starts, you can interrupt it with Ctrl-C; otherwise it continues until it reaches the API limit.

Dataset structure

The dataset is structured as below:

/output-folder
|-----train
|    |------no-watermark
|    |------watermark
|-----valid
|    |------no-watermark
|    |------watermark
| .checkpoint

This layout should make it easy to load the images with the Keras ImageDataGenerator.
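As an illustration of how this folder layout maps onto Keras, here is a minimal Python sketch. It is not part of this project: the function names, the 150x150 target size, the batch size of 32, and the rescaling factor are all assumptions chosen for the example, and TensorFlow must be installed separately.

```python
from pathlib import Path

# Class folder names as they appear in the dataset structure above.
CLASSES = ("no-watermark", "watermark")


def split_dirs(root):
    """Return the train and valid folders under the dataset root."""
    root = Path(root)
    return root / "train", root / "valid"


def make_generators(root, target_size=(150, 150), batch_size=32):
    """Build Keras generators for both splits (hypothetical helper).

    target_size and batch_size are illustrative defaults, not values
    prescribed by this project.
    """
    # Imported lazily so split_dirs works without TensorFlow installed.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    train_dir, valid_dir = split_dirs(root)
    datagen = ImageDataGenerator(rescale=1.0 / 255)  # scale pixels to [0, 1]
    common = dict(
        target_size=target_size,
        batch_size=batch_size,
        class_mode="binary",      # two classes, one label per image
        classes=list(CLASSES),    # no-watermark -> 0, watermark -> 1
    )
    train_gen = datagen.flow_from_directory(str(train_dir), shuffle=True, **common)
    valid_gen = datagen.flow_from_directory(str(valid_dir), shuffle=False, **common)
    return train_gen, valid_gen
```

Calling `make_generators("output-folder")` would return two iterators that can be passed to `model.fit`. Passing `classes` explicitly fixes the label order instead of relying on alphabetical folder sorting.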

Data are split between train and valid in an 80/20 proportion. The watermark/no-watermark ratio is intended to be 50/50, but it can vary slightly due to image processor errors.
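Since the actual proportions can drift from these targets, it may be useful to check them on a downloaded dataset. The following Python sketch is not part of this project; it only assumes the folder layout shown above, and the function names are made up for the example.

```python
from pathlib import Path

SPLITS = ("train", "valid")
CLASSES = ("no-watermark", "watermark")


def count_images(root):
    """Count the files in each split/class folder under the dataset root."""
    root = Path(root)
    counts = {}
    for split in SPLITS:
        for cls in CLASSES:
            folder = root / split / cls
            if folder.is_dir():
                counts[(split, cls)] = sum(1 for p in folder.iterdir() if p.is_file())
            else:
                counts[(split, cls)] = 0  # missing folder counts as empty
    return counts


def ratios(counts):
    """Return the train fraction and watermark fraction, or None if empty."""
    total = sum(counts.values())
    if total == 0:
        return None
    train = sum(n for (split, _), n in counts.items() if split == "train")
    marked = sum(n for (_, cls), n in counts.items() if cls == "watermark")
    return {
        "train_fraction": train / total,
        "watermark_fraction": marked / total,
    }
```

For a dataset that matches the intended proportions, `ratios(count_images("output-folder"))` should report a train fraction close to 0.8 and a watermark fraction close to 0.5.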

Please note the (hidden) file .checkpoint, whose purpose is to resume the download from where it left off after any kind of stop (even one caused by the API limit). If, for some reason, you want to start from scratch, just remove this file.

If you want to see some notebooks created using this dataset, have a look at the Kaggle page: there are some projects from me and from others.

Examples

Easy to recognize watermarks

Difficult to recognize watermarks

Can you spot it?