ImageMonkey / imagemonkey-core

ImageMonkey is an attempt to create a free, public open source image dataset.
https://imagemonkey.io

The 10k milestone #46

Closed bbernhard closed 4 years ago

bbernhard commented 6 years ago

One of the next big topics on my list is to get our dataset to 10k images. With this ticket I want to brainstorm ideas in order to focus on the most promising approach first. This could also be a good opportunity to find an image source that goes together with one of the use cases defined here (see https://github.com/bbernhard/imagemonkey-core/issues/45)

Some ideas:

[1] https://github.com/openimages/dataset

dobkeratops commented 6 years ago

Unfortunately it looks like you need to blur out people's faces and license plates (at least according to Austrian law)

could that be part of the labelling process :) "cover all the faces" .. "cover all the license plates" .. although I suspect the law you describe precludes showing the image in the first place :(

Is this kind of law 'resolution dependent'? I would assume there's a difference between a general photo of a crowd versus a paparazzi-style photo zoomed in on one individual.. (spirit of the law vs letter of the law).

I suppose if you get these from wikimedia etc .. they've already got a pass over the legalities.

The LabelMe dataset is pretty good, but I worry it isn't as actively contributed to as something like wikimedia

bbernhard commented 6 years ago

could that be part of the labelling process :) "cover all the faces" .. "cover all the license plates" .. although I suspect the law you describe precludes showing the image in the first place :(

I think so too. :/

Is this kind of law 'resolution dependent'? I would assume there's a difference between a general photo of a crowd versus a paparazzi-style photo zoomed in on one individual.. (spirit of the law vs letter of the law).

I am not a lawyer, but I think I have read somewhere that it is okay if the picture focuses on a crowd instead of a single person.

I just checked and found quite a few projects that focus on license plate recognition [1] and face detection [2], so we might be able to create frames from dashcam videos and run those through some postprocessing steps to anonymize them. I think dashcam videos could really provide some interesting objects to highlight (zebra crossings, traffic signs, lanes...)

[1] https://github.com/openalpr/openalpr [2] https://github.com/ageitgey/face_recognition
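Just to sketch what such a postprocessing step could look like: a minimal example that uses the face_recognition library from [2] together with Pillow to blur detected faces in an extracted frame. The file names are placeholders, and a real pipeline would of course also need the license plate step via [1]:

```python
# Sketch: blur detected faces in a frame before it enters the dataset.
# Assumes the face_recognition and Pillow packages; paths are placeholders.
import face_recognition
from PIL import Image, ImageFilter

def anonymize_faces(in_path, out_path, blur_radius=12):
    image = face_recognition.load_image_file(in_path)    # numpy RGB array
    locations = face_recognition.face_locations(image)   # [(top, right, bottom, left), ...]

    frame = Image.fromarray(image)
    for top, right, bottom, left in locations:
        box = (left, top, right, bottom)
        region = frame.crop(box).filter(ImageFilter.GaussianBlur(blur_radius))
        frame.paste(region, box)
    frame.save(out_path)

anonymize_faces("frame_0001.jpg", "frame_0001_anon.jpg")
```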

bbernhard commented 6 years ago

I just looked a bit on YouTube and found some interesting driving videos (e.g. [1]).

It looks like the user [2] who made this is also on Reddit. So it might be worth a try to send him/her a private message to see if it would be ok to use frames from those videos for our dataset. I think it could be interesting to capture a frame every 10-30 seconds and feed that into the database. As we already know that it's about driving, we could also give it some appropriate labels/metalabels upfront.

[1] https://www.youtube.com/watch?v=rM8dbiH0kfY [2] https://www.youtube.com/channel/UCARyBOyRHDptlg6_jKCcHkQ/about
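A minimal sketch of that frame-sampling step, assuming OpenCV is available (the video filename and output prefix are just placeholders):

```python
# Sketch: grab one frame every N seconds from a downloaded driving video.
# Assumes OpenCV (cv2); the input filename is a placeholder.
import cv2

def sample_frames(video_path, out_prefix, every_seconds=15):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30          # fall back if FPS metadata is missing
    step = int(fps * every_seconds)

    index, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(f"{out_prefix}_{saved:04d}.jpg", frame)
            saved += 1
        index += 1
    cap.release()
    return saved

sample_frames("driving_video.mp4", "frame", every_seconds=15)
```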

bbernhard commented 6 years ago

Picking this one up again:

I was thinking quite a lot about it lately, and I think importing (parts of) the labelme dataset might be worth a try. The cool thing about the labelme dataset is that all images are public domain...which means that we do not need to add support for more restrictive CC licenses at the moment.

My main concern, however, is still that I don't know if the labelme creators would be happy to see their data appear on another site. I haven't found anything on their site that would prevent anyone from doing so. They even wrote this on their help page:

You are free to post your collected database of images and annotations on your own website. For example, you may do this when you release your database with your research publications.

So I think we should be fine, but reaching out to them first is probably more respectful than just scraping their data. So I'll try that first. Unfortunately it looks like it's harder than I thought. The github tracker seems to be unmaintained (at least there are a bunch of unanswered questions) and the discussion forum seems to be dead. @dobkeratops As you have contributed quite a lot to labelme...do you have an idea how I can get in contact with those guys?

dobkeratops commented 6 years ago

r.e. contacting 'the labelme guys' - I've no idea beyond the public email addresses given;

I suppose you could mail and inform them and say "do you have any objections.." - perhaps if scraping, make the process reversible, and perhaps keep a reference to the source (display a link to labelme?). I'm sure they'd welcome someone building on their work, if the data sharing is reciprocated (e.g. allow them to get access to the validations your tool focusses on)

dobkeratops commented 6 years ago

(r.e. validation - one thing about LabelMe is there's a lot of ad-hoc naming conventions. going back over their data might be really useful in sifting through those and mapping to something universal)

bbernhard commented 6 years ago

r.e. contacting 'the labelme guys' - I've no idea beyond the public email addresses given;

Many thanks, I'll try that then :)

I mean, even no answer would be ok for me. In that case I would assume that they don't have any objections to it.

(r.e. validation - one thing about LabelMe is there's a lot of ad-hoc naming conventions. going back over their data might be really useful in sifting through those and mapping to something universal)

Totally agreed. As a first step I would just look for annotations/validations that match our existing labels...mainly to get used to the annotation format labelme is using.

After that, I am pretty sure the label hierarchy analysis tool you wrote will be useful to get a better understanding of the dataset and the labels in use. I'll definitely keep a reference to labelme...just in case the project doesn't work out as expected and we need to delete the imported data again.

bbernhard commented 6 years ago

short status update: I haven't contacted the labelme guys yet, as I first want to verify if it's even possible to import the data with reasonable effort.

After some days with the labelme dataset, I finally managed to import the first annotated image into my ImageMonkey dev-environment (see attached screenshot).

[screenshot: labelme_first_import]

Not sure yet why the first car is labeled twice...either I messed something up, or the object is annotated twice.

At the moment I am only looking at polygons. But I just noticed that there are many more features available (scribbles, masks, bounding boxes...), so I'll definitely look at those next. :)

dobkeratops commented 6 years ago

awesome! I think polygons would be the lion's share of the data there.. might be worth adding bounding box support too. most likely it's labelled twice? the other possibility is that it keeps deleted old labels. one person might have labelled it coarsely, then another came along and decided to improve on it (but one cannot delete the other's labels). I suppose a labelling system can accumulate multiple labels to form a fuzzy greyscale to account for such ambiguity
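A tiny sketch of that "fuzzy greyscale" idea: rasterize each overlapping annotation of the same object into a binary mask and average them, so agreement shows up as bright pixels and disagreement as grey. This assumes Pillow and NumPy; the polygons are toy data, not real LabelMe annotations:

```python
# Sketch: build a soft mask from several overlapping polygon annotations.
import numpy as np
from PIL import Image, ImageDraw

def fuzzy_mask(polygons, width, height):
    masks = []
    for polygon in polygons:                              # polygon = [(x, y), ...]
        img = Image.new("L", (width, height), 0)
        ImageDraw.Draw(img).polygon(polygon, fill=255)
        masks.append(np.asarray(img, dtype=np.float32) / 255.0)
    return np.mean(masks, axis=0)                         # values in [0, 1]

soft = fuzzy_mask(
    [[(10, 10), (90, 12), (88, 80), (12, 78)],
     [(12, 14), (92, 10), (90, 82), (8, 80)]],
    width=100, height=100,
)
Image.fromarray((soft * 255).astype(np.uint8)).save("fuzzy_mask.png")
```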

bbernhard commented 6 years ago

most likely it's labelled twice?

yeah, it definitely looks like it. There is also a `deleted` property in the XML file, but that's set to 0 for both annotations.

My biggest "problem" at the moment is that I don't really know how good the dataset's quality is. I just looked at one sample XML file and found quite a few strange things (the same object labeled multiple times, annotations with just one poly point..). But I assume that's normal for a dataset that grew over the years. I really hope we can find a way to clean up a few of those things before importing the dataset.
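For reference, a small sanity-check sketch over a single LabelMe XML file, assuming the usual `annotation`/`object` layout with `name`, `deleted` and `polygon`/`pt` elements (the filename below is just a placeholder). It flags polygons with too few points and objects annotated twice with an identical polygon:

```python
# Sketch: spot degenerate polygons and exact duplicate annotations in a LabelMe XML file.
import xml.etree.ElementTree as ET

def check_annotation(xml_path, min_points=3):
    root = ET.parse(xml_path).getroot()
    seen = set()
    for obj in root.findall("object"):
        if obj.findtext("deleted", default="0").strip() == "1":
            continue                                    # ignore annotations flagged as deleted
        name = (obj.findtext("name") or "").strip().lower()
        points = tuple(
            (pt.findtext("x"), pt.findtext("y")) for pt in obj.findall("polygon/pt")
        )
        if len(points) < min_points:
            print(f"{xml_path}: '{name}' has only {len(points)} polygon point(s)")
        if (name, points) in seen:
            print(f"{xml_path}: '{name}' annotated twice with an identical polygon")
        seen.add((name, points))

check_annotation("example_annotation.xml")
```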

bbernhard commented 6 years ago

Some ideas:

I think we do not necessarily need to import everything from labelme (at least not in the first iteration). Maybe we can focus on the "high quality annotations" first:

Maybe I am too conservative here, but I would really like to avoid that we end up with a lot of wrong annotations.

dobkeratops commented 6 years ago

We could try to find "trustworthy" users

sounds good.. perhaps you could determine this through your validations - keep all the raw labelme annotations, but calculate a user validation score and weight the annotations by their user score. give unvalidated users a 50% score (whatever works..), then upgrade/downgrade users toward 100%/0% as validations come through
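A toy sketch of that scoring scheme; the prior weight is just an assumption for illustration, not anything ImageMonkey implements today:

```python
# Sketch: every imported user starts at 50% and moves toward 100%/0% as
# validations of their annotations come in.
def user_score(num_valid, num_invalid, prior_weight=10):
    # Laplace-style smoothing: with no validations the score is exactly 0.5,
    # and it converges to the observed ratio as evidence accumulates.
    return (num_valid + 0.5 * prior_weight) / (num_valid + num_invalid + prior_weight)

print(user_score(0, 0))    # 0.50  -> unvalidated user
print(user_score(8, 2))    # 0.65  -> mostly good annotations
print(user_score(40, 2))   # ~0.87 -> trusted user
```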

dobkeratops commented 6 years ago

(FYI, my own annotations on 'labelme' are under the username 'arandomlabeller'; they're mostly bounding boxes, some polygons, some hierarchical bounding boxes; they're mostly of street scenes i.e. cars/people/etc, (headlights/wheels/licenseplate, head/hand/foot etc as parts)

bbernhard commented 6 years ago

sounds good.. perhaps you could determine this through your validations - keep all the raw labelme annotations, but calculate a user validation score and weight the annotations by their user score. give unvalidated users a 50% score (whatever works..), then upgrade/downgrade users toward 100%/0% as validations come through

Interesting idea! If I got you right, then this will be done after the labelme dataset is imported, right?

My only concern with that is that it probably takes a pretty long time until we've figured out which annotations are good and which ones are bad. If the majority of the labelme dataset is of good quality it probably doesn't matter, but if a significant amount of data is wrong, we could increase our error rate temporarily, until we/other users have validated all the annotations. As the labelme dataset is pretty big, that will probably take a while...

At the moment I am doing some cleanup work every now and then in the ImageMonkey dataset. If I detect an annotation that is clearly wrong, I delete it so that it moves back into the "to be annotated" pool. That's just a temporary solution, until there is a "refinement/vote for the best annotation" mode in place.

As the number of annotations is growing slowly (~20-100 annotations per day at most), that works pretty well. My concern is that if I bulk import the whole labelme dataset, I will lose the ability to keep an eye on the dataset's quality. Of course, that approach doesn't scale...so there will be a point in time where it doesn't work anymore. And as the API is completely open, somebody else could just import the labelme dataset as we speak...there is no mechanism in place that would prevent that.

At the moment I am still playing a bit with the dataset...hopefully that leads to a better understanding of its structure and quality :)

(FYI, my own annotations on 'labelme' are under the username 'arandomlabeller'; they're mostly bounding boxes, some polygons, some hierarchical bounding boxes; they're mostly of street scenes i.e. cars/people/etc, (headlights/wheels/licenseplate, head/hand/foot etc as parts)

Awesome! I'll have a look, many thanks!

bbernhard commented 6 years ago

short status update:

My LabelMe import script is now at the point where it's possible to scrape the dataset for a given label and automatically upload all the pictures tagged with that label to ImageMonkey. Internally, a reference to the LabelMe dataset is kept, so we have a 1:1 mapping to the corresponding LabelMe picture. As a first test, I would just bulk-upload the pictures (each image shrunk to a max of 1000px; I think that should be enough) with labels that we are already aware of (e.g. 'car').

As we have a 1:1 mapping to the LabelMe dataset, it should easily be possible to add the annotations and more labels later on. I also changed the source code so that imports from a "trusted" dataset (like LabelMe) automatically add a validation entry by default.
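For illustration, a sketch of the pre-upload step described above: shrink each scraped image so the longer side is at most 1000px and keep the reference back to LabelMe. The field names and paths are assumptions, not the actual ImageMonkey schema:

```python
# Sketch: downscale a scraped LabelMe image and record its origin for the 1:1 mapping.
from PIL import Image

def prepare_for_upload(labelme_folder, labelme_filename, src_path, dst_path, max_size=1000):
    img = Image.open(src_path)
    img.thumbnail((max_size, max_size))        # keeps aspect ratio, only ever downscales
    img.save(dst_path)
    return {
        "file": dst_path,
        "labelme_ref": f"{labelme_folder}/{labelme_filename}",   # 1:1 mapping back to LabelMe
        "labels": ["car"],                                        # labels we already know upfront
    }
```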

Yesterday, I wrote a mail to the person who (to my knowledge) is the LabelMe founder and asked him if he has any objections to importing the data into ImageMonkey. Of course, we would keep a reference to LabelMe...so credit goes to LabelMe.

Let's see :). Currently I have 3000 'car' images sitting here, waiting to be imported ;)

bbernhard commented 4 years ago

closing this one as we reached 10k images a while ago :)