ImageMonkey / imagemonkey-core

ImageMonkey is an attempt to create a free, public open source image dataset.
https://imagemonkey.io
47 stars 10 forks

Label list stats and cleanup #287

Open dobkeratops opened 3 years ago

dobkeratops commented 3 years ago

Not sure if this is doable in spare moments..

Would it be possible to (i) get stats, i.e. the number of labels and number of annotations per label suggestion (I can see the list in api.imagemonkey.io/v1/label/suggestions)? Maybe this calculation already exists to figure out trending labels.

(ii) submit replacements to clean up the database (maybe in a JSON file, {"mistake": "replacement", ...})?

(E.g. clean up typing errors and simplify alternative suggestions for conventions. I speculate the “/” hard separators for combining will make life easier; there’s a bunch of “Foo or bar” type workarounds that would make the labels harder to use.)

I see the stats page lists “20561 suggestions”. I’d guess that between mistakes and duplicates (spaces vs. “_”, etc.) that could be halved, and when extracting “/” combinations maybe halved again.

When making suggestions lately I’ve tried to stick to underscores and slashes, e.g. “sports_car/luxury_car” makes the blend of 2 labels clearer than “sports car/luxury car”, where a parser has to consider “(sports)(car/luxury)(car)”. But there are older suggestions with spaces.
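A minimal sketch of how a parser could treat “/” as the hard separator and normalise spaces to underscores inside each part. The function name `split_blend` and the lowercasing step are assumptions for illustration, not something ImageMonkey is known to do.

```python
def split_blend(label: str) -> list[str]:
    """Split a blended suggestion on '/' and normalise each part.

    Assumption: '/' always binds more loosely than spaces or underscores,
    so "sports car/luxury car" means (sports car) / (luxury car).
    """
    parts = []
    for part in label.split("/"):
        # Treat underscores and runs of whitespace as the same word separator.
        words = part.replace("_", " ").split()
        if words:
            parts.append("_".join(word.lower() for word in words))
    return parts


print(split_blend("sports_car/luxury_car"))
print(split_blend("sports car/luxury car"))
```

Both calls give the same two parts, so older suggestions written with spaces would collapse onto the underscore form.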

bbernhard commented 3 years ago

Would it be possible to (i) get stats, i.e. the number of labels and number of annotations per label suggestion (I can see the list in api.imagemonkey.io/v1/label/suggestions)? Maybe this calculation already exists to figure out trending labels.

I think that should be doable. Would an API endpoint be okay for that?

(ii) submit replacements to clean up the database (maybe in a JSON file, {"mistake": "replacement", ...})?

That's a great idea! Would you prefer the cleanup job to be a one-shot mechanism (i.e rename the misspelled labels) or should that be a recurring job (e.g: "every time someone adds the label 'carr' rename it automatically in the background to 'car'")? The former shouldn't be that hard to implement, the latter most probably requires more work to get right.

It's probably a pretty tedious job to go through all the labels and fix them..but if you would like to compile a list I would be very grateful for that! Regarding the format: I don't have any particular format in mind. Of course, JSON would be great, as it's easy to parse. But writing JSON by hand is probably a nightmare. ;) So, I don't mind if it's something else (yaml, csv, or maybe a custom text protocol with your own custom separator, etc. ), as long as it's somehow parse-able without ambiguity.
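As a rough sketch of the one-shot variant, assuming the corrections arrive as a flat JSON object of `{"mistake": "replacement"}` pairs: the helper name `apply_replacements` and the sample entries are made up for illustration, and the real cleanup would run against the database rather than a Python list.

```python
import json


def apply_replacements(labels, replacements):
    """Return the labels with every known mistake swapped for its replacement."""
    return [replacements.get(label, label) for label in labels]


# Hypothetical corrections file content; only "carr" -> "car" comes from the discussion.
replacements = json.loads('{"carr": "car", "sports car": "sports_car"}')

cleaned = apply_replacements(["carr", "dog", "sports car"], replacements)
print(cleaned)  # ['car', 'dog', 'sports_car']
```

Labels that are not listed pass through unchanged, so the file only needs to contain the mistakes.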

dobkeratops commented 3 years ago

Writing JSON by hand is OK.

An endpoint is fine for the stats. (could just add a note in the developer section to document it) Regarding the corrections .. a one shot cleanup is also fine. Ultimately the idea is to get the important labels officially supported... then people will find them through the menus & autocomplete. That’s the bigger goal.

What I’m hoping is the cleanup will increase the number of “hits” you’ll get for label suggestions.

I have to think a bit about how the suggestions will fit with the original plan (properties). Maybe you could map them with aliases, or clean up with remapping. (Things like “luxury car” etc. They do often overlap, e.g. “luxury convertible car”, which does make sense with the properties idea. What I’m hoping is that some combinations like that could be exposed directly as single labels so they can be annotated in one step, but the properties system would still allow saying more later.)

bbernhard commented 3 years ago

An endpoint is fine for the stats. (could just add a note in the developer section to document it)

Perfect!

Regarding the corrections .. a one shot cleanup is also fine. Ultimately the idea is to get the important labels officially supported... then people will find them through the menus & autocomplete. That’s the bigger goal.

What we could also consider is doing that in a two step process. i.e: first fix all the misspelled labels, remove the placeholder labels (I think there are some placeholder labels like qq, a, etc in there which are probably already mostly obsolete) and change them to match a specific labeling schema (e.g use spaces instead of '_', etc). Once we have that, we could think about whether we want to utilize the properties system to split them up even further (maybe we are already at a point where we could automatically split them up according to some predefined rules). But not sure if this makes sense or will just be more work in the end?

bbernhard commented 3 years ago

There's now a new API endpoint: https://api.imagemonkey.io/v1/label/suggestions/usage which returns a list of all label suggestions + the number of labels/annotations

(I am really bad at naming stuff, so it's possible that I'll rename the /usage endpoint again at some point, if I find a better suited name.)

dobkeratops commented 3 years ago

Nice, that’ll help a lot guiding cleanup and looking for tasks

dobkeratops commented 3 years ago

Just found my previous experiments using the label list. I had something reading in “GloVe” word vectors, and something to find the unique words from compound labels. EDIT: and yes, it finds the unmatched words, i.e. spelling mistakes: approx. 1300, although there may be a lot more label suggestions using them. (Some of the mistakes are 2 separate labels without a separator.)

It found a list of 4000 unique words used from all the suggestions (could just try to train using the images and that word list from all the split labels, ie 4000d output )

Looks like the spelling mistakes are all quite low frequency (eg less than 10 each)
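Roughly, that experiment can be sketched like this. The `vocabulary` argument stands in for the set of words that have a GloVe vector (loading the vectors is not shown), and the function names are hypothetical.

```python
import re


def unique_words(labels):
    """Collect the distinct words used across all compound labels."""
    words = set()
    for label in labels:
        # Split on whitespace, underscores and the '/' blend separator.
        words.update(w for w in re.split(r"[\s_/]+", label.lower()) if w)
    return words


def unmatched_words(labels, vocabulary):
    """Words that never appear in the vocabulary, i.e. likely spelling mistakes."""
    return sorted(unique_words(labels) - set(vocabulary))


labels = ["sports_car/luxury_car", "carbboard box"]
vocabulary = {"sports", "car", "luxury", "box", "cardboard"}
print(unmatched_words(labels, vocabulary))  # ['carbboard']
```

Two labels run together without a separator would also show up here as a single unmatched word.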

It’s definitely nice being able to sort the suggestions by frequency now. I can also look for “the most popular unannotated labels” etc.
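For example, something along these lines. The response shape of the usage endpoint is not shown in this thread, so the field names `name`, `labels` and `annotations` are assumptions, and the sample records are invented purely to make the sketch runnable.

```python
def most_popular_unannotated(records, limit=10):
    """Suggestions that were labelled often but never annotated, most frequent first."""
    unannotated = [r for r in records if r["annotations"] == 0]
    unannotated.sort(key=lambda r: r["labels"], reverse=True)
    return [r["name"] for r in unannotated[:limit]]


# Invented sample data in the assumed shape, not real ImageMonkey numbers.
sample = [
    {"name": "sports_car", "labels": 40, "annotations": 12},
    {"name": "metal box", "labels": 3, "annotations": 0},
    {"name": "hacksaw", "labels": 25, "annotations": 0},
]
print(most_popular_unannotated(sample))  # ['hacksaw', 'metal box']
```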

dobkeratops commented 3 years ago

A few more stats. I’ve tried filtering out the suggestions that come with graph nodes (e.g. “car->sportscar”, “tool->hacksaw”, etc.). There are 2728 suggested graph links. There are 658 nodes from a few common broad labels: tool, container, vehicle, animal, musical_instrument, food, person, furniture. There are 313 “root nodes” from which everything else is reachable (a few spelling mistakes and plurals here).

there are unfortunately some spelling mistakes in the graph, but far fewer than in the full suggestion list

Some of these are combinations that would be mappable to properties; potential properties have nodes, e.g. “luxury sports car” is documented as “luxury car->luxury sports car”, “sports car->luxury sports car”. There are nodes for materials, e.g. “metal object->metal box”, “plastic object->plastic water bottle”, etc.
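A small sketch of how link and root-node counts like these could be derived from “parent->child” suggestions. The function names are hypothetical and this is not the script actually used for the numbers above.

```python
from collections import defaultdict


def parse_edges(links):
    """Turn 'parent->child' strings (chains allowed) into a parent -> children map."""
    children = defaultdict(set)
    for link in links:
        parts = [part.strip() for part in link.split("->")]
        for parent, child in zip(parts, parts[1:]):
            children[parent].add(child)
    return children


def root_nodes(children):
    """Nodes that appear as a parent but never as a child."""
    all_children = set().union(*children.values())
    return sorted(set(children) - all_children)


def reachable(children, start):
    """All nodes reachable from start, via iterative depth-first search."""
    seen, stack = set(), [start]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


links = [
    "car->sportscar",
    "tool->hacksaw",
    "luxury car->luxury sports car",
    "sports car->luxury sports car",
]
graph = parse_edges(links)
print(root_nodes(graph))  # ['car', 'luxury car', 'sports car', 'tool']
```

Misspelled or plural parents end up as extra roots in this scheme, which matches the observation that the root list contains a few mistakes.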

bbernhard commented 3 years ago

Awesome, thanks for sharing these stats!

Some of these are combinations that would be mappable to properties; potential properties have nodes, e.g. “luxury sports car” is documented as “luxury car->luxury sports car”, “sports car->luxury sports car”. There are nodes for materials, e.g. “metal object->metal box”, “plastic object->plastic water bottle”, etc.

I think removing materials from label names is probably a good first candidate for the properties system, e.g. we could map the label name metal box to box with the material property metal. The only downside is that it's probably no longer clear at first glance that the image has a metal box label. As the label is then just a box, you would need to click on the specific annotation to see its properties. Not sure though if that works with the way you label/annotate images, or if that change would make the work more difficult for you?
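A minimal sketch of such a mapping rule, assuming the material is always the first word of the label. The `MATERIALS` set and the `split_material` helper are hypothetical; the real list of materials would have to come from the label data.

```python
# Hypothetical material list; only "metal" and "plastic" appear in this thread.
MATERIALS = {"metal", "plastic"}


def split_material(label):
    """Map e.g. 'metal box' to ('box', {'material': 'metal'})."""
    words = label.replace("_", " ").split()
    if len(words) > 1 and words[0] in MATERIALS:
        return " ".join(words[1:]), {"material": words[0]}
    # No leading material word: keep the label as it is, with no properties.
    return label, {}


print(split_material("metal box"))             # ('box', {'material': 'metal'})
print(split_material("plastic water bottle"))  # ('water bottle', {'material': 'plastic'})
```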

dobkeratops commented 3 years ago

this might take some thought..

In favour of properties:

in favour of the graph nodes:

Is there a way to get the best of both ?

I’ve used “/” for general label blending a lot. I’m hoping a parts list (head, wheel, hand, foot, handle, door, etc.) will allow filtering those out if we need it; otherwise I hope treating parts as yet another blendable word works fine (“head/cat” pixels are valid for a general “head” detector, or a general “cat” detector).

“/” blending can combine individual property combinations, e.g. “luxury_car/parked_car”, “sportscar/derelict_car”, etc. I’ve tried to do this in some places. In others, when combining multiple properties (“open top sports car”), I’ve set up graph nodes for that (“open top car->open top sports car”, “sportscar->open top sports car”).

What are people most likely to use? What will cause the minimum overhead for people trying to use this data? What is more extendable? What makes the best use of the systems and UI you’ve built so far? What’s the least work in terms of code retrofits ?

Personally I find the label graph more appealing overall, e.g. being able to place arbitrary-depth organisation over the existing labels (e.g. taxonomy of life: “life form->animal->vertebrate->mammal->feline->domestic cat” lets you group cat, lion, tiger to make a trainable output “feline”, or one for “all vertebrates” combining what’s in common between lizards, mammals, etc.), and the arbitrary blend offers a way to express a multi-property blend.

But ultimately both the UI and data should be convertible both ways .
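To illustrate the grouping idea, here is a sketch that expands a label into itself plus all of its ancestors, so an annotation of “lion” also counts towards “feline” and “vertebrate”. The `parents` mapping and both function names are assumptions; multiple parents per label are allowed because the suggested graph has them.

```python
def ancestors(parents, label):
    """All labels reachable by following child -> parent links upward."""
    seen, stack = set(), [label]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:  # also guards against accidental cycles
                seen.add(parent)
                stack.append(parent)
    return seen


def training_targets(parents, label):
    """The label itself plus every broader label it should count towards."""
    return {label} | ancestors(parents, label)


# Child -> set of parents, following the taxonomy example above.
parents = {
    "domestic cat": {"feline"},
    "lion": {"feline"},
    "tiger": {"feline"},
    "feline": {"mammal"},
    "mammal": {"vertebrate"},
    "vertebrate": {"animal"},
    "animal": {"life form"},
}
print(sorted(training_targets(parents, "lion")))
```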


dobkeratops commented 3 years ago

A few spelling mistakes in graph nodes, one important semantic mistake:

“toy->toy->toy->vehicle->toy_bus” should be “toy_vehicle->toy_bus” (This one would incorrectly flag all vehicles as toys. I added a “not_toy->vehicle” to draw attention to it)

box->carbboard_box = cardboard_box

car->couoe_car = coupe_car

There are a few more spelling mistakes here and there, but glancing through, that’s the only important “semantic mistake” I found so far.

bbernhard commented 3 years ago

Many thanks, I'll write a small script to fix those issues in the database!

Regarding the graph nodes/properties discussion: That's a really tough call. My main argument for the properties system is that it allows existing annotations to be improved incrementally without drawing polygons over and over again. So, e.g., someone starts by annotating a car. The next person could then add the property old to the existing annotation, to make it clear that it's an old car. Building upon that, the next one could then add a color property, or add rusty, etc. So, ideally you have to annotate an object only once and then refine it with properties. The advantage would be that you could do all the refinement on devices with smaller screens.

If you want to refine an annotation done in the graph nodes style, I think the only option is to copy the existing label name, add the missing information and draw the polygon again.
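The difference can be sketched with a toy data structure. This is not ImageMonkey's actual data model; the `Annotation` class, its fields and the `refine` method are purely illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    label: str
    polygon: list[tuple[float, float]]
    properties: set[str] = field(default_factory=set)

    def refine(self, prop: str) -> None:
        """Attach a property to the existing annotation; the polygon is untouched."""
        self.properties.add(prop)


# Drawn once by the first contributor.
car = Annotation("car", [(0, 0), (4, 0), (4, 2), (0, 2)])

# Later contributors only add properties, no redrawing needed.
car.refine("old")
car.refine("rusty")
print(car.label, sorted(car.properties))  # car ['old', 'rusty']
```

In the graph nodes style the same refinement would mean a new label name and a second polygon.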

But I can also see that the graph nodes style has some advantages too. As you don't need to jump between the labels and the properties list, it's easier to spot where information is missing. It's basically just a flat list which can be scanned pretty easily by eye. For me personally that's one of the biggest weak points of the properties approach. It's not easily possible to see how well the image is covered. Another weakness of the properties system is probably that it's not that comfortable to use at the moment. I think a bunch of hotkeys wouldn't hurt to make it accessible more easily and more convenient to use. The properties system was basically just a small experiment to see whether something like that could work...I guess there are probably a few things we could tweak.

For me personally both styles (graph node and properties) are fine. I think in the end it's not only important how the data is stored, but also how easy it is for people to contribute. As you are still by FAR the most active user, I don't want to implement something that kills your workflow. I've read so many articles over the years where developers implemented some cool sounding features which completely killed the service. Simply because they didn't listen to the needs of their users.