ImageMonkey / imagemonkey-core

ImageMonkey is an attempt to create a free, public open source image dataset.
https://imagemonkey.io

general label combining #141

Open dobkeratops opened 6 years ago

dobkeratops commented 6 years ago

Thinking again about what would enhance the dataset, and reconsidering after the huge improvement of the recent tweaks..

Using a curated label list has the advantages of consistency, searchability, and (importantly) spam/abuse control (and possibly localization/translation?)

Allowing generalised label 'blending' (foo/bar) might work around its shortcoming (i.e. lack of descriptive power) by multiplying out a curated vocabulary.

Imagine the following roadmap: (i) allow the system to blend any 2 (or more?) labels. Would it need to track this in a table, or could the label itself be extended to a string of IDs?

(ii) include some labels intended mostly as prefixes .. adjectives etc.

The main ones that spring to mind would be materials (glass, metal, plastic, stone, brick, wood, paper, ...). Glass is a difficult one (glass (material) vs glass (cup)). I was noting this re: "wall" and "building" (stone wall, brick building, cobblestone pavement, cobblestone road, ...)

(iii) narrow it down by including a system to map synonyms to the combinations, and/or attaching metadata that indicates how a label combines:

(iv) use this system and synonyms to rephrase the data or eliminate meaningless combinations

"rephrase": build up a table of more natural wordings.. and eventually allow the natural wordings for data entry

glass/orange juice -> glass of orange juice .. assume the meaning 'drinkware' because orange juice is a drink
glass/cup -> 'glass (drinkware)' (https://en.wikipedia.org/wiki/Glass_(disambiguation))
glass/object -> 'glass object'
glass/window -> 'glass window' (default assumption for window)
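The "string of IDs" idea from point (i) could look roughly like the sketch below. All names here (`LABELS`, `blend`, `components`, the `+` separator) are hypothetical illustrations, not anything that exists in imagemonkey-core:

```python
# Hypothetical sketch: a blended label stored as an ordered string of curated
# label IDs, so "glass/window" stays searchable by either component.
LABELS = {1: "glass", 2: "window", 3: "wall", 4: "brick"}
NAME_TO_ID = {name: lid for lid, name in LABELS.items()}

def blend(*names):
    """Encode a blended label as a string of IDs, e.g. glass/window -> '1+2'."""
    return "+".join(str(NAME_TO_ID[n]) for n in names)

def components(blended):
    """Decode the blended key back into the individual curated labels."""
    return [LABELS[int(part)] for part in blended.split("+")]

key = blend("glass", "window")
print(key)              # "1+2"
print(components(key))  # ['glass', 'window']
```

One nice property of this encoding is that no new table is strictly required: the blended key is self-describing, and any component can still be recovered for search.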

One question: would phase one be too risky to put into production before the rest are sorted out? I can imagine a limited implementation, e.g. 'some labels MUST be used in combination', before everything is figured out, and I'd assume the number of nonsensical combinations would be low. They could be blacklisted.. or held in waiting, like the existing free labels?

perhaps a way to deal with ambiguous words like glass (material or cup) or orange (color or fruit) would be to allow just writing them in combination, but then to verify how they translate in combination and replace them with a specific variant:

glass/orange juice -> glass (cup)/orange juice -> glass of orange juice
glass/water -> glass (cup)/water (beverage) -> glass of water
glass/window -> glass (material)/window -> glass window
water/river -> water (material)/river -> body of water, maybe a river
river/lake -> river or lake (both variations are an object)
van/truck -> van or truck (both variations are an object)
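One way to sketch that disambiguation step: resolve an ambiguous word by looking at the category of the word it is combined with. The categories and rules below are invented for illustration (they are not from the ImageMonkey codebase):

```python
# Hypothetical sketch: pick a specific sense for an ambiguous label based on
# the category of its combination partner (a drink implies drinkware, etc.).
CATEGORY = {"orange juice": "beverage", "water": "beverage",
            "window": "object", "river": "geography"}

SENSE_RULES = {
    "glass": {"beverage": "glass (drinkware)", "object": "glass (material)"},
    "water": {"geography": "water (body)"},
}

def resolve(ambiguous, partner):
    """Return a specific variant of `ambiguous`, or the word unchanged."""
    cat = CATEGORY.get(partner)
    return SENSE_RULES.get(ambiguous, {}).get(cat, ambiguous)

print(resolve("glass", "water"))   # "glass (drinkware)"
print(resolve("glass", "window"))  # "glass (material)"
```

Combinations with no rule would fall through unchanged, which matches the idea of holding unknown combinations in waiting rather than rejecting them outright.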

I suppose it would be much safer to demand the ambiguous words be written out explicitly from the outset, i.e. just include the labels glass (drinkware) and glass (material), and rely on autocomplete to discover them. (Heh, right here I have a glass teapot; glass and teapot are both objects you'd find in the kitchen. Would that confuse glass/teapot, or is it enough to assume the material or part comes first for prefixes..)

I must admit this sounds like it could explode in complexity pretty quickly or produce subtle paradoxes; on the other hand, once you've blended 2 words, does that narrow things down quickly enough most of the time?

starting out with explicit disambiguation of potentially ambiguous words might control it enough, I guess?

ok, another troublesome example: wheel/chair -> is that a wheelchair (for disabled people, https://en.wikipedia.org/wiki/Wheelchair), or a part (wheel) of an office chair? (One way around that: there's the more specific castor wheel, which is used both for office chairs and bins: https://www.castors-online.co.uk/ https://en.wikipedia.org/w/index.php?title=Castor_wheel)

is it enough to just brainstorm each potential part or material word first? If you add specifics for the clashing combinations, do you avert the hazard? I.e. if the list includes a dedicated wheel chair and castor wheel, is it safe to allow wheel as a general-purpose part?

wheel chair is a different object, and you probably still want to be able to say wheel_(part) of wheelchair_(object)

'a chair with a wheel' could be a wheelchair or an office chair.. but saying castor wheel narrows that down to office chair
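That hazard-averting rule could be sketched as: when the curated list contains a dedicated compound label, prefer it over combining the generic parts. The table and names below are illustrative assumptions, not an actual ImageMonkey mechanism:

```python
# Hypothetical sketch: prefer a dedicated compound label ("wheelchair",
# "castor wheel") so wheel/chair doesn't silently mean "wheel (part) of chair".
COMPOUND_LABELS = {
    ("wheel", "chair"): "wheelchair",
    ("castor", "wheel"): "castor wheel",
}

def combine(a, b):
    """Return a dedicated compound label if one exists, else a part-of pairing."""
    if (a, b) in COMPOUND_LABELS:
        return COMPOUND_LABELS[(a, b)]
    return f"{a} (part) of {b}"

print(combine("wheel", "chair"))         # "wheelchair"
print(combine("wheel", "office chair"))  # "wheel (part) of office chair"
```

Under this scheme, wheel stays usable as a general-purpose part, because the clashing combination is intercepted by the dedicated label first.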

dobkeratops commented 6 years ago

perhaps a safe approach would be to start with potential parts and materials written explicitly

objects:

forget generalised object blending; just allow part and material prefixes for regular labels. You currently store labels as IDs? You could approach compaction and discovery as a separate problem (i.e. figuring out shortcuts from individual prefix words, if we can figure out an unambiguous system)

bbernhard commented 6 years ago

using a curated label list has an advantage of consistency, searchability, and (importantly) spam/abuse-control (and possibly localization/translation?)

definitely!

I REALLY like the idea... it goes a bit in the direction of what I meant with "context aware labels" - I think having such a system in place could be really beneficial. It of course adds some complexity, but I think the staged concept (trending -> production) gives us a bit of safety here, which lets us introduce new metadata (or however we call it) gradually.

I suppose it would be much safer to demand the ambiguous words be written out explicit from the outset.. i.e. just include labels glass (drinkware) glass (material)

I think while we are collecting labels with the trending labels concept, that's perfectly fine. Writing out labels in an explicit way definitely helps to understand what the label is all about. But for the final form... I am not sure... still a bit torn back and forth. My gut feeling is that we might end up with really long labels to cover all the edge cases (label xy-but-not-vz-while-still-ab)

It's just my gut feeling (so I might be totally wrong on that ;))..but I think that with a generalized attributes/properties system it could be easier to form rules. Imagine that we have a color, size, material attribute.

We could now enforce that each label can have one or more colors and different materials, but just one size. I think that could help quite a bit when the number of labels per image grows. Another advantage I would see is that we would have instant queryability ("give me all labels where material = glass")
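The cardinality rule and the instant-queryability idea above could be sketched like this. The schema and all names are assumptions for illustration only, not ImageMonkey's actual data model:

```python
# Hypothetical sketch: a label may carry one or more colors, several materials,
# but only one size; attributes are then directly queryable.
MAX_PER_LABEL = {"color": None, "material": None, "size": 1}  # None = unlimited

def add_attribute(attrs, kind, value):
    """Attach an attribute to a label, enforcing the per-kind cardinality."""
    limit = MAX_PER_LABEL[kind]
    values = attrs.setdefault(kind, [])
    if limit is not None and len(values) >= limit:
        raise ValueError(f"label already has {limit} '{kind}' attribute(s)")
    values.append(value)

attrs = {}
add_attribute(attrs, "color", "black")
add_attribute(attrs, "color", "silver")  # fine: multiple colors allowed
add_attribute(attrs, "size", "large")
# add_attribute(attrs, "size", "small")  # would raise: only one size allowed

# "give me all labels where material = glass"
labels = {"bottle": {"material": ["glass"]}, "table": {"material": ["wood"]}}
print([name for name, a in labels.items() if "glass" in a.get("material", [])])
```

In a real backend this would presumably be a database constraint plus an indexed query rather than an in-memory dict, but the rule structure is the same.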

Another hope of mine is that we can hide some information behind icons... imagine a small "weight icon" below the label name that tells the weight of the object. I think with more and more labels it could be nice to group information that belongs together (maybe with some collapse/expand)

While a more restrictive metadata system could be quite helpful to new users (they can hardly do anything wrong, as we have already laid out all the valid label combinations), it makes things harder for power users. They now have to alternate between keyboard and mouse again - unless we find a way to let users define metadata via some sort of metadata language. (I am thinking about the famous vim here... it's hard to learn, but once you've mastered it, it really speeds up your workflow)

bbernhard commented 6 years ago

perhaps a safe approach would be to start with potential parts and materials written explicitly

forget generalised object blending; just allow part and material prefixes for regular labels.. you currently store labels as IDs? .. you could approach compaction and discovery as a separate problem (i.e. figuring out shortcuts from individual prefix words, if we can figure out an unambiguous system)

Oops, haven't seen that comment before. ;) If I am not completely mistaken, then that's almost exactly the same thing that I had in mind. :D

Maybe we can start with this simple rule: a specific part (e.g. eye) always needs a parent label (e.g. person).

Maybe we can use different icons in the label dropdown to distinguish parts from labels. That way we could get rid of the (part) suffix. All that's left then is an easy way to "link" parts with labels (drag and drop? a special character that serves as a link operator? e.g. eye -> person, ...)

I think that could work well as long as we only have parts and labels. As soon as we want other attributes like material, color, ... it could get more difficult. Not sure if the -> operator is sufficient then..
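A minimal sketch of that link operator plus the "a part always needs a parent" rule might look like the following. The syntax and the `PARTS` set are assumptions taken from this thread, not an implemented ImageMonkey feature:

```python
# Hypothetical sketch: parse "eye -> person" into (part, parent) and enforce
# that a known part label is never submitted without a parent.
PARTS = {"eye", "ear", "wheel"}

def parse_label(expr):
    """Return (label, parent), where parent is None for a plain label."""
    if "->" in expr:
        part, parent = (s.strip() for s in expr.split("->", 1))
        return part, parent
    label = expr.strip()
    if label in PARTS:
        raise ValueError(f"part '{label}' needs a parent, e.g. '{label} -> person'")
    return label, None

print(parse_label("eye -> person"))  # ('eye', 'person')
print(parse_label("car"))            # ('car', None)
```

As noted above, a single binary operator like this wouldn't stretch to materials and colors; those would probably need the attribute system discussed earlier.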

dobkeratops commented 6 years ago

as an experiment I've submitted some labels with relation combining: foo in bar, foo on bar (I suppose ..above.., ..behind.., ..in front.. could follow the same pattern), e.g. "smartphone in hand". I've still added the individual labels, but imagine if the system could have those kinds of descriptions .. maybe this is something else a general label combining system could do.

I gather that CNNs tend to learn objects with a bit of their surroundings anyway - so maybe it would be quite natural to hint to the system that 'smartphone in hand' and 'smartphone on table' etc. would have separate features to look for.

most objects have a default I think, e.g. "car on road" (most cars will be). On the other hand, "person" is frequently seen either 'on pavement', or indoors

I know natural language is difficult to parse though.. wouldn't want to give the impression that you can describe the scene with completely general sentences .. that's a long way off

bbernhard commented 6 years ago

as an experiment i've submitted some labels with relation combining foo in bar foo on bar (i suppose ..above.. ..behind.. ..infront.. could follow the same pattern), e.g. "smartphone in hand".

very nice!

i know natural language is difficult to parse though.. wouldn't want to give the impression that you can describe the scene with completely general sentences .. that's a long way off

totally agreed, but the idea sounds very interesting. I think with the two-staged label approach (trending -> productive) we would have the possibility to "learn" from the users' input and incrementally allow more and more complex sentences/phrases. I think it could also be interesting to play with voice-to-text (i.e. describe the image with a sentence). I imagine that could be interesting for paralyzed people who can't move their hands anymore.

Another thought:

Do we want to have labels on a per bounding-rectangle basis? Imagine an image with two smartphones - the first one is lying on the floor, the other one on a desk. Imagine that the image already has the label smartphone. If we would now add the labels smartphone on floor and smartphone on desk , two new annotation tasks would be created.

I am wondering now if it would make sense to add the information directly to the individual smartphone bounding box rects. I guess that labeling should usually be faster than annotating. So if we already have two individual bounding box rects, is it maybe more efficient to add the information about the position of the smartphone directly to the bounding box rectangle? Like a "label the bounding box rect" mode, similar to the quiz mode: you get an individual bounding box rect/polygon/etc. and add some information to it.

dobkeratops commented 6 years ago

Do we want to have labels on a per bounding-rectangle basis?

yes, it would need that. One way would be to simply present 'smartphone in hand' as a separate annotation task; another might be to have those pieces annotated separately (as hand, smartphone), then link them.

but bear in mind, even without anything like that, one could still use those as an image-wide training output, or you could restrict it to simpler scenes where there is just one instance initially

I am wondering now, if it would make sense to add the information directly to the individual smartphone bounding box rects.

I guess it will depend on the case.. I think that would work best for items like smartphones for sure. I think we also talked about the idea of general annotated links between annotations (e.g. an arrow from one to the other labelled 'in', etc.)

we would have the possibility to "learn" from the users input and incrementally allow more and more complex sentences/phrases

imagine describing the more complex cases bracketed (LISP-style s-expressions); parse it that way if the first char in a label is an open paren. (But do you put the relation up front, or what? (pizza on (plate on table)) vs (on pizza (on plate table))) .. decompose it to find the individual labels
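With the infix form, that decomposition could be sketched roughly as below. The grammar (subject relation object) and all function names are assumptions made for this sketch:

```python
# Hypothetical sketch: parse "(pizza on (plate on table))" and decompose it
# into individual labels plus (subject, relation, object) triples.
def tokenize(s):
    return s.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Recursively build nested lists from a token stream."""
    tok = tokens.pop(0)
    if tok == "(":
        expr = []
        while tokens[0] != ")":
            expr.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return expr
    return tok

def decompose(expr, labels, relations):
    """Collect plain labels and relation triples; return the head label."""
    if isinstance(expr, str):
        labels.add(expr)
        return expr
    subj, rel, obj = expr  # e.g. ["pizza", "on", [...]]
    s = decompose(subj, labels, relations)
    o = decompose(obj, labels, relations)
    relations.append((s, rel, o))
    return s

labels, relations = set(), []
decompose(parse(tokenize("(pizza on (plate on table))")), labels, relations)
print(sorted(labels))  # ['pizza', 'plate', 'table']
print(relations)       # [('plate', 'on', 'table'), ('pizza', 'on', 'plate')]
```

Even if full sentences are a long way off, this shows that the individual labels fall out of the nested form for free, so the existing per-label pipeline could still consume them.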

bbernhard commented 6 years ago

yes, it would need that. One way would be to simply present 'smartphone in hand' as a separate annotation task; another might be to have those pieces annotated separately (as hand, smartphone), then link them.

ah sorry, I completely misunderstood that. I thought that it might be enough to just focus on the smartphone bounding box and add the information about the smartphone's position (floor/desk) there. But that would require that the neural net can learn enough about the smartphone's position by just looking at the smartphone features. (not sure if that holds true?) So I guess you are right, a separate annotation task might indeed be better.

My main motivation for a generalized annotation refinement mode would be to reduce the number of annotation tasks (as I think that those are the most time intensive ones). I think in some cases it might be easier to enrich an existing annotation with additional information. So instead of adding the labels samsung smartphone or black smartphone I think we could add the information color: black and brand: samsung directly to a specific annotation.

But at the moment I am struggling a bit to find a good UI/UX workflow for that. I guess we could create a new view and call it e.g. "annotation refinement". It shows you one specific annotation and gives you the possibility to add additional label(-properties). But I am not sure if it's good to create another view for that (I already have the feeling that we have quite a lot of them).

Any suggestions are welcome ;)

i think we also talked about the idea of general annotated links between annotations (e.g. an arrow form one to other labelled 'in', etc)

yeah, right. I think there is only a specific, limited set of label(-properties) that make sense to specify on a per-bounding-box basis. I could imagine that it makes sense to restrict the allowed labels and only allow labels that enrich the object in question (e.g. add information about material, color, brand, ...).

I think we should be careful and not give users the possibility to abuse the annotation refinement mode (e.g. draw a rectangle around the whole scene and use the annotation refinement mode to add person, car, ... labels).
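That restriction amounts to a small allowlist check at submission time. A minimal sketch, assuming an invented set of enriching property kinds (not an actual ImageMonkey rule set):

```python
# Hypothetical sketch: only enriching properties may be attached to an
# annotation in refinement mode; plain object labels are rejected.
ALLOWED_REFINEMENT_KINDS = {"material", "color", "brand", "size"}

def validate_refinement(kind, value):
    """Accept (kind, value) only if the kind enriches an existing object."""
    if kind not in ALLOWED_REFINEMENT_KINDS:
        raise ValueError(f"'{kind}' is not an allowed refinement property")
    return (kind, value)

print(validate_refinement("color", "black"))  # ('color', 'black')
# validate_refinement("person", "")  # would raise: not a refinement property
```

This way a rectangle around the whole scene can't smuggle in new object labels, because person, car, etc. are not refinement kinds.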

bbernhard commented 6 years ago

I think a UI for a general annotation refinement mode is actually pretty hard to nail. Let's suppose for now that we don't care about the number of views and create another annotation refinement view.

The main problem I would see with that approach: each bounding box can now be refined with additional label(-properties). Do we show all bounding boxes in the same image, or do we split the bounding boxes up, so that we have 10 times the same image, always showing a different bounding box?

I think the former is more compact and doesn't pollute the image grid that much, but it's hard to grasp which of those bounding boxes are well labeled and which are not. With the latter, you can easily scroll through and see all the labels on a per-bounding-box basis... but it pollutes the image grid (you see the same image n times)

dobkeratops commented 6 years ago

Each of those bounding boxes can now be refined with additional label(-properties). Now: Do we show all bounding boxes in the same image or do we split the bounding boxes up, so that we have 10 times the same image, always showing a different bounding box?

once per annotation is the only way really.. you'll also need to distinguish singular from group annotations, because there will be cases like crowds or dense baskets which would have been annotated as one object.. that would be the first refinement to perform

regarding annotating singular/plural up-front, you might need to do those simultaneously (a toggle) if you want to keep the concept of annotating all in an image - e.g. with crowd scenes you would start annotating individuals up until a point, then do the rest as plural.. however, the point where it's practical to do individuals is open to interpretation - if you created separate tasks for individuals & crowds/groups, it would be ambiguous as to whether it was completed

bbernhard commented 6 years ago

because there will be cases like crowds or dense baskets which would have been annotated as one object.. .. that would be the first refinement to perform

good idea!

regarding annotating singular/plural up-front, you might need to do those simultaneously (a toggle) if you want to keep the concept of annotating all in an image - e.g. with crowd scenes you would start annotating individuals up until a point, then do the rest as plural.. however, the point where it's practical to do individuals is open to interpretation - if you created separate tasks for individuals & crowds/groups, it would be ambiguous as to whether it was completed

totally agreed. I guess we should allow both. Either you specify singular/plural up-front (where default is singular), or you can later add that information during refinement.

I tried to visualize how a browse based refinement could look like (yeah, I am really bad with Photoshop ;)):

The first version shows the per-bounding-box version. I thought that it might be a good idea to show the information that's already collected beside the image. Hopefully that makes it easier to see where some information is missing. (We could also sort the result set, so that images with little/no additional information are shown first)

(screenshot: browse_refinement_1)

The second version is the all-in-one approach. Here we could show the labels when you hover over a specific bounding box. I think it doesn't make sense here to show the labels upfront, as it could get confusing with all the bounding boxes (re: what information belongs to which bounding box). But what we could probably do is show a number... which represents the amount of additional information for that specific bounding box.

(screenshot: browse_refinement_2)

I think we can even make the information more compact by using icons instead of the color: prefix.

Does that make sense? Is it a good idea to use a browse based approach here?

dobkeratops commented 6 years ago

makes sense. Some of the refinements would simply be choosing new labels from the label graph (e.g. car -> sports car, etc.)

bbernhard commented 6 years ago

Is there a version you would prefer? I guess we can always change it, in case we realize that one variant is superior to the other. But for the initial implementation, it's good to start somewhere :)

What do you think about making that a separate view? Is that a good idea, or would you see a better place to add that functionality? I am a bit worried, that we end up with too many views (label, annotate, validate, validate annotations, quiz, annotation refinement..). But I can't think of any other place where that would fit.

Or should we just start with a separate view and worry about that later? I guess we could add some more top level menus and put less needed functionality in there.

dobkeratops commented 6 years ago

I wonder if you could do the work of quiz and validation in Refinement - e.g. refinement is a more advanced/evolved version of quiz/validate (validate individual annotations.. quiz driven by the label graph)

Or should we just start with a separate view and worry about that later?

I suppose the best way would be to develop a new view, then see if it is feasible to replace the others.

(could refining a label at least validate that it was the previous one? 'invalidate' could be like rewinding an annotation backwards in the label graph?..)

dobkeratops commented 6 years ago

another example:-

wall/fence

wasn't sure if I should annotate these as separate... but if you could have combined them, it's accurate to annotate the whole thing as wall or fence. The one on the left is a fence on top of a wall, but the one on the right is more like a single structure composed of pillars, low walls, and fences.. (pillars/wall/fence might be an even better description)

(screenshot: 2018-06-25 at 11:21:54)
dobkeratops commented 6 years ago

Here's an example - complex roadworks - where one is unsure between road and pavement, because the road is actually being changed, and part is temporarily re-purposed by barriers. road/pavement would cover the potential use of the ambiguous area, expressing the range of what it could be, but also the uncertainty.

(screenshot: 2018-06-25 at 19:11:31)
dobkeratops commented 6 years ago

(another example... road partitioned off.. not currently drivable.. what to call it?)

(screenshot: 2018-06-25 at 19:15:56)