info - detailed-people annotations

Towards the 'pose-estimation' task (I'm interested in these kinds of annotations with the goal of AI-assisted scanning)

We've got about 350 'more detailed' body annotations (e.g. joints and limbs),
maybe 1500 "low res" annotations with a pose hint, e.g. "/sitting", "/walking" etc. as part of the label,
for faces, 350 "more detailed" face annotations (eye, mouth, cheek etc.),
and 2000+ person annotations in total (e.g. just saying man vs woman).

For reference, I think Facebook's detailed body remapping (pretty amazing) relied on 50,000 hand-annotated examples; I'm guessing they'd have paid dozens of people to get that done. But that was a very high-res task, with enough detail to figure out a full UV mapping for a surface.

What I wonder is how far we could get with transfer learning, e.g. training on video ("guess the motion vectors", "guess the movie", "guess the actor") and then using our smaller number of annotations for the detail. Another option is just videoing a crowd from a static location and treating 'anything moving = person'. We might also be able to render some synthetic data, but that wouldn't be photoreal (it's hard to get realistic human CGI; it can be bought but gets pricey quickly).
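To make the 'anything moving = person' idea a bit more concrete, here is a minimal sketch in plain Python. It assumes grayscale frames from a fixed camera stored as nested lists, uses a per-pixel median as the background, and reports one bounding box of changed pixels per frame. The function names, the threshold and the toy frames are all made up for illustration; a real pipeline would need something more robust (lighting changes, several people per frame, moving things that aren't people).

```python
from statistics import median

def background_model(frames):
    """Per-pixel median over a list of equally sized grayscale frames."""
    height, width = len(frames[0]), len(frames[0][0])
    return [[median(f[y][x] for f in frames) for x in range(width)]
            for y in range(height)]

def moving_box(frame, background, threshold=30):
    """Bounding box (x0, y0, x1, y1) of pixels that differ from the
    background by more than `threshold`, or None if nothing moved."""
    xs, ys = [], []
    for y, row in enumerate(frame):
        for x, value in enumerate(row):
            if abs(value - background[y][x]) > threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))

def make_frame(width, height, blob_x):
    """Toy frame: dark background with a bright 2x3 'person' blob."""
    frame = [[10] * width for _ in range(height)]
    for y in range(2, 5):
        for x in range(blob_x, blob_x + 2):
            frame[y][x] = 200
    return frame

# Three frames with the blob walking left to right.
frames = [make_frame(12, 6, blob_x) for blob_x in (1, 4, 7)]
background = background_model(frames)
boxes = [moving_box(frame, background) for frame in frames]
print(boxes)
```

Boxes like these would only be rough "person" candidates to crop and check by hand, not finished annotations.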

I think you showed an example of someone getting results from 80 annotations for recognising balloons; however, that's a pretty simple shape and they were bright colours against natural backgrounds. I'd guess getting anywhere with people would take more.

As far as getting lots of examples done goes, it's really quick to hammer out large numbers of circles on an image; most of the time goes on selecting/serving images and switching labels (e.g. if you've got a single large image with many examples of something, I think you could get 1 annotation per second). I also recall that when scrolling through video it was pretty easy to get examples out as crops using the Mac partial-screenshot tool.

Perhaps we could use FFmpeg to extract frames from video and paste them together, e.g. into grids of 4x4 frames as single images. Annotating single images whose frames sit at known positions within a video could enable interpolating the annotations frame by frame. We might not need a dedicated video annotation tool, although being able to see reference images side by side, or something else to confirm the same pieces between 2 frames, would help.
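A rough sketch of how the grid idea could work, assuming ffmpeg's `fps`, `scale` and `tile` filters are used to build the 4x4 sheets. The helper names, the file names and the 320x180 cell size are hypothetical. It also assumes sheets are counted from 0 here (ffmpeg's numbered output files start at 1 by default) and that a point is interpolated linearly between two annotated frames, which only holds for short gaps.

```python
def contact_sheet_command(video, out_pattern, fps=1, cols=4, rows=4, cell_width=320):
    """Build an ffmpeg command that samples `fps` frames per second,
    scales them to `cell_width` and tiles them into cols x rows grids."""
    filters = f"fps={fps},scale={cell_width}:-1,tile={cols}x{rows}"
    return ["ffmpeg", "-i", video, "-vf", filters, out_pattern]

def locate(sheet_index, gx, gy, cell_w, cell_h, cols=4, rows=4):
    """Map a click at (gx, gy) on a sheet to (frame index, x, y) within
    that frame. The tile filter fills the grid row by row."""
    col, row = gx // cell_w, gy // cell_h
    frame = sheet_index * cols * rows + row * cols + col
    return frame, gx % cell_w, gy % cell_h

def interpolate(frame_a, point_a, frame_b, point_b, frame):
    """Linearly interpolate an annotated point between two sampled frames."""
    t = (frame - frame_a) / (frame_b - frame_a)
    return tuple(a + t * (b - a) for a, b in zip(point_a, point_b))

command = contact_sheet_command("crowd.mp4", "sheet_%03d.png")
print(" ".join(command))
# To actually run it: import subprocess; subprocess.run(command, check=True)

# A click at (700, 200) on the first sheet, with 320x180 cells:
print(locate(0, 700, 200, 320, 180))

# A hand annotated at (100, 50) in frame 0 and (140, 70) in frame 4:
print(interpolate(0, (100, 50), 4, (140, 70), 1))
```

Since frames are sampled at a known rate, the frame index also gives the timestamp in the video (index divided by `fps`), so annotations made on a sheet can be mapped back onto the original footage.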

Label-switching hotkeys would also be a big help (low-hanging fruit, IMO): given an image, you could paste the parts label list in and then work 2-handed, i.e. left hand toggles labels using the keyboard, right hand annotates. You'd be able to get up to the "1 part per second" speed.
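As a sketch of what the hotkey side could look like (not how the ImageMonkey UI works; `assign_hotkeys` and the key layout are invented here), a pasted label list could be mapped onto keys that sit under the left hand:

```python
def assign_hotkeys(pasted_labels, keys="12345qwertasdfg"):
    """Map each distinct label in a pasted, newline-separated list to a
    left-hand key, in order. Labels beyond the available keys get none."""
    labels = [line.strip() for line in pasted_labels.splitlines() if line.strip()]
    unique = list(dict.fromkeys(labels))  # drop duplicates, keep order
    return dict(zip(keys, unique))

hotkeys = assign_hotkeys("head\narm\nhand\nfoot\n\nhand\ncheek\n")
print(hotkeys)
```

Pressing a key would then select the active label, so the mouse hand never has to leave the image.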

Some counts of existing annotations, found by searching using "option:refinement" ("/walking", "/standing", "/running" and "/sitting" are the main qualifiers for man & woman annotations):

left/cheek/woman 128, cheek/woman 181, left/cheek/man 3, cheek/man 24, cheek 16

woman 1272, woman/sitting 478, woman/standing 298, woman/walking 165, woman/running 9, woman/reading 1, woman/reclining 1, woman/sleeping 1, woman/asian 5, woman/european 1, woman/sitting/eating 2, woman/sitting/reading 8, woman/ridingBicycle 31, woman/ridingMotorbike 4

man 565, man/sitting 247, man/walking 119, man/running 10, man/lyingDown 2, man/ridingBicycle 38, man/ridingMotorbike 10, man/asian 3, person 2009

person 2000, head/person 230

arm/person 99, hand/person 160

hand/man 497, hand/woman 1223, foot/woman 364, foot/man 158, foot/person 51
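To sanity-check the "maybe 1500" figure for pose hints, the four main qualifiers can be tallied from the counts above. This is just a quick sketch: the numbers are copied from the list, nested labels such as woman/sitting/reading and the other qualifiers (ridingBicycle etc.) are left out, and `pose_totals` is a made-up helper.

```python
POSES = ("sitting", "standing", "walking", "running")

counts_text = """
woman/sitting 478
woman/standing 298
woman/walking 165
woman/running 9
man/sitting 247
man/walking 119
man/running 10
"""

def pose_totals(text):
    """Sum the counts per pose qualifier over 'label count' lines."""
    totals = dict.fromkeys(POSES, 0)
    for line in text.strip().splitlines():
        label, count = line.rsplit(None, 1)
        qualifiers = label.split("/")[1:]  # everything after man/woman
        for pose in POSES:
            if pose in qualifiers:
                totals[pose] += int(count)
    return totals

totals = pose_totals(counts_text)
print(totals)                # {'sitting': 725, 'standing': 298, 'walking': 284, 'running': 19}
print(sum(totals.values()))  # 1326
```

So the four main qualifiers alone account for 1326 annotations, with the remaining qualifiers adding a little more on top.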
