ImageMonkey / imagemonkey-core

ImageMonkey is an attempt to create a free, public open source image dataset.
https://imagemonkey.io

Annotations tagged by user.. as property? #289

Open dobkeratops opened 3 years ago

dobkeratops commented 3 years ago

Regarding the potential for dataset vandalism, would it be possible to automatically add a property to each annotation polygon, e.g. “user = ”? And possibly a default for “not logged in”. Maybe an anonymous option for privacy would be preferable as well. Would storing the IP (and maybe a coarse time) suffice for “not logged in”?

What I have in mind is that validations could feed into a “user confidence”, and an approximate weighting factor could be given to each not-yet-validated annotation.
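One possible way to turn validations into a per-user confidence is sketched below. This is only an illustration of the idea, not anything ImageMonkey implements: the function names, the `(user_id, is_valid)` input format, the smoothing priors and the default weight of 0.5 for unknown users are all assumptions.

```python
from collections import defaultdict


def user_confidence(validations, prior_yes=1.0, prior_no=1.0):
    """Estimate how trustworthy each user is from validated annotations.

    validations: iterable of (user_id, is_valid) pairs, one per validation
    of an annotation made by that user (hypothetical input format).
    Returns {user_id: smoothed fraction of annotations judged valid}.
    The priors keep users with very few validations close to 0.5.
    """
    yes = defaultdict(float)
    total = defaultdict(float)
    for user_id, is_valid in validations:
        total[user_id] += 1
        if is_valid:
            yes[user_id] += 1
    return {
        u: (yes[u] + prior_yes) / (total[u] + prior_yes + prior_no)
        for u in total
    }


def annotation_weight(user_id, confidence, default=0.5):
    """Weight for a not-yet-validated annotation; unknown users get the default."""
    return confidence.get(user_id, default)


validations = [
    ("alice", True), ("alice", True), ("alice", True),
    ("bob", False), ("bob", True),
]
conf = user_confidence(validations)
weights = {u: annotation_weight(u, conf) for u in ("alice", "bob", "not-logged-in")}
```

The resulting weights could then be used as per-sample weights during training, so that annotations from users with a poor validation record count for less until someone validates them.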

I figured it would be useful per label as well (in unlabelled images, labels per image can also be trained on), but I thought the property mechanism might give you an easy way to do this per annotation.

Perhaps a user + timestamp could account for the possibility of a user getting better over time.
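As a rough sketch of how a timestamp could capture a user improving over time, recent validations can be given more influence than old ones. Everything here is an assumption made for illustration: the function name, timestamps expressed as plain day numbers, the 90-day half-life and the prior.

```python
def recency_weighted_confidence(history, now, half_life_days=90.0,
                                prior=0.5, prior_strength=2.0):
    """Confidence for one user where recent validations count more.

    history: list of (timestamp_in_days, is_valid) pairs (hypothetical format).
    Each validation's influence halves every `half_life_days`.
    With an empty history the result is just the prior.
    """
    num = prior * prior_strength
    den = prior_strength
    for t, is_valid in history:
        w = 0.5 ** ((now - t) / half_life_days)
        num += w * (1.0 if is_valid else 0.0)
        den += w
    return num / den


# A user whose early annotation was rejected but whose recent one was accepted ...
improving = recency_weighted_confidence([(0, False), (180, True)], now=180)
# ... versus a user with the opposite history.
declining = recency_weighted_confidence([(0, True), (180, False)], now=180)
```

With the same number of accepted and rejected annotations, the improving user ends up with the higher confidence, which is the behaviour the timestamp is meant to enable.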

Would there be any downsides to this? Database bloat.. privacy?

It should be fine as well to segment off a default for “all the annotations before tracking started”.

bbernhard commented 3 years ago

Very interesting idea!

I quickly checked in the backend. Some of the information is already there (e.g. the timestamp when an annotation was created/updated), but it isn't stored in a way that makes it easily accessible. But at least we could use the information that's already there to fill the gaps once we have a proper tagging mechanism in place.

I don't know if you are aware of browser fingerprinting. It's a really interesting (but also pretty creepy) approach to uniquely identify someone on the internet (if you are interested, you can check here whether your browser fingerprint is unique: https://amiunique.org). Since the beginning of ImageMonkey I have been collecting fingerprints for annotations, labels and validations via this JavaScript library: https://github.com/fingerprintjs/fingerprintjs. A browser really leaks a lot of information, but in order to preserve the user's privacy I am only storing the hash. That means I cannot say who someone is, but given the hash I can say what the user has contributed to the dataset. The idea was to use that information at some point for something useful (e.g. to fight trolls that are uploading NSFW images). The only "problem" is that ad blockers like uBlock Origin (which I am also using) block those fingerprinting libraries by default. So I guess that's probably not something we could reliably use.

Just out of interest I quickly checked in the database. I hope I haven't messed up the database queries, but if I am not totally wrong here, there are

labeling: 213552 total fingerprints; 320 unique fingerprints
annotating: 65342 total fingerprints; 317 unique fingerprints

in the database.
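To illustrate the idea of storing only a hash and still being able to group contributions, here is a minimal sketch. It is not how fingerprintjs or the ImageMonkey backend actually work: the library computes its own identifier in the browser, and SHA-256, the component names and both helper functions below are assumptions chosen for illustration.

```python
import hashlib
import json
from collections import Counter


def fingerprint_hash(components):
    """Reduce browser attributes to an opaque ID so the raw values are not stored.

    components: dict of browser attributes (hypothetical keys).
    Keys are sorted so the same attributes always give the same hash.
    """
    canonical = json.dumps(components, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def summarize(hashes):
    """Count total and unique fingerprints, plus contributions per fingerprint."""
    counts = Counter(hashes)
    return {
        "total": sum(counts.values()),
        "unique": len(counts),
        "per_fingerprint": counts,
    }


a = fingerprint_hash({"userAgent": "ExampleBrowser/1.0", "timezone": "UTC"})
a_again = fingerprint_hash({"timezone": "UTC", "userAgent": "ExampleBrowser/1.0"})
b = fingerprint_hash({"userAgent": "OtherBrowser/2.0", "timezone": "UTC"})

# Three annotations, two of them from the same browser.
stats = summarize([a, a_again, b])
```

The "total" and "unique" values correspond to the kind of figures quoted above, and "per_fingerprint" shows what each anonymous contributor added without revealing who they are.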

In case you are interested in doing some offline processing and data analysis, I guess I could expose the fingerprint information via a REST API call. (I am still working on the "query for un-annotated" feature, but I think I should be done with that by the end of this week or the beginning of next week.)

perhaps a user + time stamp could account for the possibility of a user getting better over time.

The username and a timestamp shouldn't be a problem. I am not sure regarding the IP address though. With GDPR I think I am required to show one of those annoying "privacy policy" popups.

I figured it would be useful per label as well (in unlabelled images, labels per image can also be trained on) but I thought the property mechanism might give you an easy way to do this per annotation

Adding that information as a property shouldn't be a problem. The only disadvantage would be that you would also get that information when you query the dataset for annotations/properties. But as every image/label/annotation/validation is a separate row in the database, we could also just add the information there instead. That shouldn't be that much work.
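The trade-off between the two options can be sketched with a small data model in which the contributor information lives next to the annotation rather than inside its properties, so that normal queries don't return it. This is a hypothetical model for illustration only; the class names, fields and the `export` helper are assumptions and do not reflect the actual ImageMonkey schema or API.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class Provenance:
    # Username or fingerprint hash; None stands for
    # "annotated before tracking started" (hypothetical convention).
    contributor: Optional[str] = None
    created_at: Optional[str] = None  # ISO 8601 timestamp


@dataclass
class Annotation:
    label: str
    polygon: list                                    # list of [x, y] points
    properties: list = field(default_factory=list)   # user-visible properties
    provenance: Provenance = field(default_factory=Provenance)


def export(annotation, include_provenance=False):
    """Serialize an annotation; provenance is only included on request."""
    data = asdict(annotation)
    if not include_provenance:
        data.pop("provenance")
    return data


ann = Annotation(
    label="dog",
    polygon=[[0, 0], [10, 0], [10, 10]],
    properties=["brown"],
    provenance=Provenance(contributor="alice", created_at="2021-04-01T12:00:00Z"),
)
public_view = export(ann)
analysis_view = export(ann, include_provenance=True)
```

Keeping provenance out of `properties` means dataset queries stay unchanged, while an offline analysis can still ask for the extra fields explicitly.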

dobkeratops commented 3 years ago

I like the idea of storing a hash (a kind of anonymised id I guess). I would be interested in doing some offline processing, trying out ways of getting an overview of the data.

Have you seen the news stories - “MIT study shows the common datasets are riddled with errors”? And the concern over biases. It does make it seem like ways of reviewing open data are needed - and people shouldn’t take ImageNet, CIFAR etc. for granted.

I guess you’re aware of the crypto surge as well. I’m sure someone will end up doing a “label coin” of sorts. I’ve generally had a negative opinion of crypto because of these GPU shortages (i.e. it’s extremely hard to get a 30x0 now). But I wonder if you’ve considered trying to make this platform incentivise people by issuing a coin (probably too much work for you in the few spare time slots you have around a job). I’m not quite sure how you’d track everything, but maybe these user tags could be a step toward that. Also, I’m not sure how it would mesh with open sourcing. Imagine if people had the option of making their annotations premium data. Perhaps this kind of per-annotation tagging could feed into something like that eventually. Seeing some of the insane hype around NFT art, it makes me wonder if you could sell “label coins” speculatively. Reading up on this world, I do like the idea of “IPFS”.

bbernhard commented 3 years ago

I like the idea of storing a hash (a kind of anonymised id I guess)

yeah, right :)

have you seen the news stories - “MIT study shows the common datasets are riddled with errors”. And concern over biases. Does make it seem like ways of reviewing open data are needed - and people shouldn’t take ImageNet, CIFAR etc. for granted.

Many thanks for mentioning that - the article totally slipped through my news filter! I have to admit that I always found it pretty surprising that everybody was using datasets like ImageNet without questioning the quality of the dataset. That's btw. also one of the reasons why the "validation" feature has been in ImageMonkey since the early days of the platform. I really wanted to make that one of the core features.

Regarding the crypto coins: I thought about something similar before, but without the crypto part (as it's probably quite a lot of effort to implement). I've sketched out the idea of micropayments here a bit. I think the platform is already in a state where it would be possible to do something like that.

So basically something similar to Amazon's Mechanical Turk, but where all data is made available to the public (after some time of exclusive access). I think something like that would be kinda cool. But I think most people probably want to have exclusive access to the data and don't want to share anything (as they would lose their competitive advantage).