activeloopai / deeplake

Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai
Mozilla Public License 2.0

[FEATURE] Detect duplicate samples when adding new data to tensors (images) #1757

Open michelemoretti opened 2 years ago

michelemoretti commented 2 years ago

🚨🚨 Feature Request

Is it possible to discard samples in case they are already present in the dataset? If not, would this be something interesting to implement? I feel like this would make the dataset extension pipeline much easier to use and implement

davidbuniat commented 2 years ago

hi @michelemoretti, thanks for the feature request! yes, this is a great idea but would add some overhead for computing hashes of the data while ingesting (assuming they are exactly the same images).

Can you tell us more about how this would simplify the dataset extension pipeline for you (maybe just illustrating by an example)? The answer would help us with prioritizing.

michelemoretti commented 2 years ago

Hi David, I was thinking of my most frequent use case, in which I receive the dataset in batches as it is collected. Updating the dataset would require me to change the script (that appends the samples to the dataset) to target only the new samples received. Having a syntax to update the dataset by feeding it all of the samples would allow me to use the same script for both creating the original dataset and updating it by appending all of the samples (but on the backend only the new samples would be appended). Hope that was clear enough.

davidbuniat commented 2 years ago

Got it, @michelemoretti. Just to make sure we are on the same page: are the repeated images pixel-perfect copies, or could there be minor differences between them?

michelemoretti commented 2 years ago

Absolutely. We're talking about identical files/images.
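
For exact duplicates like these, the idea discussed above (hash each sample at ingest and skip anything already seen) can be sketched roughly as follows. This is only an illustration and not Deep Lake's API: `DedupAppender`, `append_all`, and the in-memory list and set standing in for an image tensor and a hash index are all hypothetical names made up for the example.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a SHA-256 digest of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


class DedupAppender:
    """Hypothetical stand-in for a dataset that skips byte-identical samples."""

    def __init__(self):
        self.samples = []   # stand-in for the image tensor
        self.seen = set()   # stand-in for a stored index of content hashes

    def append_all(self, blobs):
        """Append only blobs whose hash has not been seen; return how many were added."""
        added = 0
        for blob in blobs:
            digest = content_hash(blob)
            if digest in self.seen:
                continue  # identical bytes already ingested, skip
            self.seen.add(digest)
            self.samples.append(blob)
            added += 1
        return added


store = DedupAppender()
first = store.append_all([b"img-a", b"img-b"])             # initial batch
second = store.append_all([b"img-a", b"img-b", b"img-c"])  # full re-feed
print(first, second, len(store.samples))
```

With this shape, the same script can be run on the full set of files every time a new batch arrives, and only the unseen samples are appended. The cost is one hash computation per sample at ingest, which is the overhead mentioned earlier in the thread.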

protocolog commented 2 years ago

I want to work on this issue. Please assign me this issue. Thanks.

protocolog commented 2 years ago

I want to work on this issue. Please assign me this issue. Thanks. @michelemoretti @davidbuniat @sgrove @jraman

mikayelh commented 2 years ago

hey @protocolog , thanks a lot for your contribution, and apologies for the late reply. Assigned the issue! You can join the Activeloop community slack (slack.activeloop.ai) to ask questions. :)

protocolog commented 2 years ago

Please assign issue #1757 to me. You said you assigned me, but my profile shows it as not assigned. Your Slack link is not working; please share an alternate way to contact you. @davidbuniat @mikayelh @michelemoretti @sgrove

mikayelh commented 2 years ago

@protocolog apologies, fixed the link. Please refrain from tagging people who are not involved in this conversation to spare their inboxes. Thanks. :)

protocolog commented 1 year ago

I am unable to join the workspace on slack. Please help me out. My slack ID is h20220047@goa.bits-pilani.ac.in , Thanks @davidbuniat @mikayelh @michelemoretti

nmichlo commented 1 year ago

This is an interesting problem that I also face, not just for identical images but for near-identical ones too. I have not actually tested this workflow, but I imagine it could be done by generating a perceptual hash (or a normal hash) of each image (e.g. with the imagehash lib) and storing it in a separate tensor that corresponds to your main image tensor. Then on ingest you can query against this tensor to skip duplicates or near-duplicates, depending on the hash approach you choose. You could even adapt this to use KNN matching or embeddings instead of hashes.
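
A rough sketch of the near-duplicate idea, under some loud assumptions: it uses a hand-rolled average hash in pure Python rather than the imagehash library, it takes images as small 2D lists of grayscale values (a real pipeline would first decode and downscale each image, for example to 8x8), and the names `average_hash`, `hamming`, and `is_near_duplicate` are made up for illustration. Like the workflow above, it is untested against a real dataset.

```python
def average_hash(pixels):
    """Pack one bit per pixel into an int: 1 if the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(new_hash, known_hashes, max_distance=1):
    """True if any stored hash is within max_distance bits of new_hash."""
    return any(hamming(new_hash, h) <= max_distance for h in known_hashes)


# Tiny 2x2 "images" for illustration only.
base = [[10, 200], [220, 30]]
noisy = [[12, 198], [221, 29]]      # same picture with slight pixel noise
different = [[200, 10], [30, 220]]  # inverted pattern

known = [average_hash(base)]        # stand-in for the separate hash tensor
print(is_near_duplicate(average_hash(noisy), known))
print(is_near_duplicate(average_hash(different), known))
```

The `max_distance` threshold is the knob that trades false positives against missed duplicates. A linear scan over stored hashes like this one would get slow on large datasets, which is where the KNN or embedding-based lookup mentioned above would come in.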

mikayelh commented 1 year ago

thanks a lot @nmichlo for chiming in here and the suggestion! @protocolog I've re-sent you an invite to our slack but I noticed that you joined. Let me know if you have other questions:)

I'm also tagging @istranic here in case he thinks this can be included on the roadmap. :)