fake-name / xA-Scraper


[REQUEST] 'True' images & SHA256 #91

Open · God-damnit-all opened this issue 4 years ago

God-damnit-all commented 4 years ago

(1) As we've talked about before, metadata in particular means two images can have exactly the same image data but differ in file size - sometimes alarmingly so. What if, as a configuration option or otherwise, everything but the image data were discarded, like taking a candy bar out of its wrapper, and then 'rewrapping' it in a new image file with no unnecessary data?

(Technically you could say this is the same as discarding all the unnecessary data, but it seems like the safer route, without weird edge cases from stray junk bytes here and there...)
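A minimal sketch of what the 'rewrapping' could look like for JPEGs, assuming baseline files: copy the marker segments, drop the metadata ones (APP1-APP15 and COM here; keeping APP0/JFIF is my own choice), and pass the entropy-coded image data through untouched so nothing is recompressed. The function name is hypothetical, not anything xA-Scraper currently has:

```python
import struct

def strip_jpeg_metadata(data: bytes) -> bytes:
    """Return the JPEG byte stream with APPn/COM metadata segments removed.
    The compressed image data itself is copied verbatim (no recompression)."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    out = bytearray(b"\xff\xd8")
    i = 2
    while i < len(data) - 1:
        if data[i] != 0xFF:
            raise ValueError("corrupt JPEG: expected a marker")
        while data[i + 1] == 0xFF:          # skip optional fill bytes
            i += 1
        marker = data[i + 1]
        if marker == 0xD9:                  # EOI: end of image
            out += b"\xff\xd9"
            break
        if marker == 0xDA:                  # SOS: scan data follows, copy the rest
            out += data[i:]
            break
        seg_len = struct.unpack(">H", data[i + 2:i + 4])[0]
        segment = data[i:i + 2 + seg_len]
        # Drop APP1..APP15 (EXIF, XMP, ICC, junk) and COM comments; keep APP0.
        if not (0xE1 <= marker <= 0xEF or marker == 0xFE):
            out += segment
        i += 2 + seg_len
    return bytes(out)
```

Hashing the output of something like this, rather than the raw file, would make two copies that differ only in embedded metadata collapse to the same digest.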

(2) If xA-Scraper hashed all the image data it downloads with SHA256, there are a number of ways one could approach de-duplication. I can't say which would ultimately be best, but the sooner image data starts getting hashed, the better, because the hashes will already be there once that point is reached. An operation to hash all pre-existing images would be good too. This is only going to become more of an obstacle as the data hoarding piles higher.
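One possible shape for that, hashing the decoded pixels rather than the raw file bytes so metadata differences never enter into it (the function name and the RGBA normalization are my own choices, and this is slower than hashing the file itself):

```python
import hashlib
from PIL import Image  # Pillow

def pixel_sha256(path: str) -> str:
    """SHA-256 over the decoded pixel data only, so EXIF/XMP/junk-byte
    differences between otherwise identical files don't change the hash."""
    with Image.open(path) as im:
        im = im.convert("RGBA")                            # normalize the pixel mode
        digest = hashlib.sha256()
        digest.update(f"{im.width}x{im.height}".encode())  # fold dimensions into the hash
        digest.update(im.tobytes())
    return digest.hexdigest()
```

A one-off pass over the existing archive could compute and store these alongside each file, so duplicates are catchable both retroactively and at download time.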

fake-name commented 4 years ago

> two images can have exactly the same image data but differ in file size - sometimes alarmingly so.

This is an inaccurate assumption. The two images will have almost identical image data, but they will differ. There are other approaches (that I've implemented elsewhere) to determine if one image is probably a copy of the other.
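The 'other approaches' aren't spelled out in this thread, but perceptual hashing is one common way to flag probable copies even across recompression; a rough sketch using the third-party imagehash package (the distance threshold of 6 is an arbitrary illustration, not a tuned value):

```python
from PIL import Image
import imagehash  # third-party: pip install ImageHash

def probably_same_picture(path_a: str, path_b: str, max_distance: int = 6) -> bool:
    """Compare pHash fingerprints; a small Hamming distance means the two
    files almost certainly show the same picture, even if one copy was
    recompressed or had its metadata stripped."""
    hash_a = imagehash.phash(Image.open(path_a))
    hash_b = imagehash.phash(Image.open(path_b))
    return (hash_a - hash_b) <= max_distance
```

Note that this only says two files are probably the same picture; it says nothing about which one is the original, which is the harder problem described below.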

Every time an image is resaved as JPEG, the actual image contents are changed.

The issue here boils down to, given two images, determining which is the original.

Basically, as far as I can tell, this is an unsolved problem. I spent a bunch of time analyzing the issue a few years ago, and I couldn't figure out a heuristic for picking the original (or "True", if you will) image. For lossless compression, the problem is fairly irrelevant, but the way JPEG does image compression can actually add data, making the recompressed image more complex. This means you can't use image size and/or complexity as a proxy for determining ancestry.

I think that if you could constrain the problem to a single JPEG compressor, or a single JPEG compression level, this could be resolved, but unfortunately that's functionally impossible given that we're dealing with content from the internet.

God-damnit-all commented 4 years ago

> Every time an image is resaved as JPEG, the actual image contents are changed.

I almost included this in my OP, but I thought it went without saying. It isn't what I was talking about, though. I was referring to how difficult deduplication becomes when minor differences in a file's metadata force you to hash all of the image data rather than the file itself; I have seen instances where added junk accounted for 20 KB of difference without any change in JPEG compression level or any recompression.

fake-name commented 4 years ago

There are places that are modifying images without recompressing?

Huh.

God-damnit-all commented 4 years ago

Actually, now that I think about it, I think the culprit is more that some sites strip excess metadata while others do not - and the excess metadata comes from the programs artists are using.

fake-name commented 4 years ago

I've not seen any site that strips metadata without also recompressing (though I haven't looked too closely). Who does this?

God-damnit-all commented 4 years ago

I don't know why it didn't occur to me to have an example ready, so I opened the folder where I save Discord stuff and plucked one out as quickly as possible. It uses a friend's private art that I'm sure he doesn't want reposted, though, so I'll email it to you.

My example uses imgur (and yes, I know imgur is known for recompressing, but it didn't do so in this case; more details will be in the email).