Dataset quality - Githubissues

DoubangoTelecom commented 3 years ago

Hello,

I'd like to thank you for the amazing work you have done. That said, we have some issues with your dataset:

1/ We have a tool that checks images at the pixel level to see whether they have been compressed more than once or recaptured. Many of the images you have shared as live are clearly not. For some of them you don't even need a tool to notice it.
2/ The CelebA dataset was obtained by web scraping; there is no scientific basis to assert that these images are live.
3/ Most of the spoof images are heavily distorted; the faces have exaggerated aspect ratios. This makes them very easy to find.
4/ A good number of the live images are heavily edited (resize, histogram change...). I understand this is not about image forensics, but I don't think these images should be part of the dataset.
5/ ...

We don't use your dataset in our product; we were just curious to check it. We have an online tester (https://www.doubango.org/webapps/face-liveness/) and an SDK in production. We have cleaned up your dataset and can share it here if you give us authorization.
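
To illustrate the aspect-ratio point, here is a minimal sketch of how faces with an exaggerated aspect ratio could be flagged. It is not the tool mentioned in this comment: the function name `is_distorted`, the ratio bounds, and the sample box sizes are all hypothetical and chosen only for illustration.

```python
def is_distorted(width, height, min_ratio=0.6, max_ratio=1.0):
    """Return True when a face box's width/height ratio is outside the expected range.

    The bounds are hypothetical; a real check would calibrate them on known-good faces.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    ratio = width / height
    return ratio < min_ratio or ratio > max_ratio


# Hypothetical face bounding-box sizes (width, height) in pixels.
boxes = {"a.png": (120, 160), "b.png": (300, 150), "c.png": (40, 160)}

# Collect the images whose face box looks stretched or squeezed.
flagged = sorted(name for name, (w, h) in boxes.items() if is_distorted(w, h))
print(flagged)
```

If a check this simple separates most spoof images from live ones, a model trained on the dataset can learn the distortion instead of actual spoofing cues.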

kemalaraz commented 3 years ago

I am working on a similar project so I'd be glad if you can share the cleaned version.

Galeos93 commented 3 years ago

It would be much appreciated to have the cleaned version since this dataset has lots of issues.

Arvindia commented 3 years ago

@DoubangoTelecom it would be much appreciated if you could share the clean version. We'll cite your work as well in our research publications.

gagarinsname commented 3 years ago

@DoubangoTelecom it would be wonderful for the whole research community if you shared the cleaned version here. I tried to use this DB in my PhD tiny face liveness research and was surprised by the poor quality of this dataset.

IMO it seems to be a common issue for most open source anti-spoofing DBs regardless of the community efforts.

DoubangoTelecom commented 3 years ago

We haven't received the authorization from the owners to share the clean version. There may be legal issues we have to sort out once we get the authorization.

DoubangoTelecom commented 3 years ago

From https://github.com/Davidzhangyuanhan/CelebA-Spoof#dataset-agreement: "You agree not to further copy, publish or distribute any portion of the CelebA-Spoof dataset. Except, for internal use at a single site within the same organization it is allowed to make copies of the dataset."

gagarinsname commented 3 years ago

@DoubangoTelecom I wonder if the dataset image file name list can be considered as "a portion" of the dataset. I mean you could release a "cleaned" file list for "live" class part with all printed/low quality/poster rubbish faces removed and it will improve the DB quality tenfold.

P.S. Of course, I'm just kidding and it would be great to receive the DB owners response.
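
For illustration only, a minimal sketch of how such a file list could be used: an exclusion list of file names is applied to the full "live" list, so no image data is redistributed. The function name `filter_file_list` and the paths are hypothetical and do not come from the dataset.

```python
def filter_file_list(all_paths, excluded_paths):
    """Return the entries of all_paths that do not appear in excluded_paths.

    Blank lines and surrounding whitespace are ignored; order is preserved.
    """
    excluded = {p.strip() for p in excluded_paths if p.strip()}
    kept = []
    for path in all_paths:
        path = path.strip()
        if path and path not in excluded:
            kept.append(path)
    return kept


# Hypothetical file names standing in for the "live" class listing.
all_live = ["live/0001.jpg", "live/0002.jpg", "live/0003.jpg", "live/0004.jpg"]

# Hypothetical list of files judged to be printed, low-quality, or poster faces.
rejected = ["live/0002.jpg", "live/0004.jpg", ""]

cleaned = filter_file_list(all_live, rejected)
print(cleaned)
```

Whether a list of file names counts as "a portion" of the dataset under the agreement is still a question for the owners.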

DoubangoTelecom commented 3 years ago

5 months now and still no response from the owner while he is active on other tickets. So, closing the ticket and moving to other projects.

ZhangYuanhan-AI commented 3 years ago

Thanks for your interest in our work. We apologize for the above-mentioned data quality problems. We will discuss this with the legal department of SenseTime and respond once we have made a decision.

ZhangYuanhan-AI commented 3 years ago

Hi, please clarify the following things:

Which images do you want to clean (the path list)?
In what form do you want to share the cleaned data?
What license would the cleaned data have?

fdfdsf commented 2 years ago

@DoubangoTelecom and @Davidzhangyuanhan It would be much appreciated if you could improve the dataset and share the clean version. We'll cite your work as well in our research publications.

ZhangYuanhan-AI / CelebA-Spoof

Dataset quality #49