Open e-caste opened 2 years ago
About point 1: the `nsfw-detector` library (TensorFlow-based) seems to work quite well with random images found online.

TODO: test it against this dataset: https://github.com/alex000kim/nsfw_data_scraper. To build and run it:

```shell
docker build . -t docker_nsfw_data_scraper
docker run -v $(pwd):/root/nsfw_data_scraper docker_nsfw_data_scraper scripts/runall.sh
```
See the test notebook here: https://colab.research.google.com/drive/1y_w0t1ncJwN3vt4xNt0s_EIK_EHggO43?usp=sharing and the test data on Google Drive here: https://drive.google.com/drive/folders/1r_FKqwFpVnr29CtYYJAYTo0txAnjFNVy?usp=sharing
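For reference, a minimal sketch of how the detector's per-class scores could be turned into a moderation flag. The five class names match nsfw-detector's documentation; the thresholding rule itself is a hypothetical choice for this project, not part of the library:

```python
# nsfw-detector returns per-image probabilities over five classes:
#   drawings, hentai, neutral, porn, sexy
# Actual inference (requires tensorflow + the pretrained .h5 model):
#   from nsfw_detector import predict
#   model = predict.load_model("nsfw_mobilenet2.224x224.h5")
#   scores = predict.classify(model, "image.jpg")["image.jpg"]

# Classes we treat as explicit; a hypothetical grouping for this sketch.
NSFW_CLASSES = ("hentai", "porn", "sexy")

def is_nsfw(scores: dict, threshold: float = 0.7) -> bool:
    """Flag the image when the combined probability mass of the
    explicit classes reaches the threshold."""
    return sum(scores.get(c, 0.0) for c in NSFW_CLASSES) >= threshold
```

The threshold would need tuning against the dataset above before we trust it.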
Working on point 4 (extracting the text from images), I found the Python script below, which uses OpenCV and OCR to detect text and save it to a .txt file: https://www.geeksforgeeks.org/text-detection-and-extraction-using-opencv-and-ocr/
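A sketch along the lines of that script: OpenCV finds text regions, Tesseract reads them. The kernel size and the reading-order helper are my own assumptions; `opencv-python` and `pytesseract` are imported lazily so the pure helper stays usable without them installed:

```python
def extract_text(image_path: str, out_path: str) -> None:
    """Detect text regions with OpenCV and OCR them with Tesseract,
    appending the recognized text to a .txt file (as in the linked script).
    Requires opencv-python and pytesseract."""
    import cv2
    import pytesseract

    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Invert + binarize so text becomes white blobs on black.
    _, thresh = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_OTSU | cv2.THRESH_BINARY_INV)
    # Dilate to merge characters into word/line blocks (kernel size is a guess).
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (18, 18))
    dilated = cv2.dilate(thresh, kernel, iterations=1)
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)

    boxes = sort_reading_order([cv2.boundingRect(c) for c in contours])
    with open(out_path, "a") as f:
        for x, y, w, h in boxes:
            f.write(pytesseract.image_to_string(img[y:y + h, x:x + w]))

def sort_reading_order(boxes):
    """Order bounding boxes (x, y, w, h) top-to-bottom, then left-to-right."""
    return sorted(boxes, key=lambda b: (b[1], b[0]))
```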
Hugging Face just released huggingface.js, so this step could be delegated entirely to their APIs, directly from the browser (API usage costs must still be clarified): https://github.com/huggingface/huggingface.js/blob/main/packages/inference/README.md
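huggingface.js targets the browser, but it wraps the same hosted Inference API, which is plain HTTP. For illustration, a sketch that only builds the request with Python's stdlib (the model name is one example NSFW classifier on the Hub, to be verified; the token is a placeholder):

```python
import urllib.request

API_URL = "https://api-inference.huggingface.co/models/{model}"

def build_request(model: str, image_bytes: bytes,
                  token: str) -> urllib.request.Request:
    """Build (but do not send) a POST of raw image bytes to the hosted
    Inference API. Sending it would need a valid HF token and network access:
        urllib.request.urlopen(build_request(...))"""
    return urllib.request.Request(
        API_URL.format(model=model),
        data=image_bytes,
        headers={"Authorization": f"Bearer {token}"},
        method="POST",
    )
```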
The current pipeline is purely focused on language-based moderation, meaning that it won't care about any image (which we will filter out based on our custom enhanced MarkDown format, ref #97).
The base64-encoded images provided by the users should also be checked for NSFW content. Find a pre-trained model for this task; we don't need to build a dataset of our own for this objective.
Note also that users could upload very large images, so the model (likely CNN-based) should automatically rescale them to a resolution that allows for a reasonable inference time.
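The rescaling step could be as simple as capping the longest side before inference. A sketch of the size computation (the 1024 px cap is an arbitrary assumption; with Pillow the resize itself is one `thumbnail` call):

```python
def target_size(size, max_side=1024):
    """Compute a rescaled (width, height) so the longest side is at most
    max_side, preserving aspect ratio; smaller images are left untouched."""
    w, h = size
    longest = max(w, h)
    if longest <= max_side:
        return (w, h)
    scale = max_side / longest
    return (round(w * scale), round(h * scale))

# With Pillow (third-party), applied in place and aspect-preserving:
#   from PIL import Image
#   img = Image.open(path)
#   img.thumbnail((1024, 1024))
```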
This could be a starting point: https://github.com/SashiDo/content-moderation-image-api
```shell
base64 --input <path to image>
```
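On the server side, the user-supplied base64 payload should be decoded and sanity-checked before it ever reaches a model. A stdlib-only sketch that rejects non-base64 input and sniffs common image magic bytes (the signature list is deliberately minimal):

```python
import base64
import binascii

# Magic-byte prefixes for a few common formats (not exhaustive).
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}

def decode_image(b64: str):
    """Decode a user-supplied base64 string (optionally a data: URL) and
    return (raw_bytes, format), or None if it is not a recognized image."""
    if b64.startswith("data:"):  # strip e.g. "data:image/png;base64,"
        b64 = b64.split(",", 1)[-1]
    try:
        raw = base64.b64decode(b64, validate=True)
    except binascii.Error:
        return None
    for sig, fmt in SIGNATURES.items():
        if raw.startswith(sig):
            return raw, fmt
    return None
```

Only payloads that pass this check would be handed to the rescaling + NSFW-classification steps above.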