drunohazarb / 4chan-captcha-solver

GNU General Public License v3.0
173 stars, 3 forks

New captcha model just dropped! #6

JonseyJones closed this issue 11 months ago

JonseyJones commented 11 months ago

[Screenshot of the new captcha, 2023-12-18]

It has a big black circle at a random position.

sfeed1095 commented 11 months ago

you vil align ze circle and you vil like it

image

CaptainChicky commented 11 months ago

well if someone is able to manually compile a dataset of captchas+solutions (rip) they could retrain the nn

LittleEndu commented 11 months ago

How many examples would be required to retrain the model? The current model even gets the circle captchas correct some of the time.

moffatman commented 11 months ago

16k images of the new captcha. It was able to converge since 4k+, surprisingly easy.
https://captcha.chance.surf/bundle_16kblack/images.zip
https://captcha.chance.surf/bundle_16kblack/model.h5

drunohazarb commented 11 months ago

@moffatman Thanks a bunch for training the model!

7826d5ae53da72a6cf64dd2a82a4a3f5aec557b9

Yukariin commented 11 months ago

> it was able to converge since 4k+, surprisingly easy

Was it really? Accuracy on the old captchas (still in use) is ~78%, but it feels even lower, like 3-4 of 10 captchas being solved correctly. I fine-tuned @moffatman's 16k model on a combined dataset: 10k old + 16k new captchas from @moffatman and 3.5k old captchas from @coomdev. It achieves 98.7% (7388/7485) accuracy. model notebook
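The accuracy figure above can be checked with a few lines of Python. This is only a sketch: it assumes accuracy means exact match (a captcha counts as solved only if the whole predicted string equals the label), which the thread does not state. The helper name `exact_match_accuracy` and the sample strings are made up for illustration.

```python
def exact_match_accuracy(predictions, labels):
    """Fraction of captchas whose predicted text equals the label exactly."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one sample")
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)


# Figures quoted in the thread: 7388 correct out of 7485.
reported = 7388 / 7485
print(f"{reported:.1%}")  # 98.7%

# Toy example with made-up strings: 3 of the 4 predictions match.
preds = ["ABCD", "X2YZ", "KJ4M", "PQRS"]
truth = ["ABCD", "X2YZ", "KJ4N", "PQRS"]
print(exact_match_accuracy(preds, truth))  # 0.75
```

Under this definition a single wrong character makes the whole captcha a miss, which is one reason per-captcha accuracy can feel lower than a per-character number would suggest.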

Also, I'd recommend using the latest trained model to sanitise your own dataset: both the old 10k and the new 16k (and coomtech's one) contain some misaligned and/or mislabeled captchas.
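One way the sanitising step could look, as a rough sketch rather than what was actually run here: flag every sample where the model confidently disagrees with the stored label, then review those by hand. The `predict` callable, its `(text, confidence)` return shape, the 0.9 threshold, and the file names are all assumptions for illustration; a real setup would wrap the trained model instead of the lookup table used below.

```python
def find_suspect_samples(samples, predict, min_confidence=0.9):
    """Return samples whose label the model confidently disagrees with.

    samples: iterable of (image_id, label) pairs.
    predict: hypothetical callable, image_id -> (text, confidence in [0, 1]).
    """
    suspects = []
    for image_id, label in samples:
        text, confidence = predict(image_id)
        if text != label and confidence >= min_confidence:
            suspects.append((image_id, label, text))
    return suspects


# Stand-in for a real model: a lookup table of made-up outputs.
fake_outputs = {
    "a.png": ("ABCD", 0.99),  # agrees with the label
    "b.png": ("X2YZ", 0.97),  # confident disagreement, gets flagged
    "c.png": ("KJ4M", 0.40),  # disagrees, but the model is unsure
}
dataset = [("a.png", "ABCD"), ("b.png", "X2YS"), ("c.png", "KJ4N")]

suspects = find_suspect_samples(dataset, fake_outputs.__getitem__)
print(suspects)  # [('b.png', 'X2YS', 'X2YZ')]
```

Flagged samples are better reviewed than deleted automatically, since the model itself can be the one that is wrong.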

moffatman commented 11 months ago

@Yukariin Ah, a hybrid model, really interesting! I didn't realize the simpler captcha still gets served sometimes. I have a lot more data, probably a million+ old captchas, and a growing number of new captchas (50k atm). So I will play around with it. Thanks for all your contributions here!!

I did notice some misaligned ones, since I get most of these from kuroba users, and it doesn't use an optimal alignment method.