Closed frozenpandaman closed 1 year ago
Indeed. 4chan changed their captcha and the letters are now noticeably smaller, breaking most captcha solvers out there. I'm looking into it.
New captcha samples, a few thousand of them to help you with the new model. 1m831a.zip
I have implemented the workaround used by JKCS (scaling the captcha) to make the solver a little more bearable to use while I figure out how to train the new model 0ad7f6f312bd0d3be132b8ded5bf9e4c6c3923a5
Working pretty well! Thanks for this! (Feel free to close the issue if you consider it more-or-less 'resolved' since I think this improves usability a lot.)
@drunohazarb I'm JKCS' dev. I'm also working on training a new model (though I don't know much about what I'm doing). I managed to get some pretty promising results by training on the dataset from the zip Chance published, but I need many more captchas to make it at least as accurate as JKCS used to be.
I wrote this userscript to collect captchas for 4chanX users, since one person on /g/ said they'd be interested in contributing.
Looks like others are collecting captchas too. I got a prompt within the KurobaEx app which links this issue: https://github.com/moffatman/chan/issues/144
Quote: "the answer is I likely won't work on it completely. I can share generation and training scripts with you, or if you are allergic to ML I can share just the generation script and you'd update it to produce captchas that look like current 4chan's, or write your own. I then would use your script to generate a dataset of captchas and train on it. The one I made is written in Perl" https://cdn.discordapp.com/attachments/1137092744741929051/1137093467374366720/generate.7z "this is an example output needed to train the neural network" https://cdn.discordapp.com/attachments/1137092744741929051/1137093824489984142/x.7z "I use calamari-ocr to train, but again, if you are not in capacity of doing that, I can train myself" "generate 50k captchas, train a neural network from scratch for it, run my script that converts the trained neural network into a TensorFlow Lite file, convert that file into base64, and copy-paste that into the userscript" From our man Automatic. Hope it helps; you guys can contact him on Discord, see the link on his GitHub.
Now that kuroba-ex has added the option, I'm getting a lot more captchas. Will post once I have 10k in the next few days.
You can grab the first 10k here. It does include the ~1.5k I shared previously.
https://captcha.chance.surf/images_10k.zip
I didn't check for any misspelled letters; sometimes people will type a similar-looking letter, which is still accepted. But to train correctly, they should be renamed as follows:
```js
const remap = {
  'B': '8',
  'F': 'P',
  'U': 'V',
  'Z': '2',
  'O': '0'
};
```
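If you want to apply that remap to a dump before training, a minimal Python sketch (the `LABEL.png` directory layout is an assumption; adjust `rename_dataset` to however your dump is organized):

```python
import os

# Same mapping as the JS object above: collapse look-alike
# glyphs to the canonical character 4chan accepts.
REMAP = str.maketrans({'B': '8', 'F': 'P', 'U': 'V', 'Z': '2', 'O': '0'})

def normalize_label(label: str) -> str:
    """Replace similar-looking characters with the canonical glyph."""
    return label.translate(REMAP)

def rename_dataset(directory: str) -> None:
    """Rename LABEL.png files in place so filenames match canonical labels."""
    for name in os.listdir(directory):
        stem, ext = os.path.splitext(name)
        fixed = normalize_label(stem)
        if fixed != stem:
            os.rename(os.path.join(directory, name),
                      os.path.join(directory, fixed + ext))
```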
Ok, I'm starting a training run on a 75:25 split. If it looks good (and for now, metrics show it's bretty good) I'll release the notebook on JKCS and a beta model (since you'll need to implement both the image preprocessing and maybe inference), then do a training run on the full dataset and release a hopefully stable and complete model. I hope my laptop won't melt by then; it's smelling things I've never smelt before.
(Those are validation samples that weren't trained on.) Looks bretty good: it fails with at most one letter, and 3/4 of those failures are things the pre-process failed to properly clean up. I'll publish the models on JKCS's repo with the Jupyter notebook, go to sleep, and continue when I wake up with cleanup and full training, but in the meanwhile you can pull and evaluate the model and figure out how to integrate it in your apps. The model is 4M parameters, around 15MB, a ~50% increase from the previous model. Performance may be impacted on the userscript version; I suggest you figure out something to run the wasm version of TensorFlow in a userscript.
Do you have any estimate of the % fully correct? I found in my own solver that preprocessing is quite difficult now; the letters and noise are a lot closer in size. Just from that image -- 75%?
Yeah, I just ran against the entire 2514 validation set, got 1916 correct (76%)
Fixed the model and trained it on combined dataset (10k from @moffatman + ~500 from @coomdev):
captcha_75_25.h5 captcha_75_25.keras captcha_75_25.tf Updated training notebook here
So what is the accuracy now?
It solved 2529 of 2653 val images, so ~95%
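For reference, the accuracy figure quoted here is just exact-match over the validation set. A minimal sketch (the function name is mine, not from the notebook):

```python
def exact_match_accuracy(predictions, labels):
    """Fraction of captchas where every character matches the label.
    e.g. 2529 correct of 2653 images ~ 95.3%, as reported above."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)
```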
Woah, impressive, especially with the way you cut the number of parameters. (Also, you've put the wrong link for captcha_75_25.keras; it's the same as captcha_75_25.h5.)
.keras file is basically a .h5 file - they're the same:
My bad. My exporter was shitting up because of the extension so I thought otherwise.
This solver is awesome, thanks @Yukariin. I wonder if any more images would be helpful? I have another 10k now, but it is already pretty much there. The only thing I found: rarely it confuses D and 0 (which is pretty difficult even by eye sometimes). Also, I missed the mapping '5' -> 'S' when I posted before, so the model has been trained with a small number of '5's where it should be 'S'.
Do I have to update the kuruba app in any way to get the better captcha solver? Or is it automatic?
> Do I have to update the kuruba app in any way to get the better captcha solver? Or is it automatic?

Doesn't yours need a captcha APK as a complement to autosolve?
> Do I have to update the kuruba app in any way to get the better captcha solver? Or is it automatic?
>
> Doesn't yours need a captcha APK as a complement to autosolve?

Yeah
> Do I have to update the kuruba app in any way to get the better captcha solver? Or is it automatic?
>
> Doesn't yours need a captcha APK as a complement to autosolve?

Yeah. So this isn't for Android kuruba?
@moffatman more data can probably improve the model's accuracy slightly, especially for cases like D-0. Have you tried running my model on your new data subset? What's the accuracy?
I also noticed that in @coomdev's dump, the image PHPJR.png is stored in its own directory named PHPJ, with the image named just R.png. That's probably a userscript or archiver bug.
I actually had a bug in my own app UI, sometimes it showed letters as 0 instead of the proper one. After I fixed that, I only saw 1 mistake out of like 50+ tries (0 instead of A). I haven't wired it all up to run in batch yet.
@KurubaEX I don't know what's kuruba, but you can't just swap the old automatic1111 model for this one: the input tensor shape is different, and there's also a preprocessing step to apply (connected-component labeling, then keeping the 8 biggest components by area, then discarding components with height below 20px).
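For anyone wiring this up, here's a rough, dependency-light sketch of that preprocessing step (pure Python flood fill over a NumPy mask, rather than whatever the notebook actually uses, so treat it as illustrative only):

```python
from collections import deque
import numpy as np

def keep_letter_components(binary, max_components=8, min_height=20):
    """Filter a binarized captcha (letters = 1, background = 0):
    keep the `max_components` largest connected components by area,
    then discard any whose height is below `min_height` pixels."""
    h, w = binary.shape
    seen = np.zeros((h, w), dtype=bool)
    comps = []  # (area, min_row, max_row, pixel list)
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] and not seen[sy, sx]:
                # BFS flood fill for one 4-connected component
                q = deque([(sy, sx)])
                seen[sy, sx] = True
                pixels = []
                while q:
                    y, x = q.popleft()
                    pixels.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny, nx] and not seen[ny, nx]):
                            seen[ny, nx] = True
                            q.append((ny, nx))
                rows = [p[0] for p in pixels]
                comps.append((len(pixels), min(rows), max(rows), pixels))
    comps.sort(key=lambda c: -c[0])  # largest area first
    out = np.zeros_like(binary)
    for area, rmin, rmax, pixels in comps[:max_components]:
        if rmax - rmin + 1 >= min_height:
            for y, x in pixels:
                out[y, x] = 1
    return out
```

In practice something like OpenCV's `cv2.connectedComponentsWithStats` would do the labeling far faster; the logic is the same.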
@Yukariin yeah, looks like someone typed a backslash in their captcha and it was still accepted by 4chan. My archive viewer still sees "PHPJ\R" as a single folder name, but yours may have interpreted the \ as a DOS directory separator. I'll fix this serverside, thanks for the report.
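One possible serverside fix is sanitizing the typed answer before using it as a filename, so separators can't create nested directories (hypothetical helper, not the actual archive code):

```python
def safe_filename(label: str) -> str:
    """Replace characters that filesystems treat as path separators.
    Illustrative only; the real archiver may normalize differently."""
    for sep in ('/', '\\'):
        label = label.replace(sep, '_')
    return label
```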
Updated notebook - fixed 5->S, added accuracy calculation on val data
Also retrained the model, accuracy is slightly better (95.76% vs 95.39% before):
captcha_75_25.h5
Noticed errors on some samples, probably caused by the pre-processing algo - it discards letter parts. For example, the first Y:
Yeah, it's not perfect, because of the lines used to cut characters. I already tried an opening operation to work around that, but it resulted in worse segmentation overall. Though I haven't tried doing that after eliminating the smaller noise... You could also try allowing more than the 8 largest components, assuming the model can withstand more noise. But the base accuracy is so high that the chance of failing two consecutive captchas is basically non-existent, so I don't think it's worth putting much more effort into preprocessing.
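That back-of-the-envelope claim checks out: at ~96% accuracy, two failures in a row is well under 1% (this assumes attempts are independent, which is plausible since a fresh captcha is served after each failure, but it is an assumption):

```python
accuracy = 0.9576            # validation accuracy reported above
p_fail_once = 1 - accuracy
# Probability of failing two consecutive captchas, assuming independence
p_fail_twice = p_fail_once ** 2
print(f"{p_fail_twice:.4%}")  # roughly 0.18%
```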
Yeah not sure if it’s worth it.
Okay, it's getting really interesting. I decided to try training the model on raw data without any pre-processing, and the results are surprisingly good: I got 2657/2695 on val data, which is 98.59%. I'd like to ask @moffatman and @coomdev to test it on your data and in your apps/extensions: Notebook captcha_75_25.h5 captcha_75_25.tf
I dropped my own preprocessing after the new updates, it's too hard to do it correctly. So not surprising. Will try it on a ton of "previously unseen" captchas and update.
Yesterday's model (idtjid.h5)
Correct: 0.9201347125330768 (3825 / 4157)
Today's model (9qsymy.h5)
Correct: 0.9579023334135194 (3982 / 4157)
The difference is probably because of pre-processing on the first model (idtjid.h5). Still, I'm surprised it shows 90+ accuracy on ~5k of "previously unseen" files.
Actually, the problem was that I forgot to apply the substitutions to the validation labels. I was also using preprocessing with the old model and not with the new one.
Yesterday's model (idtjid.h5)
Correct: 0.9485205677171037 (3943 / 4157)
Today's model (9qsymy.h5)
Correct: 0.9862881885975463 (4100 / 4157)
Crazy good!
Damn, it turns out all I did was detrimental :^(
As of yesterday it's now quite bad at solving, almost always missing characters. I'd say it got it right about 80-90% of the time before and now it's around ~~10-20%~~ 5% (at most).

EDIT: ok, I know the answer is "yes, something changed", so this is really "will you (or someone) fix it please & thank you"