drunohazarb / 4chan-captcha-solver

GNU General Public License v3.0

Did something change? #1

Closed frozenpandaman closed 1 year ago

frozenpandaman commented 1 year ago

As of yesterday it's now quite bad at solving, almost always missing characters. I'd say it got it right about 80-90% of the time before, and now it's around 5% (at most).

EDIT: ok i know the answer is "yes something changed" so this is really "will you (or someone) fix it please & thank you"

drunohazarb commented 1 year ago

Indeed. 4chan changed their captcha and the letters are now noticeably smaller, breaking most captcha solvers out there. I'm looking into it.

HairyMilkshakes commented 1 year ago

New captcha samples, a few thousand of them to help you with the new model. 1m831a.zip

drunohazarb commented 1 year ago

I have implemented the workaround used by JKCS (scaling the captcha) to make the solver a little more bearable to use while I figure out how to train the new model: 0ad7f6f312bd0d3be132b8ded5bf9e4c6c3923a5
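As a rough illustration of what "scaling the captcha" could look like, here is a minimal sketch that enlarges a pixel grid with nearest-neighbour sampling so that smaller letters get closer to the size an older model expects. The function name, the integer scale factor, and the choice of nearest-neighbour sampling are assumptions for illustration; the thread does not say how the referenced commit or JKCS actually scales the image.

```python
def scale_nearest(pixels, factor):
    """Upscale a 2D grid of pixel values by an integer factor.

    Hypothetical helper: nearest-neighbour sampling is an assumption,
    not necessarily what the solver's workaround uses.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    scaled = []
    for row in pixels:
        # Repeat every pixel `factor` times horizontally.
        wide_row = [value for value in row for _ in range(factor)]
        # Repeat the widened row `factor` times vertically (copy each time
        # so the output rows are independent lists).
        scaled.extend(list(wide_row) for _ in range(factor))
    return scaled


small = [[0, 1],
         [1, 0]]
big = scale_nearest(small, 2)
```

A 2x2 grid scaled by 2 becomes a 4x4 grid in which each original pixel covers a 2x2 block.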

frozenpandaman commented 1 year ago

Working pretty well! Thanks for this! (Feel free to close the issue if you consider it more-or-less 'resolved' since I think this improves usability a lot.)

coomdev commented 1 year ago

@drunohazarb I'm JKCS' dev. I'm also working on training a new model (though I don't know much about what I'm doing). I managed to get some pretty promising results by training on the dataset from the zip published by Chance's, but I need many more captchas to make it at least as accurate as JKCS used to be.

I wrote this userscript to collect captchas, for 4chanX users since one person on /g/ said they'd be interested in contributing.

https://gist.github.com/coomdev/47d1b243a53c3690dcff14a6c01b9267/raw/dba569d8a93f0a8806c085e8a941a1b4d7e189f1/cc.user.js

frozenpandaman commented 1 year ago

Looks like others are collecting captchas too. I got a prompt within the KurobaEx app which links this issue: https://github.com/moffatman/chan/issues/144

JonseyJones commented 1 year ago

Quote: "the answer is I likely won't work on it completely. I can share generation and training scripts with you, or if you are allergic to ML I can share just the generation script and you'd update it to produce captchas that look like current 4chan's, or write your own. I then would use your script to generate a dataset of captchas and train on it. The one I made is written in Perl: https://cdn.discordapp.com/attachments/1137092744741929051/1137093467374366720/generate.7z

This is an example output needed to train the neural network: https://cdn.discordapp.com/attachments/1137092744741929051/1137093824489984142/x.7z

I use Calamari OCR to train, but again, if you are not in capacity of doing that, I can train myself."

"generate 50k captchas, train a neural network from scratch for it, run my script that converts the trained neural network into tensorflow-lite file, convert that file into base64, and copy-paste that into the userscript"

From our man Automatic. Hope it helps. You guys can contact him on Discord; see the link on his GitHub.
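The "convert that file into base64" step of the quoted pipeline can be sketched with the Python standard library. This is only an illustration: the function names and the example path are hypothetical, and the conversion script mentioned in the quote is not shown in the thread.

```python
import base64
from pathlib import Path


def encode_model_bytes(data: bytes) -> str:
    """Return the base64 text form of raw model bytes."""
    return base64.b64encode(data).decode("ascii")


def encode_model_file(model_path: str) -> str:
    """Read a model file (hypothetical path) and return it as base64 text."""
    return encode_model_bytes(Path(model_path).read_bytes())


def decode_model_text(encoded: str) -> bytes:
    """Inverse of encode_model_bytes, useful for checking the round trip."""
    return base64.b64decode(encoded)


# Round trip on placeholder bytes standing in for a real model file.
sample = b"\x00\x01placeholder-model-bytes"
encoded = encode_model_bytes(sample)
restored = decode_model_text(encoded)
```

The resulting string is plain ASCII, which is what makes it possible to paste it into a script as a string literal.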

moffatman commented 1 year ago

Now that kuroba-ex has added the option, I'm getting a lot more captchas. Will post once I have 10k in the next few days.

moffatman commented 1 year ago

You can grab the first 10k here. It does include the ~1.5k I shared previously.

https://captcha.chance.surf/images_10k.zip

I didn't check for any misspelled letters. Sometimes people will type a similar-looking letter, which is still accepted, but to train correctly, those labels should be renamed as follows:

const remap = {
  'B': '8',
  'F': 'P',
  'U': 'V',
  'Z': '2',
  'O': '0'
};
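For anyone preparing labels in a Python notebook, the same substitution table could be applied along these lines. This is a sketch under assumptions: `normalize_label` is a hypothetical helper name, and labels are assumed to be plain strings taken from the image file names.

```python
# Same pairs as the `remap` object above: typed character -> intended character.
REMAP = {"B": "8", "F": "P", "U": "V", "Z": "2", "O": "0"}


def normalize_label(label, remap=REMAP):
    """Replace look-alike characters in a label; other characters pass through."""
    return "".join(remap.get(ch, ch) for ch in label)


fixed = normalize_label("BFUZO")
untouched = normalize_label("XK4")
```

Extra pairs can be passed through the `remap` argument, for example `{**REMAP, "5": "S"}` for the '5' -> 'S' case mentioned later in this thread.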

coomdev commented 1 year ago

Ok, I'm starting a training on a 75:25 split. If it looks good (and for now, metrics show it's bretty good), I'll release the notebook on JKCS and a beta model (since you'll need to implement both the image preprocessing and maybe inference), then do a training on the full dataset and release a hopefully stable and complete model. I hope my laptop won't melt by then, it's smelling things I've never smelt before.
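A 75:25 train/validation split like the one described can be sketched as follows. The helper name, the fixed seed, and the use of a plain shuffle are assumptions; the actual notebook may split differently.

```python
import random


def split_dataset(items, train_fraction=0.75, seed=0):
    """Shuffle items deterministically and split them into train/validation lists."""
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train_items, val_items = split_dataset(range(100))
```

Keeping the validation part out of training is what makes the accuracy figures reported later meaningful.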

coomdev commented 1 year ago

image (those are validation samples that weren't trained on) looks bretty good, it fails with at most one letter, and 3/4 of those fails are things that the pre-process failed to properly clean up. I'll publish the models on JKCS's repo with the jupyter notebook, go to sleep, and continue when I wake up with cleaning up and full training. In the meantime you can pull and evaluate the model and figure out how to integrate it in your apps. The model is 4M parameters, around 15MB, a ~50% increase from the previous model. Performance may be impacted on the userscript version, so I suggest you figure out something to run the wasm version of tensorflow in a userscript.

moffatman commented 1 year ago

Do you have any estimate of the % fully correct? I found on my own solver preprocessing is quite difficult now, the letters and noise are a lot closer in size. Just from that image -- 75%?

coomdev commented 1 year ago

Yeah, I just ran against the entire 2514 validation set, got 1916 correct (76%)
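The figure quoted here counts a captcha as correct only when every character matches. A minimal sketch of that metric, assuming predictions and labels are plain strings (the function name is hypothetical):

```python
def exact_match_accuracy(predictions, labels):
    """Fraction of captchas whose predicted text equals the label exactly."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(pred == truth for pred, truth in zip(predictions, labels))
    return correct / len(labels)


# Toy example: one of the two predictions has a wrong last character.
toy_accuracy = exact_match_accuracy(["AB12", "CD34"], ["AB12", "CD3X"])

# The count reported above, expressed as a fraction.
reported = 1916 / 2514  # about 0.762
```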

Yukariin commented 1 year ago

Fixed the model and trained it on combined dataset (10k from @moffatman + ~500 from @coomdev): image

captcha_75_25.h5 captcha_75_25.keras captcha_75_25.tf Updated training notebook here

moffatman commented 1 year ago

So what is the accuracy now?

Yukariin commented 1 year ago

It solved 2529 of 2653 val images, so ~95%

coomdev commented 1 year ago

Woah, impressive, especially with the way you cut the number of parameters. (Also you've put the wrong link for captcha_75_25.keras, it's the same as captcha_75_25.h5)

Yukariin commented 1 year ago

.keras file is basically a .h5 file - they're the same:

Screenshot 2023-08-10 at 22 44 26
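One way to check what container a given model file really uses, regardless of its extension, is to look at its first bytes. The sketch below compares them against the HDF5 and ZIP signatures. The helper names are hypothetical, and which container a `.keras` file ends up in depends on the Keras version that wrote it, so treat the result as a property of the specific file rather than of the extension.

```python
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"  # signature at the start of an HDF5 file
ZIP_MAGIC = b"PK\x03\x04"          # signature at the start of a ZIP archive


def sniff_container(header: bytes) -> str:
    """Classify a file by its leading bytes."""
    if header.startswith(HDF5_MAGIC):
        return "hdf5"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 8 bytes of a file (hypothetical path) and classify it."""
    with open(path, "rb") as handle:
        return sniff_container(handle.read(8))


kind = sniff_container(HDF5_MAGIC + b"rest-of-file")
```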

coomdev commented 1 year ago

My bad. My exporter was shitting up because of the extension so I thought otherwise.

moffatman commented 1 year ago

This solver is awesome thanks @Yukariin. I wonder if any more images could be helpful? I have another 10k more now. But it is already pretty much there. Only thing I found, rarely it does confuse D and 0 (which is pretty difficult even by eye sometimes). Also I missed the mapping '5' -> 'S' when I posted before, so the model has been trained with some small number of '5's where it should be 'S'.

image

KurubaEX commented 1 year ago

Do I have to update the kuruba app in any way to get the better captcha solver? Or is it automatic?

HairyMilkshakes commented 1 year ago

Do I have to update the kuruba app in any way to get the better captcha solver? Or is it automatic?

Doesn't yours need a captcha APK as complement to autosolve?

KurubaEX commented 1 year ago

Do I have to update the kuruba app in any way to get the better captcha solver? Or is it automatic?

Doesn't yours need a captcha APK as complement to autosolve?

Yeah. So this isn't for Android kuruba?

Yukariin commented 1 year ago

@moffatman more data can probably slightly improve the model's accuracy, especially for cases like D-0. Have you tried running my model on your new data subset? What's the accuracy? I also noticed that in @coomdev's dump the image PHPJR.png is stored in its own directory named PHPJ, with the image named just R.png. That's probably a userscript or archiver bug.

moffatman commented 1 year ago

I actually had a bug in my own app UI, sometimes it showed letters as 0 instead of the proper one. After I fixed that, I only saw 1 mistake out of like 50+ tries (0 instead of A). I haven't wired it all up to run in batch yet.

coomdev commented 1 year ago

@KurubaEX I don't know what kuruba is, but you can't just swap the old automatic1111 model for this one: the input tensor shape is different, and there's also a preprocessing step to apply (connected component labeling, then keeping the 8 biggest components by area, then discarding components with height below 20px).
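The preprocessing described in parentheses can be sketched in plain Python as below. It is an illustration under assumptions, not the JKCS implementation: the input is assumed to be an already binarised grid of 0/1 values, components are 4-connected, "area" is the pixel count, and "height" is the bounding-box height. All function names are hypothetical.

```python
from collections import deque


def connected_components(mask):
    """Return a list of components, each a list of (row, col) foreground pixels."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue, pixels = deque([(y, x)]), []
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def filter_components(mask, keep_largest=8, min_height=20):
    """Keep the largest components by area, then drop those that are too short."""
    components = sorted(connected_components(mask), key=len, reverse=True)
    cleaned = [[0] * len(row) for row in mask]
    for pixels in components[:keep_largest]:
        rows = [y for y, _ in pixels]
        if max(rows) - min(rows) + 1 >= min_height:
            for y, x in pixels:
                cleaned[y][x] = 1
    return cleaned


# Tiny example with a lowered height threshold: only the 3-pixel column survives.
demo = [[1, 0, 0, 1],
        [1, 0, 0, 0],
        [1, 0, 1, 0]]
demo_cleaned = filter_components(demo, keep_largest=2, min_height=2)
```

The defaults mirror the numbers given in the comment (8 components, 20px); the demo uses smaller values only so the example fits in a few lines.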

@Yukariin yeah looks like someone typed in a backslash in their captcha and it was still accepted by 4chan. My archive viewer still sees "PHPJ\R" as its own foldername, but yours may have interpreted the \ as a DOS directory separator. I'll fix this serverside, thanks for the report.
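A collector could guard against labels like "PHPJ\R" by checking them before they are used as file names. The sketch below is one possible check; the allowed alphabet (uppercase ASCII letters and digits) and the helper names are assumptions, not something the thread specifies.

```python
import string

# Assumed label alphabet for illustration only.
ALLOWED = set(string.ascii_uppercase + string.digits)


def is_clean_label(label):
    """True when the label is non-empty and uses only the assumed alphabet."""
    return bool(label) and all(ch in ALLOWED for ch in label)


def label_to_filename(label, extension=".png"):
    """Build a file name from a label, refusing anything with stray characters."""
    if not is_clean_label(label):
        raise ValueError(f"unexpected character in label: {label!r}")
    return label + extension


good = is_clean_label("PHPJR")
bad = is_clean_label("PHPJ\\R")  # contains a backslash
```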

Yukariin commented 1 year ago

Updated notebook: fixed 5->S and added accuracy calculation on val data. Also retrained the model; accuracy is slightly better (95.76% vs 95.39% before): captcha_75_25.h5

Noticed errors on some samples, probably caused by the pre-processing algo: it discards letter parts. For example, the first Y: image

coomdev commented 1 year ago

Yeah, it's not perfect, because of the lines used to cut characters. I already tried to do an opening operation to work around that but it resulted in worse segmentation overall. Though I haven't tried doing that after eliminating the smaller noise... You could also try allowing more than the 8 largest components, assuming the model can withstand more noise. But the base accuracy is so high that the chance of failing two consecutive captchas is basically non-existent, so I don't think it's worth putting much more effort into preprocessing.
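The point about consecutive failures can be made concrete with a quick calculation. Assuming each attempt fails independently (an assumption, since hard captchas may cluster) and taking the 95.76% accuracy reported above:

```python
def consecutive_failure_probability(accuracy, attempts=2):
    """Probability that `attempts` independent solves all fail."""
    return (1.0 - accuracy) ** attempts


# With 95.76% accuracy, two failures in a row happen in roughly 0.18% of cases.
two_in_a_row = consecutive_failure_probability(0.9576)
```

That works out to about one double failure per 550 pairs of attempts, which supports the view that further preprocessing work has limited payoff.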

Yukariin commented 1 year ago

Yeah not sure if it’s worth it.

Yukariin commented 1 year ago

Okay, it's getting really interesting. I decided to try to train the model on raw data without any pre-processing, and the results are surprisingly good: I got 2657/2695 on val data, which is 98.59%. I'd like to ask @moffatman and @coomdev to test it on your data and in your apps/extensions: Notebook captcha_75_25.h5 captcha_75_25.tf

moffatman commented 1 year ago

I dropped my own preprocessing after the new updates, it's too hard to do it correctly. So not surprising. Will try it on a ton of "previously unseen" captchas and update.

moffatman commented 1 year ago

Yesterday's model (idtjid.h5)

Correct: 0.9201347125330768 (3825 / 4157)

Today's model (9qsymy.h5)

Correct: 0.9579023334135194 (3982 / 4157)

Yukariin commented 1 year ago

The difference is probably because of pre-processing on the first model (idtjid.h5). Still, I'm surprised it shows 90+ accuracy on ~5k of "previously unseen" files.

moffatman commented 1 year ago

Actually, the problem was that I forgot to apply the substitutions to the validation labels. I was using preprocessing with the old one and not with the new one.

Yesterday's model (idtjid.h5)

Correct: 0.9485205677171037 (3943 / 4157)

Today's model (9qsymy.h5)

Correct: 0.9862881885975463 (4100 / 4157)

Crazy good!

coomdev commented 1 year ago

Damn, it turns out all I did was detrimental :^(

drunohazarb commented 1 year ago

80c0560