debdutgoswami / aadhaar-verification

I tried to make a simple Python project to check whether a particular Aadhaar card actually exists or not.
Apache License 2.0
7 stars · 2 forks

automate Captcha recognition #1

Closed debdutgoswami closed 4 years ago

debdutgoswami commented 4 years ago

I am looking for a suitable way to automate the captcha recognition.

Things I have already tried: I tried using pytesseract, but it didn't work, even though I pre-processed the captcha image before passing it on to pytesseract.
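For reference, a minimal sketch of the kind of pre-processing attempted here (grayscale, median filter, hard threshold) before handing the image to pytesseract. The threshold value and the tesseract flags are illustrative assumptions, not necessarily what this site's captcha needs:

```python
from PIL import Image, ImageFilter

def preprocess(img: Image.Image, threshold: int = 140) -> Image.Image:
    """Grayscale, denoise, and binarize a captcha image."""
    gray = img.convert("L")                               # drop color information
    gray = gray.filter(ImageFilter.MedianFilter(size=3))  # remove speckle noise
    return gray.point(lambda p: 255 if p > threshold else 0)  # hard threshold

def read_captcha(path: str) -> str:
    import pytesseract  # requires the Tesseract binary to be installed
    img = preprocess(Image.open(path))
    # treat the image as a single text line restricted to alphanumerics
    config = "--psm 7 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    return pytesseract.image_to_string(img, config=config).strip()
```

Even with this, heavily distorted or overlapping glyphs usually defeat plain OCR, which is why a trained model is often the fallback.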

Possible solutions: using deep learning is one option, but I am looking for something other than deep learning.

If you have any solutions, feel free to suggest them.

Thank you.

sparkingdark commented 4 years ago

https://github.com/kerlomz/captcha_trainer — I found this repo on GitHub; maybe it will help you detect the captcha.

debdutgoswami commented 4 years ago

I am guessing that would work, but it's impossible for me to understand the README, as it is written in Chinese (or Korean, I'm not quite sure). I guess I need to translate the page :(

sparkingdark commented 4 years ago

Project introduction: verification-code recognition. This project is based on CNN5/ResNet + BLSTM/LSTM/GRU/SRU/BSRU + CTC for captcha recognition. This project is for training only. If you need to deploy the model, please see: https://github.com/kerlomz/captcha_platform (general-purpose web service, invoked via HTTP requests), https://github.com/kerlomz/captcha_library_c (dynamic-link library, DLL invocation, based on TensorFlow C++), https://github.com/kerlomz/captcha_demo_csharp (C# source invocation, based on TensorFlowSharp).

Many people ask me whether deploying the recognition service also requires a GPU. My answer: it is completely unnecessary. Ideally, train on a GPU and deploy the recognition service on a CPU; if deployment also carried such a high cost, what practical significance and application scenarios would be left? Measured on a minimal Alibaba Cloud instance (1 core, 1 GB RAM), one recognition takes about 30 ms; on my i7-8700K it takes 10-15 ms.

Precautions. How to train on the CPU:

This project installs the TensorFlow GPU version by default, and training on a GPU is recommended. If you need to switch to CPU training, replace tensorflow-gpu==1.6.0 with tensorflow==1.6.0 in the requirements.txt file; nothing else needs to change.

About the LSTM network:

Ensure that the feature map output by the CNN is, along the width, at least three times larger than the maximum number of characters by the time it is fed into the LSTM; that is, time_step must be greater than or equal to three times the maximum character count.
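A back-of-the-envelope check of the rule above. The width-downsampling factor depends on the chosen CNN, so the 8x used in the example is an assumption, not the project's actual value:

```python
def timestep_ok(image_width: int, width_downsample: int, max_chars: int) -> bool:
    """Check the rule of thumb: the CNN's output feature-map width
    (which becomes the LSTM's time_step) should be >= 3 * max characters."""
    time_step = image_width // width_downsample
    return time_step >= 3 * max_chars

# e.g. a 150 px wide image through a CNN that downsamples width 8x
# yields time_step = 18, enough for captchas of up to 6 characters
```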

Solving the "No valid path found" problem:

Modify the Pretreatment -> Resize parameter in model.yaml and adjust it to an appropriate value. Summarizing training experience across a hundred kinds of captchas, you can try this fairly general value: Resize: [150, 50]. Alternatively, use tutorial.py (which automatically generates the configuration files, packages the samples, and integrates training): fill in the training-set path and execute it.

Parameter modification:

Remember, ModelName is the only flag that binds a model. If you modify training parameters such as ImageWidth, ImageHeight, Resize, CharSet, CNNNetwork, RecurrentNetwork, HiddenNum, etc., you must either delete the old files under the model path and retrain, or retrain under a new ModelName; otherwise the run is treated as resuming from an old checkpoint.

Preparation: if you are going to train on a GPU, install CUDA and cuDNN first; the officially tested version combinations are listed at https://www.tensorflow.org/install/install_sources#tested_source_configurations. Third-party compiled TensorFlow wheel (WHL) packages can be downloaded from GitHub:

https://github.com/fo40225/tensorflow-windows-wheel

CUDA download address: https://developer.nvidia.com/cuda-downloads

cuDNN download address: https://developer.nvidia.com/rdp/form/cudnn-download-survey (requires registration)

The versions I am using: CUDA 10 + cuDNN 7.3.1 + TensorFlow 1.12.

Environment installation: install a Python 3.6 environment (including pip).

Install virtualenv: pip3 install virtualenv

Create a separate virtual environment for the project:

virtualenv -p /usr/bin/python3 venv  # venv is the name of the virtual environment
cd venv/  # enter the virtual environment directory
source bin/activate  # activate the current virtual environment
cd captcha_trainer  # captcha_trainer is the project path
pip install -r requirements.txt  # install the dependency list for this project

Start

  1. Architecture and process. This project depends on the training configuration config.yaml and the model configuration model.yaml. When initializing the project, copy config_demo.yaml and model_demo.yaml to the current directory and rename them config.yaml and model.yaml. Alternatively, you can use tutorial.py to set up the model configuration automatically.

Training process: after configuring the two files, run trains.py; it reads the configuration, builds the neural-network computation graph according to model.yaml, and trains according to the parameters in config.yaml.

There are several suggestions for the training parameters in config.yaml:

BatchSize (training batch size) and TestBatchSize (test batch size) deserve everyone's attention; adjust them according to your graphics card. With little GPU memory, keep BatchSize small, and the same goes for TestBatchSize. The default configuration I provide assumes 8 GB of memory with usage capped at about 50%.

LearningRate (learning rate) also deserves attention. The essence of deep learning is adjusting parameters, and most models can keep the default without tuning. If you want higher recognition accuracy, start with 0.01 for fast convergence until accuracy reaches roughly 95%, then switch to around 0.001/0.0001 to push accuracy further.

TestSetNum is designed for lazy people (myself included): it carves the test set out of the training set according to the given count. The premise is that the test set must be random, random, random (important things get said three times). Some people open Windows Explorer and drag-select a few hundred files; the default sort is by name, and if the name is the label, the selection is not random. You may end up with a test set whose labels all fall between 0 and 3, and the model may then appear never to converge.
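A tiny sketch of the kind of random hold-out being recommended above (a hypothetical helper, not part of the project):

```python
import random

def split_test_set(files, test_set_num, seed=42):
    """Randomly hold out `test_set_num` samples as the test set,
    so the split is not biased by filename ordering."""
    files = list(files)
    random.Random(seed).shuffle(files)  # shuffle before slicing
    return files[test_set_num:], files[:test_set_num]
```

Drag-selecting a name-sorted block of files in a file manager, by contrast, yields exactly the biased split warned about above.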

TrainRegex and TestRegex are the regular expressions used to match sample filenames; please name your samples consistently with the example I gave. If your files are named like 1111.jpg, here is a batch-conversion script:

import hashlib
import os
import re

# training-set path
root = r"D:\TrainSet***"
all_files = os.listdir(root)

for file in all_files:
    old_path = os.path.join(root, file)

    # skip files that have already been renamed
    if len(file.split(".")[0]) > 32:
        continue

    # rename to: label_md5-of-file-contents.extension
    with open(old_path, "rb") as f:
        _id = hashlib.md5(f.read()).hexdigest()
    new_path = os.path.join(root, file.replace(".", "_{}.".format(_id)))

    # duplicate labels produce names like "abcd (1).jpg"; strip the " (N)" suffix
    new_path = re.sub(r" \(\d+\)", "", new_path)
    print(new_path)
    os.rename(old_path, new_path)
  2. Configuration: model.yaml - Model Config

    - requirement.txt - GPU: tensorflow-gpu, CPU: tensorflow

    - If you use the GPU version, you need to install some additional applications.

    System:
      DeviceUsage: 0.7

    ModelName: Corresponding to the model file in the model directory,

    - such as YourModelName.pb, fill in YourModelName here.

    CharSet: Provides a default optional built-in solution:

    - [ALPHANUMERIC, ALPHANUMERIC_LOWER, ALPHANUMERIC_UPPER,

    - NUMERIC, ALPHABET_LOWER, ALPHABET_UPPER, ALPHABET, ALPHANUMERIC_LOWER_MIX_CHINESE_3500]

    - Or you can use your own customized character set like: ['a', '1', '2'].

    CharMaxLength: Maximum length of characters, used for label padding.
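To illustrate what label padding means here (an illustrative helper; the project's actual padding token may differ):

```python
def pad_label(label: str, char_max_length: int, pad_char: str = " ") -> str:
    """Right-pad a label to CharMaxLength so every label in a batch
    has the same width."""
    if len(label) > char_max_length:
        raise ValueError("label exceeds CharMaxLength")
    return label.ljust(char_max_length, pad_char)
```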

    CharExclude: CharExclude should be a list, like: ['a', '1', '2']

    - which is convenient for users to freely combine character sets.

    - If you don't want to define the character set manually,

    - you can choose a built-in character set

    - and set the characters to be excluded by CharExclude parameter.

    Model:
      Sites: [ 'YourModelName' ]
      ModelName: YourModelName
      ModelType: 150x50
      CharSet: ALPHANUMERIC_LOWER
      CharExclude: []
      CharReplace: {}
      ImageWidth: 150
      ImageHeight: 50

    Binaryzation: [-1: Off, >0 and < 255: On].

    Smoothing: [-1: Off, >0: On].

    Blur: [-1: Off, >0: On].

    Resize: [WIDTH, HEIGHT]

    - If the image size is too small, the training effect will be poor and you need to zoom in.

    ReplaceTransparent: [True, False]

    - True: Convert transparent images in RGBA format to opaque RGB format,

    - False: Keep the original image

    Pretreatment:
      Binaryzation: -1
      Smoothing: -1
      Blur: -1
      Resize: [150, 50]
      ReplaceTransparent: True

    CNNNetwork: [CNN5, ResNet, DenseNet]

    RecurrentNetwork: [BLSTM, LSTM, SRU, BSRU, GRU]

    - The recommended configuration is CNN5+BLSTM / ResNet+BLSTM

    HiddenNum: [64, 128, 256]

    - This parameter indicates the number of nodes used to remember and store past states.

    Optimizer: The optimization algorithm used to compute gradients and minimize the loss.

    - [AdaBound, Adam, Momentum]

    NeuralNet:
      CNNNetwork: CNN5
      RecurrentNetwork: BLSTM
      HiddenNum: 64
      KeepProb: 0.98
      Optimizer: AdaBound
      PreprocessCollapseRepeated: False
      CTCMergeRepeated: True
      CTCBeamWidth: 1
      CTCTopPaths: 1

    TrainsPath and TestPath: The local absolute path of your training and testing set.

    DatasetPath: Package a sample of the TFRecords format from this path.

    TrainRegex and TestRegex: by default match filenames like apple_20181010121212.jpg.

    - The default is .*?(?=_.*\.)
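For example, a lookahead regex in this style pulls the label out of a name like apple_20181010121212.jpg. The exact default pattern in config.yaml may differ from this reconstruction:

```python
import re

# lazily match everything before the first underscore, provided an
# extension dot follows somewhere after it
LABEL_RE = re.compile(r".*?(?=_.*\.)")

def label_of(filename: str) -> str:
    m = LABEL_RE.match(filename)
    return m.group(0) if m else ""
```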

    TestSetNum: This is an optional parameter that is used when you want to extract some of the test set

    - from the training set when you are not preparing the test set separately.

    SavedSteps: A Session.run() execution is called a step;

    - training progress is saved every SavedSteps steps. Default value is 100.

    ValidationSteps: Accuracy is calculated every ValidationSteps steps. Default value is 500.

    TestSetNum: The number of test sets, if an automatic allocation strategy is used (TestPath not set).

    EndAcc: Finish the training when the accuracy reaches [EndAcc*100]% and other conditions.

    EndCost: Finish the training when the cost reaches EndCost and other conditions.

    EndEpochs: Finish the training when the epoch is greater than the defined epoch and other conditions.

    BatchSize: Number of samples selected for one training step.

    TestBatchSize: Number of samples selected for one validation step.

    LearningRate: Recommended values: [0.01: MomentumOptimizer/AdamOptimizer, 0.001: AdaBoundOptimizer]

    Trains:
      TrainsPath: './dataset/mnist-CNN5BLSTM-H64-28x28_trains.tfrecords'
      TestPath: './dataset/mnist-CNN5BLSTM-H64-28x28test.tfrecords'
      DatasetPath: [ "D:/**" ]
      TrainRegex: '.?(?=)'
      TestSetNum: 300
      SavedSteps: 100
      ValidationSteps: 500
      EndAcc: 0.95
      EndCost: 0.1
      EndEpochs: 2
      BatchSize: 128
      TestBatchSize: 300
      LearningRate: 0.001
      DecayRate: 0.98
      DecaySteps: 10000

    Toolset: pre-processing preview tool (only supports viewing packaged training sets): python -m tools.preview
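The EndAcc, EndCost, and EndEpochs entries above each say "and other conditions"; reading them as jointly required gives a sketch like this (the AND combination is an assumption on my part, not confirmed by this excerpt):

```python
def should_stop(acc: float, cost: float, epoch: int,
                end_acc: float = 0.95, end_cost: float = 0.1,
                end_epochs: int = 2) -> bool:
    """Stop training only when the accuracy, cost, and epoch
    thresholds are all satisfied at the same time."""
    return acc >= end_acc and cost <= end_cost and epoch > end_epochs
```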

PyInstaller one-click packaging (for building a standalone deployment package):

pip install pyinstaller
python -m tools.package

Run: from the command line or a terminal, run python trains.py; or open the project in PyCharm and right-click → Run. For beginners: use the IDE to modify the tutorial.py configuration and run it; it applies the recommended configuration, packages the samples, and trains, all in one go.

Detailed guide: a previous article written specifically for this project; everyone is welcome to comment:

https://www.jianshu.com/p/80ef04b16efc

debdutgoswami commented 4 years ago

Thank you so much man. This was a great help. I would definitely go through this. Thanks again for the help.

sparkingdark commented 4 years ago

welcome