YutingLi0606 / HTR-VT

(Pattern Recognition) PyTorch implementation of “HTR-VT: Handwritten Text Recognition with Vision Transformer”
https://yutingli0606.github.io/HTR-VT/

Can't install under miniconda #3

Open johnlockejrr opened 2 weeks ago

johnlockejrr commented 2 weeks ago
Pip subprocess error:
ERROR: Ignored the following versions that require a different python version: 0.22.0 Requires-Python >=3.9; 0.22.0rc1 Requires-Python >=3.9; 0.23.0 Requires-Python >=3.10; 0.23.0rc0 Requires-Python >=3.10; 0.23.0rc2 Requires-Python >=3.10; 0.23.1 Requires-Python >=3.10; 0.23.2 Requires-Python >=3.10; 0.23.2rc1 Requires-Python >=3.10; 0.24.0 Requires-Python >=3.9; 0.24.0rc1 Requires-Python >=3.9; 0.25.0rc0 Requires-Python >=3.10; 0.25.0rc1 Requires-Python >=3.10; 1.11.0 Requires-Python <3.13,>=3.9; 1.11.0rc1 Requires-Python <3.13,>=3.9; 1.11.0rc2 Requires-Python <3.13,>=3.9; 1.11.1 Requires-Python <3.13,>=3.9; 1.11.2 Requires-Python <3.13,>=3.9; 1.11.3 Requires-Python <3.13,>=3.9; 1.11.4 Requires-Python >=3.9; 1.12.0 Requires-Python >=3.9; 1.12.0rc1 Requires-Python >=3.9; 1.12.0rc2 Requires-Python >=3.9; 1.13.0 Requires-Python >=3.9; 1.13.0rc1 Requires-Python >=3.9; 1.13.1 Requires-Python >=3.9; 1.14.0 Requires-Python >=3.10; 1.14.0rc1 Requires-Python >=3.10; 1.14.0rc2 Requires-Python >=3.10; 1.14.1 Requires-Python >=3.10; 1.2.0 Requires-Python >=3.9; 1.2.1 Requires-Python >=3.9; 1.2.1rc1 Requires-Python >=3.9; 1.25.0 Requires-Python >=3.9; 1.25.1 Requires-Python >=3.9; 1.25.2 Requires-Python >=3.9; 1.26.0 Requires-Python <3.13,>=3.9; 1.26.1 Requires-Python <3.13,>=3.9; 1.26.2 Requires-Python >=3.9; 1.26.3 Requires-Python >=3.9; 1.26.4 Requires-Python >=3.9; 1.3.0 Requires-Python >=3.9; 1.5.0 Requires-Python >=3.9; 1.6.0 Requires-Python >=3.9; 1.6.0rc1 Requires-Python >=3.9; 1.7.0 Requires-Python >=3.10; 11.0.0 Requires-Python >=3.9; 2.0.0 Requires-Python >=3.9; 2.0.1 Requires-Python >=3.9; 2.0.2 Requires-Python >=3.9; 2.1.0 Requires-Python >=3.10; 2.1.0rc1 Requires-Python >=3.10; 2.1.1 Requires-Python >=3.10; 2.1.2 Requires-Python >=3.10; 2.1.3 Requires-Python >=3.10; 2.14.1 Requires-Python >=3.9; 2.15.0 Requires-Python >=3.9; 2.15.1 Requires-Python >=3.9; 2.15.2 Requires-Python >=3.9; 2.16.0 Requires-Python >=3.9; 2.16.1 Requires-Python >=3.9; 2.16.2 Requires-Python >=3.9; 2.17.0 Requires-Python >=3.9; 2.17.1 Requires-Python >=3.9; 2.18.0 Requires-Python >=3.9; 2.36.0 Requires-Python >=3.9; 2023.12.9 Requires-Python >=3.9; 2023.7.18 Requires-Python >=3.9; 2023.8.12 Requires-Python >=3.9; 2023.8.25 Requires-Python >=3.9; 2023.8.30 Requires-Python >=3.9; 2023.9.18 Requires-Python >=3.9; 2023.9.26 Requires-Python >=3.9; 2024.1.30 Requires-Python >=3.9; 2024.2.12 Requires-Python >=3.9; 2024.4.18 Requires-Python >=3.9; 2024.4.24 Requires-Python >=3.9; 2024.5.10 Requires-Python >=3.9; 2024.5.22 Requires-Python >=3.9; 2024.5.3 Requires-Python >=3.9; 2024.6.18 Requires-Python >=3.9; 2024.7.2 Requires-Python >=3.9; 2024.7.21 Requires-Python >=3.9; 2024.7.24 Requires-Python >=3.9; 2024.8.10 Requires-Python >=3.9; 2024.8.24 Requires-Python >=3.9; 2024.8.28 Requires-Python >=3.9; 2024.8.30 Requires-Python >=3.9; 2024.9.20 Requires-Python >=3.10; 3.0.0 Requires-Python >=3.9; 3.0.1 Requires-Python >=3.9; 3.0.2 Requires-Python >=3.9; 3.10.0rc1 Requires-Python >=3.10; 3.2 Requires-Python >=3.9; 3.2.0 Requires-Python >=3.9; 3.2.0b1 Requires-Python >=3.9; 3.2.0b2 Requires-Python >=3.9; 3.2.0b3 Requires-Python >=3.9; 3.2.0rc1 Requires-Python >=3.9; 3.2.1 Requires-Python >=3.9; 3.2rc0 Requires-Python >=3.9; 3.3 Requires-Python >=3.10; 3.3rc0 Requires-Python >=3.10; 3.4 Requires-Python >=3.10; 3.4.1 Requires-Python >=3.10; 3.4.2 Requires-Python >=3.10; 3.4rc0 Requires-Python >=3.10; 3.8.0 Requires-Python >=3.9; 3.8.0rc1 Requires-Python >=3.9; 3.8.1 Requires-Python >=3.9; 3.8.2 Requires-Python >=3.9; 3.8.3 
Requires-Python >=3.9; 3.8.4 Requires-Python >=3.9; 3.9.0 Requires-Python >=3.9; 3.9.0rc2 Requires-Python >=3.9; 3.9.1 Requires-Python >=3.9; 3.9.1.post1 Requires-Python >=3.9; 3.9.2 Requires-Python >=3.9
ERROR: Could not find a version that satisfies the requirement torch==1.13.0+cu116 (from versions: 1.4.0, 1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2, 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.4.0, 2.4.1)
ERROR: No matching distribution found for torch==1.13.0+cu116

failed

CondaEnvException: Pip failed
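
For reference, the '+cu116' builds are not published on PyPI; they are hosted only on the PyTorch wheel index, so the original pin can only resolve with the extra index URL and a Python version those wheels support (3.7 to 3.10). The log above also shows the environment resolved with Python 3.8, which is why every release requiring >=3.9 was ignored. Something like:

pip install torch==1.13.0+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
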
johnlockejrr commented 2 weeks ago

Changed 'torch==1.13.0+cu116' to 'torch>=1.13' etc. and it installed. Now it has been hanging for about 40 minutes with my GPU at 100% on:

2024-11-06 14:18:03,658 INFO total_param is 53486096
2024-11-06 14:18:04,048 INFO Loading train loader...
2024-11-06 14:18:04,949 INFO Loading val loader...

johnlockejrr commented 2 weeks ago

Lowered the train/eval batch sizes a little and the trainer started quickly.
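
The batch sizes are exposed as command-line flags (the same --train-bs and --val-bs flags appear in the test command quoted further down); assuming the training entry point takes the same arguments, something along these lines should reduce memory pressure (the values here are illustrative):

python3 train.py --exp-name read --train-bs 64 --val-bs 4 READ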

johnlockejrr commented 2 weeks ago

How to interfere with the model after training?

johnlockejrr commented 2 weeks ago

I'm here right now, should I stop training or keep it going?

2024-11-07 12:40:19,611 INFO Val. loss : 4.357   CER : 0.1706    WER : 0.4524
2024-11-07 12:44:33,318 INFO Iter : 28100        LR : 0.00081    training loss : 21.38781
2024-11-07 12:48:47,350 INFO Iter : 28200        LR : 0.00081    training loss : 20.09475
2024-11-07 12:53:00,896 INFO Iter : 28300        LR : 0.00081    training loss : 21.57617
2024-11-07 12:57:15,007 INFO Iter : 28400        LR : 0.00081    training loss : 21.11300
2024-11-07 13:01:28,937 INFO Iter : 28500        LR : 0.00081    training loss : 19.47327
2024-11-07 13:05:43,701 INFO Iter : 28600        LR : 0.00081    training loss : 21.46478
2024-11-07 13:09:57,206 INFO Iter : 28700        LR : 0.00081    training loss : 20.34710
2024-11-07 13:14:11,406 INFO Iter : 28800        LR : 0.00081    training loss : 20.37044
2024-11-07 13:18:25,441 INFO Iter : 28900        LR : 0.00080    training loss : 21.23569
2024-11-07 13:22:40,404 INFO Iter : 29000        LR : 0.00080    training loss : 20.43095
2024-11-07 13:25:10,455 INFO WER improved from 0.4505 to 0.4501!!!
2024-11-07 13:25:11,027 INFO Val. loss : 4.360   CER : 0.1713    WER : 0.4501
2024-11-07 13:29:24,362 INFO Iter : 29100        LR : 0.00080    training loss : 19.21192
2024-11-07 13:33:38,064 INFO Iter : 29200        LR : 0.00080    training loss : 18.39133
2024-11-07 13:37:51,473 INFO Iter : 29300        LR : 0.00080    training loss : 21.14774
2024-11-07 13:42:03,594 INFO Iter : 29400        LR : 0.00080    training loss : 19.76837
2024-11-07 13:46:16,520 INFO Iter : 29500        LR : 0.00080    training loss : 18.73728
2024-11-07 13:50:30,535 INFO Iter : 29600        LR : 0.00080    training loss : 21.03761
2024-11-07 13:54:44,174 INFO Iter : 29700        LR : 0.00079    training loss : 21.91326
2024-11-07 13:58:57,444 INFO Iter : 29800        LR : 0.00079    training loss : 19.83856
2024-11-07 14:03:12,634 INFO Iter : 29900        LR : 0.00079    training loss : 19.21149
2024-11-07 14:07:25,860 INFO Iter : 30000        LR : 0.00079    training loss : 18.59232
2024-11-07 14:09:51,321 INFO Val. loss : 4.206   CER : 0.1677    WER : 0.4525
2024-11-07 14:14:04,789 INFO Iter : 30100        LR : 0.00079    training loss : 20.52481
2024-11-07 14:18:18,877 INFO Iter : 30200        LR : 0.00079    training loss : 21.18321
2024-11-07 14:22:32,695 INFO Iter : 30300        LR : 0.00079    training loss : 21.03689
2024-11-07 14:26:47,309 INFO Iter : 30400        LR : 0.00078    training loss : 21.53148
2024-11-07 14:31:01,122 INFO Iter : 30500        LR : 0.00078    training loss : 19.73960
2024-11-07 14:35:14,995 INFO Iter : 30600        LR : 0.00078    training loss : 20.60052
2024-11-07 14:39:29,221 INFO Iter : 30700        LR : 0.00078    training loss : 20.76044
2024-11-07 14:43:43,012 INFO Iter : 30800        LR : 0.00078    training loss : 19.76154
2024-11-07 14:47:59,959 INFO Iter : 30900        LR : 0.00078    training loss : 21.51893
johnlockejrr commented 2 weeks ago

Any help on how to interfere with the trained model? I know it is a ViT model, but as far as I can see it is "modded".

YutingLi0606 commented 2 weeks ago

Hi, thank you for your interest in our project!

I'm glad to hear that you've already solved some problems. However, I didn't quite understand your question about 'how to interfere.' Did you mean 'how to run inference after training'? If so, you can run test.py using the command in the read.sh file:

python3 test.py --exp-name read \
                --max-lr 1e-3 \
                --train-bs 128 \
                --val-bs 8 \
                --weight-decay 0.5 \
                --mask-ratio 0.4 \
                --attn-mask-ratio 0.1 \
                --max-span-length 8 \
                --img-size 512 64 \
                --proj 8 \
                --dila-ero-max-kernel 2 \
                --dila-ero-iter 1 \
                --proba 0.5 \
                --alpha 1 \
                --total-iter 100000 \
                READ

Feel free to ask for help if you have any other questions~

Best, Yuting

johnlockejrr commented 2 weeks ago

Thank you for your reply! I mean: after training, how can I run inference with the model to recognize text from an image, not run the test set? Something like this (TrOCR in the example):

from transformers import TrOCRProcessor, VisionEncoderDecoderModel
import requests
from PIL import Image

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# load image from the IAM dataset
url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

pixel_values = processor(image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

johnlockejrr commented 2 weeks ago

I came up with this code, but I'm not sure it's correct. Should ralph contain all the characters in the train dataset? I find it difficult because the model.HTR_VT module only has create_model, not something like load_model.

import torch
import argparse
import os
import re
from PIL import Image
from collections import OrderedDict
from utils import utils
from model import HTR_VT
from torchvision import transforms

def load_model(model_path, device, nb_cls=92, img_size=(512, 64)):
    # Initialize the model
    model = HTR_VT.create_model(nb_cls=nb_cls, img_size=img_size[::-1])

    # Load the checkpoint
    ckpt = torch.load(model_path, map_location=device)
    model_dict = OrderedDict()
    pattern = re.compile(r'^module\.')  # strip the DataParallel prefix, if present

    # Process the checkpoint keys to match the model's
    for k, v in ckpt['state_dict_ema'].items():
        model_dict[pattern.sub('', k)] = v

    # Filter out incompatible keys. Note: with strict=False, a mismatched
    # checkpoint is skipped silently and the model keeps its random init,
    # so check that pretrained_dict is not empty.
    pretrained_dict = {k: v for k, v in model_dict.items() if k in model.state_dict() and model.state_dict()[k].shape == v.shape}
    model.load_state_dict(pretrained_dict, strict=False)
    model = model.to(device)
    model.eval()
    return model

def preprocess_image(image_path, img_size=(512, 64)):
    # Load the image
    image = Image.open(image_path).convert('L')  # Convert to grayscale

    # Resize the image
    image = image.resize(img_size)

    # Convert image to tensor and normalize
    transform = transforms.Compose([
        transforms.ToTensor(),  # Convert to Tensor (scales values to [0, 1])
        transforms.Normalize(mean=[0.5], std=[0.5])  # Normalize to [-1, 1] (optional, can adjust as needed)
    ])

    image_tensor = transform(image).unsqueeze(0)  # Add batch dimension
    return image_tensor

def infer_text(model, image_tensor, device, converter):
    image_tensor = image_tensor.to(device)
    with torch.no_grad():
        preds = model(image_tensor)

    preds = preds.permute(1, 0, 2).contiguous()  # Adjust dimensions for decoding
    _, preds_index = preds.max(2)

    # Assume length is the maximum time steps for each item in the batch
    length = [preds_index.size(0)] * preds_index.size(1)

    # Decode the predictions
    preds_str = converter.decode(preds_index, length)
    return preds_str

def main():
    parser = argparse.ArgumentParser(description="HTR_VT Inference")
    parser.add_argument('--model-path', type=str, required=True, help="Path to the trained model .pth file")
    parser.add_argument('--image-path', type=str, required=True, help="Path to the input image for inference")
    args = parser.parse_args()

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    # Load the model
    model = load_model(args.model_path, device)

    # Character set for decoding: it must contain exactly the characters seen
    # at training time, in the same order, or the class indices will not line up
    ralph = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"  # example only
    converter = utils.CTCLabelConverter(ralph)

    # Preprocess the image
    image_tensor = preprocess_image(args.image_path)

    # Inference
    recognized_text = infer_text(model, image_tensor, device, converter)

    print("Recognized Text:", recognized_text)

if __name__ == '__main__':
    main()

johnlockejrr commented 2 weeks ago

I believe I'm totally wrong. I ran my script with your trained model on one of your test images and...

(htr) incognito@DESKTOP-H1BS9PO:~/HTR-VT$ python infere.py --model-path YutingLi0606-best_CER.pth --image-path htr.png
infere.py:16: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  ckpt = torch.load(model_path, map_location=device)
Recognized Text: ['lwlglolUlwlKlklgKgllxlglllUlelgl0lgwjwlTlglxlglll']
[image: htr.png]

johnlockejrr commented 2 weeks ago

Update:

Definitely I'm doing something wrong. Every time I run the script I get a different output:

(htr) incognito@DESKTOP-H1BS9PO:~/HTR-VT$ python infere.py --model-path YutingLi0606-best_CER.pth --image-path htr.png
Recognized Text: ['Quv01cVvcVCnz51vcv9']
(htr) incognito@DESKTOP-H1BS9PO:~/HTR-VT$ python infere.py --model-path YutingLi0606-best_CER.pth --image-path htr.png
Recognized Text: ['zpzSz4zzLzxzzfzjzLzTzzjzozLzWzzLxLzzzxz4']
(htr) incognito@DESKTOP-H1BS9PO:~/HTR-VT$ vi infere.py
(htr) incognito@DESKTOP-H1BS9PO:~/HTR-VT$ python infere.py --model-path YutingLi0606-best_CER.pth --image-path htr.png
Recognized Text: ['1N1a11G19111F11L1F141G11K111L11aA']
(htr) incognito@DESKTOP-H1BS9PO:~/HTR-VT$ python infere.py --model-path YutingLi0606-best_CER.pth --image-path htr.png
Recognized Text: ['FETxHmCfQPQPgvQxC']
(htr) incognito@DESKTOP-H1BS9PO:~/HTR-VT$ python infere.py --model-path YutingLi0606-best_CER.pth --image-path htr.png
Recognized Text: ['yyxyAyCyy3yy3yryyyryCymywyByyyyxy']
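
A different transcription on every run is a strong hint that the checkpoint weights never actually loaded: create_model initializes the network randomly each time, and with strict=False plus the shape filter in load_model above, a key or shape mismatch fails silently. A quick sanity check, sketched with the names from the script above:

matched = {k: v for k, v in model_dict.items()
           if k in model.state_dict() and model.state_dict()[k].shape == v.shape}
print(f"matched {len(matched)} of {len(model.state_dict())} model keys")
# load_state_dict returns the keys it could not fill; both lists should be empty
missing, unexpected = model.load_state_dict(matched, strict=False)
print("missing:", missing, "unexpected:", unexpected)

If "matched" is zero or near zero (for instance because nb_cls=92 does not match the checkpoint's output head), the model runs on fresh random weights, which would produce exactly this behavior.
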
johnlockejrr commented 1 week ago

I could use some help.

johnlockejrr commented 1 week ago

Ok, with such support I think I just have to move on...

YutingLi0606 commented 4 days ago

Hi, in valid.py, line 42 is preds_str = converter.decode(preds_index.data, preds_size.data). You can try printing preds_str.

Hope it helps, Yuting
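
For reference, a greedy CTC decode along the lines of that valid.py call might look like this; a sketch assuming the model output has shape (batch, time, classes) and that preds_size is the time length repeated per batch item:

with torch.no_grad():
    preds = model(image_tensor)  # assumed shape: (B, T, nb_cls)
preds_size = torch.IntTensor([preds.size(1)] * preds.size(0))
_, preds_index = preds.max(2)  # greedy: best class per time step
preds_str = converter.decode(preds_index.data, preds_size.data)
print(preds_str)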

johnlockejrr commented 10 hours ago

Isn't there a script that just takes an image and recognizes the text with the newly trained model? I don't want to re-validate, I just want to infer, to actually use the model. It is hard to use because it is not a modified ResNet-18 model.

A simple sample of code on how to load and use the model?
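
For anyone landing here, a minimal end-to-end sketch combining the pieces above. The assumptions are marked in comments: the checkpoint stores weights under 'state_dict_ema', the alphabet string exactly matches the training charset, and the converter adds the CTC blank itself (hence nb_cls = len(ralph) + 1):

import torch
from PIL import Image
from torchvision import transforms
from model import HTR_VT
from utils import utils

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Assumption: ralph must be the exact training alphabet, in the same order.
# 'train_charset.txt' is a hypothetical file holding that string.
ralph = open('train_charset.txt').read().rstrip('\n')
converter = utils.CTCLabelConverter(ralph)

model = HTR_VT.create_model(nb_cls=len(ralph) + 1, img_size=(64, 512))
ckpt = torch.load('best_CER.pth', map_location=device)
state = {k.removeprefix('module.'): v for k, v in ckpt['state_dict_ema'].items()}
model.load_state_dict(state)  # strict by default: fails loudly on any mismatch
model.to(device).eval()

# Same preprocessing as the script above: grayscale, 512x64, normalized to [-1, 1].
img = Image.open('htr.png').convert('L').resize((512, 64))
x = transforms.Normalize([0.5], [0.5])(transforms.ToTensor()(img)).unsqueeze(0)

with torch.no_grad():
    preds = model(x.to(device))  # assumed output shape: (B, T, nb_cls)
preds_size = torch.IntTensor([preds.size(1)] * preds.size(0))
_, preds_index = preds.max(2)
print(converter.decode(preds_index.data, preds_size.data))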