Open johnlockejrr opened 2 weeks ago
I changed 'torch==1.13.0+cu116' to 'torch>=1.13' etc. and it installed. Now it has been hanging for about 40 minutes, with my GPU at 100%, at:
2024-11-06 14:18:03,658 INFO total_param is 53486096
2024-11-06 14:18:04,048 INFO Loading train loader...
2024-11-06 14:18:04,949 INFO Loading val loader...
I lowered the train/eval batch size a little and the trainer started quickly.
How do I interfere with the model after training?
This is where I am right now. Should I stop training or keep it going?
2024-11-07 12:40:19,611 INFO Val. loss : 4.357 CER : 0.1706 WER : 0.4524
2024-11-07 12:44:33,318 INFO Iter : 28100 LR : 0.00081 training loss : 21.38781
2024-11-07 12:48:47,350 INFO Iter : 28200 LR : 0.00081 training loss : 20.09475
2024-11-07 12:53:00,896 INFO Iter : 28300 LR : 0.00081 training loss : 21.57617
2024-11-07 12:57:15,007 INFO Iter : 28400 LR : 0.00081 training loss : 21.11300
2024-11-07 13:01:28,937 INFO Iter : 28500 LR : 0.00081 training loss : 19.47327
2024-11-07 13:05:43,701 INFO Iter : 28600 LR : 0.00081 training loss : 21.46478
2024-11-07 13:09:57,206 INFO Iter : 28700 LR : 0.00081 training loss : 20.34710
2024-11-07 13:14:11,406 INFO Iter : 28800 LR : 0.00081 training loss : 20.37044
2024-11-07 13:18:25,441 INFO Iter : 28900 LR : 0.00080 training loss : 21.23569
2024-11-07 13:22:40,404 INFO Iter : 29000 LR : 0.00080 training loss : 20.43095
2024-11-07 13:25:10,455 INFO WER improved from 0.4505 to 0.4501!!!
2024-11-07 13:25:11,027 INFO Val. loss : 4.360 CER : 0.1713 WER : 0.4501
2024-11-07 13:29:24,362 INFO Iter : 29100 LR : 0.00080 training loss : 19.21192
2024-11-07 13:33:38,064 INFO Iter : 29200 LR : 0.00080 training loss : 18.39133
2024-11-07 13:37:51,473 INFO Iter : 29300 LR : 0.00080 training loss : 21.14774
2024-11-07 13:42:03,594 INFO Iter : 29400 LR : 0.00080 training loss : 19.76837
2024-11-07 13:46:16,520 INFO Iter : 29500 LR : 0.00080 training loss : 18.73728
2024-11-07 13:50:30,535 INFO Iter : 29600 LR : 0.00080 training loss : 21.03761
2024-11-07 13:54:44,174 INFO Iter : 29700 LR : 0.00079 training loss : 21.91326
2024-11-07 13:58:57,444 INFO Iter : 29800 LR : 0.00079 training loss : 19.83856
2024-11-07 14:03:12,634 INFO Iter : 29900 LR : 0.00079 training loss : 19.21149
2024-11-07 14:07:25,860 INFO Iter : 30000 LR : 0.00079 training loss : 18.59232
2024-11-07 14:09:51,321 INFO Val. loss : 4.206 CER : 0.1677 WER : 0.4525
2024-11-07 14:14:04,789 INFO Iter : 30100 LR : 0.00079 training loss : 20.52481
2024-11-07 14:18:18,877 INFO Iter : 30200 LR : 0.00079 training loss : 21.18321
2024-11-07 14:22:32,695 INFO Iter : 30300 LR : 0.00079 training loss : 21.03689
2024-11-07 14:26:47,309 INFO Iter : 30400 LR : 0.00078 training loss : 21.53148
2024-11-07 14:31:01,122 INFO Iter : 30500 LR : 0.00078 training loss : 19.73960
2024-11-07 14:35:14,995 INFO Iter : 30600 LR : 0.00078 training loss : 20.60052
2024-11-07 14:39:29,221 INFO Iter : 30700 LR : 0.00078 training loss : 20.76044
2024-11-07 14:43:43,012 INFO Iter : 30800 LR : 0.00078 training loss : 19.76154
2024-11-07 14:47:59,959 INFO Iter : 30900 LR : 0.00078 training loss : 21.51893
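One way to judge whether to keep training is to pull the validation lines out of the log and check how much CER and WER are still moving. A minimal sketch, assuming the log format shown above; `parse_val_metrics` and `sample_log` are illustrative names, not part of the HTR-VT repository:

```python
import re

# Matches lines such as:
# "2024-11-07 12:40:19,611 INFO Val. loss : 4.357 CER : 0.1706 WER : 0.4524"
VAL_RE = re.compile(
    r"Val\. loss : (?P<loss>[\d.]+)\s+CER : (?P<cer>[\d.]+)\s+WER : (?P<wer>[\d.]+)"
)


def parse_val_metrics(lines):
    """Return a list of (loss, cer, wer) tuples, one per validation line."""
    metrics = []
    for line in lines:
        match = VAL_RE.search(line)
        if match:
            metrics.append(
                (float(match["loss"]), float(match["cer"]), float(match["wer"]))
            )
    return metrics


# Lines copied from the log excerpt above (three validation lines, one training line).
sample_log = [
    "2024-11-07 12:40:19,611 INFO Val. loss : 4.357 CER : 0.1706 WER : 0.4524",
    "2024-11-07 12:44:33,318 INFO Iter : 28100 LR : 0.00081 training loss : 21.38781",
    "2024-11-07 13:25:11,027 INFO Val. loss : 4.360 CER : 0.1713 WER : 0.4501",
    "2024-11-07 14:09:51,321 INFO Val. loss : 4.206 CER : 0.1677 WER : 0.4525",
]

metrics = parse_val_metrics(sample_log)
best_cer = min(cer for _, cer, _ in metrics)
best_wer = min(wer for _, _, wer in metrics)
cer_spread = max(cer for _, cer, _ in metrics) - best_cer
print(f"best CER {best_cer:.4f}, best WER {best_wer:.4f}, CER spread {cer_spread:.4f}")
```

In this excerpt CER stays between 0.1677 and 0.1713 and WER between 0.4501 and 0.4525 across roughly 2,000 iterations, so the gains per validation step are small. Whether that justifies stopping depends on how much of the schedule remains and on how far the learning rate still has to decay.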
Any help on how to interfere with the trained model? I know it is a ViT model, but as far as I can see it is "modded".
Hi, thank you for your interest in our project!
I'm glad to hear that you've already solved some problems. However, I didn't quite understand your question about 'how to interfere.' Did you mean 'how to inference after training'? If so, you can run test.py using the command in the read.sh file:
python3 test.py --exp-name read \
    --max-lr 1e-3 \
    --train-bs 128 \
    --val-bs 8 \
    --weight-decay 0.5 \
    --mask-ratio 0.4 \
    --attn-mask-ratio 0.1 \
    --max-span-length 8 \
    --img-size 512 64 \
    --proj 8 \
    --dila-ero-max-kernel 2 \
    --dila-ero-iter 1 \
    --proba 0.5 \
    --alpha 1 \
    --total-iter 100000 \
    READ
Feel free to ask for help if you have any other questions~
Best, Yuting
Thank you for your reply! I mean: after training, how can I run inference with the model to recognize the text in a single image, rather than run the test set?
Something like this (TrOCR in this example):
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
import requests
from PIL import Image
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")
# load image from the IAM dataset
url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
I came up with this code, but I'm not sure it is correct. Should ralph contain all the characters in the training dataset?
I find it difficult because the model.HTR_VT library only has create_model, nothing like load_model.
import torch
import argparse
import os
import re
from PIL import Image
from collections import OrderedDict
from utils import utils
from model import HTR_VT
from torchvision import transforms


def load_model(model_path, device, nb_cls=92, img_size=(512, 64)):
    # Initialize the model
    model = HTR_VT.create_model(nb_cls=nb_cls, img_size=img_size[::-1])
    # Load the checkpoint
    ckpt = torch.load(model_path, map_location=device)
    model_dict = OrderedDict()
    pattern = re.compile('module.')
    # Process the checkpoint to match model keys
    for k, v in ckpt['state_dict_ema'].items():
        if re.search("module", k):
            model_dict[re.sub(pattern, '', k)] = v
        else:
            model_dict[k] = v
    # Filter out incompatible keys
    pretrained_dict = {k: v for k, v in model_dict.items() if k in model.state_dict() and model.state_dict()[k].shape == v.shape}
    model.load_state_dict(pretrained_dict, strict=False)  # strict=False allows skipping incompatible layers
    model = model.to(device)
    model.eval()
    return model


def preprocess_image(image_path, img_size=(512, 64)):
    # Load the image
    image = Image.open(image_path).convert('L')  # Convert to grayscale
    # Resize the image
    image = image.resize(img_size)
    # Convert image to tensor and normalize
    transform = transforms.Compose([
        transforms.ToTensor(),  # Convert to Tensor (scales values to [0, 1])
        transforms.Normalize(mean=[0.5], std=[0.5])  # Normalize to [-1, 1] (optional, can adjust as needed)
    ])
    image_tensor = transform(image).unsqueeze(0)  # Add batch dimension
    return image_tensor


def infer_text(model, image_tensor, device, converter):
    image_tensor = image_tensor.to(device)
    with torch.no_grad():
        preds = model(image_tensor)
        preds = preds.permute(1, 0, 2).contiguous()  # Adjust dimensions for decoding
        _, preds_index = preds.max(2)
        # Assume length is the maximum time steps for each item in the batch
        length = [preds_index.size(0)] * preds_index.size(1)
        # Decode the predictions
        preds_str = converter.decode(preds_index, length)
    return preds_str


def main():
    parser = argparse.ArgumentParser(description="HTR_VT Inference")
    parser.add_argument('--model-path', type=str, required=True, help="Path to the trained model .pth file")
    parser.add_argument('--image-path', type=str, required=True, help="Path to the input image for inference")
    args = parser.parse_args()
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    # Load the model
    model = load_model(args.model_path, device)
    # Convert characters for decoding
    ralph = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"  # Example character set
    converter = utils.CTCLabelConverter(ralph)
    # Preprocess the image
    image_tensor = preprocess_image(args.image_path)
    # Inference
    recognized_text = infer_text(model, image_tensor, device, converter)
    print("Recognized Text:", recognized_text)


if __name__ == '__main__':
    main()
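On the ralph question: a CTC decoder maps class indices back to characters, so the character list given to utils.CTCLabelConverter has to contain the same characters, in the same order, as the one used during training. Otherwise the indices decode to the wrong symbols. A minimal sketch of rebuilding such a list from the training transcriptions, assuming (this is not confirmed anywhere in this thread) that the training alphabet is the sorted set of characters found in the training labels. `build_alphabet` and the sample labels are hypothetical:

```python
def build_alphabet(labels):
    """Return the sorted list of distinct characters seen in the labels.

    Assumption: training built its alphabet the same way. If the training
    code orders characters differently, reuse that ordering instead.
    """
    return sorted(set("".join(labels)))


# Hypothetical transcriptions standing in for the real training labels.
train_labels = ["A MOVE to stop", "Mr. Gaitskell"]

alphabet = build_alphabet(train_labels)
index_to_char = dict(enumerate(alphabet))

# The hard-coded 62-character set from the script above, for comparison.
hardcoded = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
missing = [ch for ch in alphabet if ch not in hardcoded]
print("characters missing from the hard-coded set:", missing)
```

The hard-coded set has no space or punctuation, so even a correctly loaded model could not emit them through it. The nb_cls value passed to create_model has to agree with the training alphabet as well; how the CTC blank is counted in that number should be checked against the training arguments.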
I believe I'm totally wrong. I ran my script with your trained model on one of your test images and...
(htr) incognito@DESKTOP-H1BS9PO:~/HTR-VT$ python infere.py --model-path YutingLi0606-best_CER.pth --image-path htr.png
infere.py:16: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
ckpt = torch.load(model_path, map_location=device)
Recognized Text: ['lwlglolUlwlKlklgKgllxlglllUlelgl0lgwjwlTlglxlglll']
Update:
I'm definitely doing something wrong. Every time I run the script I get a different output:
(htr) incognito@DESKTOP-H1BS9PO:~/HTR-VT$ python infere.py --model-path YutingLi0606-best_CER.pth --image-path htr.png
Recognized Text: ['Quv01cVvcVCnz51vcv9']
(htr) incognito@DESKTOP-H1BS9PO:~/HTR-VT$ python infere.py --model-path YutingLi0606-best_CER.pth --image-path htr.png
Recognized Text: ['zpzSz4zzLzxzzfzjzLzTzzjzozLzWzzLxLzzzxz4']
(htr) incognito@DESKTOP-H1BS9PO:~/HTR-VT$ vi infere.py
(htr) incognito@DESKTOP-H1BS9PO:~/HTR-VT$ python infere.py --model-path YutingLi0606-best_CER.pth --image-path htr.png
Recognized Text: ['1N1a11G19111F11L1F141G11K111L11aA']
(htr) incognito@DESKTOP-H1BS9PO:~/HTR-VT$ python infere.py --model-path YutingLi0606-best_CER.pth --image-path htr.png
Recognized Text: ['FETxHmCfQPQPgvQxC']
(htr) incognito@DESKTOP-H1BS9PO:~/HTR-VT$ python infere.py --model-path YutingLi0606-best_CER.pth --image-path htr.png
Recognized Text: ['yyxyAyCyy3yy3yryyyryCymywyByyyyxy']
I could use some help.
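A different result on every run, with the same checkpoint and the same image, usually points to weights that were never loaded and therefore keep their random initialization. In the script above, strict=False combined with the shape filter silently drops any checkpoint tensor whose name or shape does not match the model, for example an output layer sized for a different nb_cls. Note also that re.sub with the pattern 'module.' treats the dot as "any character" and replaces every occurrence, not just a leading prefix. A sketch of making those drops visible; it works on plain name-to-shape dictionaries so it runs without PyTorch, and the key names and shapes are made up for illustration:

```python
def strip_prefix(key, prefix="module."):
    """Remove a leading DataParallel-style prefix, and only a leading one."""
    return key[len(prefix):] if key.startswith(prefix) else key


def find_skipped(ckpt_shapes, model_shapes):
    """Return keys a shape-filtered, non-strict load would leave at random init.

    Both arguments map parameter names to shape tuples. With real tensors they
    can be built as {k: tuple(v.shape) for k, v in state_dict.items()}.
    """
    ckpt = {strip_prefix(k): shape for k, shape in ckpt_shapes.items()}
    missing = sorted(k for k in model_shapes if k not in ckpt)
    mismatched = sorted(
        k for k in model_shapes if k in ckpt and ckpt[k] != model_shapes[k]
    )
    return missing, mismatched


# Hypothetical shapes: the checkpoint head has 80 classes, the model was built with 92.
ckpt_shapes = {"module.encoder.weight": (768, 768), "module.head.weight": (80, 768)}
model_shapes = {"encoder.weight": (768, 768), "head.weight": (92, 768)}

missing, mismatched = find_skipped(ckpt_shapes, model_shapes)
print("missing:", missing)
print("shape mismatch:", mismatched)
```

If either list is non-empty, it is probably safer to fix the model arguments and load with strict=True, so that PyTorch raises an error on any mismatch instead of skipping it.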
Ok, with such support I think I just have to move on...
Hi, in valid.py, line 42 is: preds_str = converter.decode(preds_index.data, preds_size.data). You can try printing preds_str.
Hope it helps, Yuting
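For reference, the greedy CTC decoding that a converter.decode call of this kind typically performs can be sketched in a few lines: take the best class per time step, collapse consecutive repeats, then drop the blank. The sketch assumes index 0 is the blank and that the character at position p of the alphabet has index p + 1. That is a common convention, but it should be checked against utils.CTCLabelConverter; `greedy_ctc_decode` is an illustrative name:

```python
def greedy_ctc_decode(indices, alphabet, blank=0):
    """Collapse repeated indices, remove blanks, and map the rest to characters.

    `indices` is the per-time-step argmax for ONE image, as a list of ints.
    Index 0 is assumed to be the CTC blank; index i > 0 maps to alphabet[i - 1].
    """
    chars = []
    previous = None
    for idx in indices:
        # Emit a character only when it is not blank and not a repeat.
        if idx != blank and idx != previous:
            chars.append(alphabet[idx - 1])
        previous = idx
    return "".join(chars)


alphabet = "abc"
# Time steps: a a blank a b b blank c  ->  "aabc"
print(greedy_ctc_decode([1, 1, 0, 1, 2, 2, 0, 3], alphabet))
```

The blank between the two runs of "a" is what lets CTC output a doubled letter; without it the repeats would collapse into a single character.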
Isn't there a script that just takes an image and recognizes the text with the newly trained model? I don't want to re-validate, I just want to run inference, to use the model. It is hard to use because it is not a modified ResNet-18 model.
Could you share a simple code sample showing how to load and use the model?