apm1467 / videocr

Extract hardcoded subtitles from videos using machine learning
MIT License

ValueError: invalid literal for int() with base 10 #28

Open bzh4bzh opened 3 years ago

bzh4bzh commented 3 years ago

ValueError: 'invalid literal for int() with base 10: \'"₪ץ\'' (several different words get caught here)

function get_subtitles in api.py at line 11
    v.run_ocr(lang, time_start, time_end, conf_threshold, use_fullframe)
function run_ocr in video.py at line 52
    for i, data in enumerate(it_ocr)
function in video.py at line 52
    for i, data in enumerate(it_ocr)
function __init__ in models.py at line 32
    block_num, conf = int(block_num), int(conf)

MadHundred commented 3 years ago

Same problem with Python 3.8 and tesseract-ocr-w64-v5.0.0-alpha.20201127.

HarryRudolph commented 3 years ago

I had the same problem, and it seemed to occur only when reading Hebrew. Could it be a right-to-left issue?

MadHundred commented 3 years ago

@HarryRudolph Yes, I think it's a problem with right-to-left languages. Error log: ValueError: invalid literal for int() with base 10: 'ارره'

ارره is a Persian word, so there seems to be a problem with RTL languages.

Code: print(get_subtitles('video.mp4', lang='fas', sim_threshold=70, conf_threshold=65))

MadHundred commented 3 years ago

Debug output from models.py, with a print of word_data added:

            word_data = l.split()
            print(word_data)  # <-- this line added
            if len(word_data) < 12:

These are the last lines printed before the error:

['4', '1', '1', '1', '2', '0', '607', '76', '111', '74', '-1']

['5', '1', '1', '1', '2', '1', '607', '76', '111', '97', '20', '4']

['4', '1', '1', '1', '3', '0', '217', '169', '486', '71', '-1']

['5', '1', '1', '1', '3', '1', '306', '162', '212', '78', '1', 'لارنج']

['5', '1', '1', '1', '3', '2', '191', '189', '100', '51', '0', 'ارره', '\u200f']

The program crashes when word_data has 13 columns instead of 12, so I added a skip for rows with more than 12 columns:

if len(word_data) > 12:
     continue

The program then runs to the end, but the final result is only half a line.
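
The crash can be reproduced without Tesseract. The sketch below takes the 13-column row from the debug output above and applies the same star-unpacking pattern that this thread attributes to models.py. It is only an illustration of how the fields shift, not the project's actual code path.

```python
# Last row from the debug output above: 13 fields instead of 12.
word_data = ['5', '1', '1', '1', '3', '2', '191', '189', '100', '51', '0',
             'ارره', '\u200f']

# Star-unpacking: the last two items are bound to conf and text,
# whatever the length of the list.
_, _, block_num, *_, conf, text = word_data

print(repr(conf))  # the OCRed word, not a confidence value
print(repr(text))  # the trailing right-to-left mark

# int() on the shifted conf value raises the ValueError from the report.
try:
    int(conf)
    error = None
except ValueError as exc:
    error = str(exc)
print(error)
```

With 13 fields, the real confidence ('0') is swallowed by `*_`, which is why skipping these rows avoids the crash but also drops the words they contain.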

PlaylistsTrance commented 3 years ago

In models.py, replace line 32, block_num, conf = int(block_num), int(conf), with block_num, conf = int(block_num), int(float(conf)). The issue is that conf is a string holding a float value, which int() cannot parse directly. float(conf) first converts the string to a float, and int() can then convert that float to an integer.
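
A minimal sketch of that conversion, using a made-up confidence string ("96.5" is an assumption for illustration, not a value taken from this thread):

```python
conf = "96.5"  # hypothetical float-formatted confidence string

# int() rejects a string that contains a decimal point.
try:
    int(conf)
    direct_ok = True
except ValueError:
    direct_ok = False

# Going through float() first works; int() then truncates toward zero.
converted = int(float(conf))

print(direct_ok)   # False
print(converted)   # 96

# Integer-formatted strings such as "-1" are unaffected by the change.
print(int(float("-1")))  # -1
```

Note that this only helps when conf really holds a number; it does not help if another field has ended up in conf.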

HarryRudolph commented 3 years ago

@PlaylistsTrance Your solution leads to this error:

block_num, conf = int(block_num), int(float(conf))
ValueError: could not convert string to float: 'שם'

It seems that for some reason the OCRed text is being stored in conf? I am assuming this is incorrect and that conf should be storing an integer/float representing percentage confidence.

Maybe the assignment on line 31 of models.py is getting confused by the right-to-left text: _, _, block_num, *_, conf, text = word_data

MadHundred commented 3 years ago

@HarryRudolph I've checked the parameters returned by Tesseract, and there seem to be just two problems:

Problem 1: For RTL languages we get one extra parameter that indicates the text is RTL, so some word_data rows have 13 parameters instead of 12. Adding the following lines after

            if len(word_data) < 12:
                # no word is predicted
                continue

solves this:

            if len(word_data) == 13:
                _, _, block_num, *_, conf, text, _ = word_data
            else:
                _, _, block_num, *_, conf, text = word_data
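
Checking for exactly 13 columns covers the rows seen so far, but a word followed by more than one extra token would still shift the fields. As a hedged alternative, the split itself can be capped at 12 fields so that anything after the eleventh separator stays in the text field. This sketch assumes the extra column comes from whitespace inside the recognized text, which this thread does not confirm; `split_row` and the sample line are hypothetical.

```python
from typing import List


def split_row(line: str, n_fields: int = 12) -> List[str]:
    """Split a row into at most n_fields items.

    Any further whitespace-separated tokens remain part of the last field.
    """
    return line.split(maxsplit=n_fields - 1)


# Hypothetical raw line, rebuilt from the 13-column row shown earlier.
line = "5\t1\t1\t1\t3\t2\t191\t189\t100\t51\t0\tارره \u200f"

word_data = split_row(line)
print(len(word_data))  # 12

if len(word_data) >= 12:
    _, _, block_num, *_, conf, text = word_data
    print(block_num, conf, repr(text))
```

Here conf keeps the numeric value and the right-to-left mark stays attached to the word, where it can be stripped later if needed.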

Problem 2: Some lines have a confidence value formatted as a float string, which causes the "invalid literal for int() with base 10" error. To solve this, I added a helper (is_float) after __init__ that checks whether conf can be parsed as a float:

        def is_float(value):
            try:
                float(value)
                return True
            except ValueError:
                # value is not a numeric string
                return False

Then replace block_num, conf = int(block_num), int(conf) with the code below:

            if is_float(conf):
                block_num, conf = int(block_num), int(float(conf))
            else:
                block_num, conf = int(block_num), int(conf)

Result: The program runs without any error. I have only tested this with Arabic and Persian, though, and Tesseract does not seem to OCR them well, so the result is not what I want. Please test it on other languages, such as Hebrew, and report back.
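
Both fixes can also be folded into one helper that works on the already-split word_data list. This is a sketch under assumptions, not the project's code: `parse_word_row` is a hypothetical name, the column positions (block_num at index 2, conf at index 10, text from index 11 on) are inferred from the rows printed in this thread, and the last sample row is made up.

```python
from typing import List, Optional, Tuple


def parse_word_row(word_data: List[str]) -> Optional[Tuple[int, int, str]]:
    """Return (block_num, conf, text), or None if the row has no usable word."""
    if len(word_data) < 12:
        return None  # no word is predicted
    block_num, conf = word_data[2], word_data[10]
    # Re-join any extra trailing tokens instead of discarding them.
    text = " ".join(word_data[11:])
    try:
        return int(block_num), int(float(conf)), text
    except ValueError:
        return None  # a numeric field could not be parsed


rows = [
    ['4', '1', '1', '1', '3', '0', '217', '169', '486', '71', '-1'],
    ['5', '1', '1', '1', '3', '1', '306', '162', '212', '78', '1', 'لارنج'],
    ['5', '1', '1', '1', '3', '2', '191', '189', '100', '51', '0', 'ارره', '\u200f'],
    ['5', '1', '1', '1', '3', '3', '10', '20', '30', '40', '96.5', 'word'],  # made-up row
]

results = [parse_word_row(row) for row in rows]
for result in results:
    print(result)
```

Indexing by position avoids the field shift for rows of any length, and a single int(float(conf)) call handles both integer and float confidence strings without a separate is_float check.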