apm1467 / videocr

Extract hardcoded subtitles from videos using machine learning
MIT License
506 stars, 117 forks

urllib.error.HTTPError: HTTP Error 404: Not Found #36

Open oregonpillow opened 2 years ago

oregonpillow commented 2 years ago
Traceback (most recent call last):
  File "run.py", line 7, in <module>
    print(get_subtitles(video, lang='chi_sim+eng', sim_threshold=70, conf_threshold=65))
  File "/home/ubuntu/Github/videocr/env/lib/python3.8/site-packages/videocr/api.py", line 8, in get_subtitles
    utils.download_lang_data(lang)
  File "/home/ubuntu/Github/videocr/env/lib/python3.8/site-packages/videocr/utils.py", line 21, in download_lang_data
    with urlopen(url) as res, open(filepath, 'w+b') as f:
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Not sure why this is happening; I'm guessing it's a version problem. I'm trying to run the example code with my own video (full system path specified).

@apm1467 Any chance you could provide the exact Python and Tesseract versions you used successfully?

Mschul commented 2 years ago

I am facing the exact same problem. Since the paths changed, I also tried fixing the URLs referenced in constants.py to

TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/raw/master/{}.traineddata'

TESSDATA_SCRIPT_URL = 'https://github.com/tesseract-ocr/tessdata_best/raw/master/script/{}.traineddata'

but that didn't solve the problem.

Mageikk commented 2 years ago

I had the same issue, and I don't know how to fix the automated download. However, if you go to https://github.com/tesseract-ocr/tessdata_best or https://github.com/tesseract-ocr/tessdata_fast, manually download the language files you need (when in doubt, just get all of them), and put them into the folder that constants.py also references, you will not need the automated download anymore. Not perfect, but good enough for me.
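The manual route above can also be scripted as a one-off download. This is only a rough sketch: the `main` branch URL pattern is an assumption that may change again upstream, and `target_path`, `fetch_lang`, and the `tessdata` folder name are placeholders, so use the folder your own constants.py points to.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen

# Assumed URL pattern for the tessdata_fast repository; adjust the branch
# name if the repository layout changes again.
TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/raw/main/{}.traineddata'


def target_path(tessdata_dir, lang_name):
    """Return the local path where one language file should be stored."""
    return Path(tessdata_dir) / '{}.traineddata'.format(lang_name)


def fetch_lang(tessdata_dir, lang_name):
    """Download one language file unless it is already present locally."""
    path = target_path(tessdata_dir, lang_name)
    if path.is_file():
        return path
    path.parent.mkdir(parents=True, exist_ok=True)
    # urlopen is evaluated first, so a 404 raises before any file is created.
    with urlopen(TESSDATA_URL.format(lang_name)) as res, open(path, 'wb') as f:
        shutil.copyfileobj(res, f)
    return path
```

Calling `fetch_lang('tessdata', 'eng')` once per language you need should leave the files in place, so the automated download is never triggered for them.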

hw-lunemann commented 2 years ago

I ran into the same issue and putting

TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/blob/main/{}.traineddata?raw=true'

TESSDATA_SCRIPT_URL = 'https://github.com/tesseract-ocr/tessdata_best/blob/main/{}.traineddata?raw=true'

in constants.py fixes the downloading issue!

feanor3 commented 2 years ago

@hadis-git are you sure? It still gives me an error. Substituting {} with the language needed worked.

hw-lunemann commented 2 years ago

Yes, I am sure. The lang parameter in https://github.com/apm1467/videocr/blob/9b97c996570897b5a45d1f8b4f046aebcbcca300/videocr/api.py#L5 is split by '+', and each part is substituted into those constants. The models are then downloaded here: https://github.com/apm1467/videocr/blob/9b97c996570897b5a45d1f8b4f046aebcbcca300/videocr/utils.py#L9

So you have to make sure that your lang parameter corresponds to one or more of the available models.
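To make the substitution concrete, here is a minimal sketch of how a lang string turns into download URLs. It is not the library's actual code: `urls_for_lang` is a made-up helper, and it only covers the TESSDATA_URL constant quoted earlier in this thread.

```python
# URL constant as posted earlier in this thread; {} is the placeholder
# that receives each language name.
TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/blob/main/{}.traineddata?raw=true'


def urls_for_lang(lang):
    """Split a lang string on '+' and build one URL per language name."""
    return [TESSDATA_URL.format(name) for name in lang.split('+')]


for url in urls_for_lang('chi_sim+eng'):
    print(url)
```

Each name in the list has to match a file that actually exists in the tessdata repository, otherwise that URL returns a 404.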

hw-lunemann commented 2 years ago

What's the error you get?

feanor3 commented 2 years ago

Traceback (most recent call last):
  File "example.py", line 6, in <module>
    videocr.save_subtitles_to_file('out.mkv', lang='dan')
  File "C:\Users\CrisMattGiov\AppData\Roaming\Python\Python38\site-packages\videocr\api.py", line 20, in save_subtitles_to_file
    f.write(get_subtitles(
  File "C:\Users\CrisMattGiov\AppData\Roaming\Python\Python38\site-packages\videocr\api.py", line 8, in get_subtitles
    utils.download_lang_data(lang)
  File "C:\Users\CrisMattGiov\AppData\Roaming\Python\Python38\site-packages\videocr\utils.py", line 21, in download_lang_data
    with urlopen(url) as res, open(filepath, 'w+b') as f:
  File "C:\Program Files\Python38\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Program Files\Python38\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Program Files\Python38\lib\urllib\request.py", line 640, in http_response
    response = self.parent.error(
  File "C:\Program Files\Python38\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Program Files\Python38\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "C:\Program Files\Python38\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

hw-lunemann commented 2 years ago

Well, it's telling you that the URL to the language models is wrong. How about you add a print(url) right here? https://github.com/apm1467/videocr/blob/9b97c996570897b5a45d1f8b4f046aebcbcca300/videocr/utils.py#L20

Then you'll see whether you edited the right constants.py.
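Another way to check which constants.py Python actually picks up is to ask the import system for the file path. This is a small sketch using only the standard library; `module_file` is a hypothetical helper name, not part of videocr.

```python
import importlib.util


def module_file(name):
    """Return the file a module name resolves to, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        # Raised when the parent package of a dotted name is not installed.
        return None
    return spec.origin if spec else None
```

Running `print(module_file('videocr.constants'))` with the same interpreter that runs your script shows the path of the file that needs editing. If it differs from the one you changed, you probably have more than one installation.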

hsnfirdaus commented 2 years ago

This is because the branch name of tessdata_fast and tessdata_best changed from master to main, so the URLs in videocr/constants.py must be changed from:

TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/raw/master/{}.traineddata'

TESSDATA_SCRIPT_URL = 'https://github.com/tesseract-ocr/tessdata_best/raw/master/script/{}.traineddata'

to

TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/raw/main/{}.traineddata'

TESSDATA_SCRIPT_URL = 'https://github.com/tesseract-ocr/tessdata_best/raw/main/script/{}.traineddata'

We must wait for the owner of this repository to fix this issue. Otherwise, if you want to change it yourself, edit this file in your pip installation directory. On Linux, if you installed using pip, the directory is ~/.local/lib/python{version}/site-packages/videocr/ or /usr/local/lib/python{version}/dist-packages; search online for the location on other OSes.
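If you would rather not edit the file by hand, the same change can be scripted. The following is only a sketch under the assumption that the installed constants.py contains the `/raw/master/` URLs quoted above; `switch_branch` and `patch_file` are made-up helper names, and you may need write permission for the installation directory.

```python
from pathlib import Path


def switch_branch(text, old='/raw/master/', new='/raw/main/'):
    """Replace the old branch segment in every URL found in the text."""
    return text.replace(old, new)


def patch_file(path):
    """Rewrite the file in place; return True if anything changed."""
    p = Path(path)
    original = p.read_text(encoding='utf-8')
    updated = switch_branch(original)
    if updated != original:
        p.write_text(updated, encoding='utf-8')
    return updated != original
```

Pass `patch_file` the full path of the constants.py in your installation. Keep a copy of the original file first, since reinstalling or upgrading the package will overwrite the change anyway.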

xiaoliwang commented 2 years ago

It should change from

TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/raw/master/{}.traineddata'

TESSDATA_SCRIPT_URL = 'https://github.com/tesseract-ocr/tessdata_best/raw/master/script/{}.traineddata'

to

TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/blob/main/{}.traineddata'

TESSDATA_SCRIPT_URL = 'https://github.com/tesseract-ocr/tessdata_best/blob/main/{}.traineddata'

now.

You can also download the traineddata file manually and put it at filepath.