apm1467 / videocr

Extract hardcoded subtitles from videos using machine learning
MIT License
509 stars 119 forks

Can anyone help me how to use? #23

Open Ruke805 opened 4 years ago

Ruke805 commented 4 years ago

I'm trying to understand how to make it work, but it's all very confusing. I'm using Windows 10, I already have Python installed, and I already have tesseract working and added to PATH, but I don't know how to make it work. I tried to follow what is explained in this issue: https://github.com/apm1467/videocr/issues/2 and created the file get_sub.py.

I put the video and all the scripts in the same folder, but when I run it, I get this error:

Traceback (most recent call last):
  File "C:\Users\user\Programs\Python 3.7\venv\Lib\site-packages\videocr\get_sub.py", line 3, in <module>
    import video
  File "C:\Users\user\Python 3.7\venv\Lib\site-packages\videocr\video.py", line 8, in <module>
    from . import constants
ImportError: attempted relative import with no known parent package

Could someone please help me?

bassSoul commented 4 years ago

I am trying to figure it out as well, despite having absolutely no background with this stuff. I think it's working as I type this. I would extract the videocr folder downloaded from here to your desktop instead (just simpler). Have you installed pip and then videocr, as suggested under the installation section?

We will try to figure this out together.

Edit: it ended up working for me like I said, but the results were not great. I've moved on to using Subrip and FineReader, which is quite tedious.
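For anyone hitting the same ImportError: it usually appears when a file that lives inside a package folder is run directly as a script, so Python no longer knows which package `from . import constants` refers to. Here is a minimal sketch that reproduces this with a throwaway package. The names `demo_pkg`, `inside.py` and `outside.py` are made up for illustration and are not part of videocr.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def demo_relative_import():
    """Build a throwaway package and run it two ways; return both results."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        pkg = root / "demo_pkg"  # hypothetical package, stands in for videocr
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "constants.py").write_text("GREETING = 'hello'\n")
        # Same kind of relative import as in the traceback above.
        (pkg / "video.py").write_text(
            "from . import constants\n"
            "def greet():\n"
            "    return constants.GREETING\n"
        )
        # Script placed INSIDE the package: imports its sibling by bare name.
        (pkg / "inside.py").write_text("import video\nprint(video.greet())\n")
        # Script placed OUTSIDE the package: imports through the package name.
        (root / "outside.py").write_text(
            "from demo_pkg import video\nprint(video.greet())\n"
        )

        inside = subprocess.run(
            [sys.executable, str(pkg / "inside.py")],
            capture_output=True, text=True,
        )
        outside = subprocess.run(
            [sys.executable, str(root / "outside.py")],
            capture_output=True, text=True,
        )
        return inside, outside


inside, outside = demo_relative_import()
print(inside.stderr.strip().splitlines()[-1])  # the ImportError line
print(outside.stdout.strip())
```

If the same thing is going on here, keeping get_sub.py outside the site-packages\videocr folder and importing through the package name (assuming videocr was installed with pip) should avoid the error.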

johan456789 commented 4 years ago

I've moved on to using Subrip and FineReader, which is quite tedious.

@bassSoul How well does Subrip work?

bassSoul commented 4 years ago

@johan456789 Depends on the quality and font formatting of the text but overall does a good job if it's fairly standard. It sometimes generates duplicates or blanks, which you have to manually go through. This also could just be because the font I'm working with is brutal and I'm working with animation, which produces more false positives.

Note that ABBYY FineReader is required; Subrip alone won't do the trick. You need to OCR the images into separate .txt files named according to your exported images. I believe only FineReader is capable of doing this as a batch export.
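A small sketch for checking the naming step, assuming "named according to your exported images" means each image needs a .txt file with the same base name in the same folder. The function name and the list of image extensions are my own assumptions, not something Subrip or FineReader prescribes.

```python
from pathlib import Path


def missing_text_files(folder, image_exts=(".bmp", ".png", ".jpg")):
    """Return names of images in `folder` that have no same-named .txt file.

    `image_exts` is an assumed set of export formats; adjust it to
    whatever your exported images actually use.
    """
    folder = Path(folder)
    missing = []
    for image in sorted(folder.iterdir()):
        is_image = image.suffix.lower() in image_exts
        if is_image and not image.with_suffix(".txt").exists():
            missing.append(image.name)
    return missing
```

Running `missing_text_files("path/to/exported/images")` after the batch OCR lists any images that still lack a text file.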

p2635 commented 4 years ago

@theruleof4 unfortunately even those of us who can make it work aren't getting results. I suggest you look at the comments above to see if you can use SubRip or FineReader instead.

Plaidstallion commented 3 years ago

I was wondering if Subtitle Edit is capable of doing any or all of this process. It can OCR PGS subtitles with Tesseract. It seems like it should technically be able to read the images put out by VideoSubFinder. I can't get Subrip to work, personally.

Johaan01 commented 3 years ago

I don't know if I'm right, but isn't this code bricked because the Tesseract data files were moved? In the README you can find this link: this page. I was also checking some of the code in the constants.py file, and some of the data file URLs were also moved (the URLs now give a 404 status). You can see for yourself:

TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/raw/master/{}.traineddata'

TESSDATA_SCRIPT_URL = 'https://github.com/tesseract-ocr/tessdata_best/raw/master/script/{}.traineddata'

I think it could be fixed by updating the links, but since Tesseract has changed so much, I doubt it would still work.
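To check this yourself, something like the following could be used. It is only a sketch: the two templates are copied from the comment above, while `build_url` and `url_status` are helper names invented here, and the language name passed in (for example `eng`) is up to you.

```python
import urllib.error
import urllib.request

# Templates copied from the comment above; '{}' is the language or script name.
TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/raw/master/{}.traineddata'
TESSDATA_SCRIPT_URL = 'https://github.com/tesseract-ocr/tessdata_best/raw/master/script/{}.traineddata'


def build_url(template, name):
    """Fill a URL template with a language or script name."""
    return template.format(name)


def url_status(url, timeout=10):
    """Return the HTTP status code for `url` (needs network access).

    The request starts as HEAD so the data file is not read;
    urllib follows redirects on its own.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 404 and other error statuses arrive as exceptions.
        return err.code
```

Calling `url_status(build_url(TESSDATA_URL, 'eng'))` returns the status code, so a 404 here would confirm that the file is no longer at that address.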

devmaxxing commented 2 years ago

FYI I've created a working fork that uses PaddleOCR instead of Tesseract: https://github.com/oliverfei/videocr-PaddleOCR