erikas-taroza / redshelf_downloader

Downloads textbooks from Redshelf
MIT License
4 stars, 4 forks

help i keep getting this error #5

Open collegesucks opened 2 months ago

collegesucks commented 2 months ago

Like the title says, I keep getting this error. Here it is in full:

PS D:\redshelf_downloader-master\redshelf_downloader-master> python scrape.py
[Thread-1 (download_thread)] Downloading page 1
Exception in thread Thread-1 (download_thread):
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.1520.0_x64__qbz5n2kfra8p0\Lib\threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.1520.0_x64__qbz5n2kfra8p0\Lib\threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "D:\redshelf_downloader-master\redshelf_downloader-master\scrape.py", line 115, in download_thread
    download_page(i)
  File "D:\redshelf_downloader-master\redshelf_downloader-master\scrape.py", line 82, in download_page
    base_url = get_base_url(raw)
               ^^^^^^^^^^^^^^^^^
  File "D:\redshelf_downloader-master\redshelf_downloader-master\scrape.py", line 38, in get_base_url
    return re.search("<base href=\"(.*?/(OPS|OEBPS)).*\"/>", raw).group(1)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'group'
[Thread-2 (convert_thread)] Converting page 1 to PDF
Exception in thread Thread-2 (convert_thread):
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.1520.0_x64__qbz5n2kfra8p0\Lib\threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.1520.0_x64__qbz5n2kfra8p0\Lib\threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "D:\redshelf_downloader-master\redshelf_downloader-master\scrape.py", line 121, in convert_thread
    convert_html_to_pdf(i)
  File "D:\redshelf_downloader-master\redshelf_downloader-master\scrape.py", line 89, in convert_html_to_pdf
    html = Path(f"{PAGE_PATH}/{page}/html/{page}.html").read_text(encoding="utf-8")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.1520.0_x64__qbz5n2kfra8p0\Lib\pathlib.py", line 1027, in read_text
    with self.open(mode='r', encoding=encoding, errors=errors) as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.1520.0_x64__qbz5n2kfra8p0\Lib\pathlib.py", line 1013, in open
    return io.open(self, mode, buffering, encoding, errors, newline)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'pages\1\html\1.html'
Merging PDF files
Traceback (most recent call last):
  File "D:\redshelf_downloader-master\redshelf_downloader-master\scrape.py", line 152, in <module>
    merge_pdf_files()
  File "D:\redshelf_downloader-master\redshelf_downloader-master\scrape.py", line 104, in merge_pdf_files
    main_pdf = pymupdf.open(Path(f"{PAGE_PATH}/1/1.pdf"))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\mattt\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pymupdf\__init__.py", line 2763, in __init__
    raise FileNotFoundError(msg)
pymupdf.FileNotFoundError: no such file: 'pages\1\1.pdf'
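For anyone reading the traceback: `re.search` returns `None` when the pattern finds nothing, so the `.group(1)` call is what raises the `AttributeError`. The two `FileNotFoundError`s afterwards follow from page 1 never being written. Below is a minimal sketch of a guarded lookup. It assumes the pattern is the one shown in the traceback, with the `*` quantifiers that the issue formatting seems to have swallowed. `find_base_url` is a hypothetical name, not a function in scrape.py.

```python
import re

# Pattern as it appears in the traceback. The "*" quantifiers are an
# assumption: the issue formatting seems to have dropped them.
BASE_HREF = re.compile(r'<base href="(.*?/(OPS|OEBPS)).*"/>')


def find_base_url(raw: str) -> str:
    """Return the base URL from a page's HTML, or raise a readable error."""
    match = BASE_HREF.search(raw)
    if match is None:
        # search() found nothing, so the response has no matching <base> tag.
        raise ValueError("no <base href> with an OPS/OEBPS path in the response")
    return match.group(1)
```

With a guard like this the run would stop at the first bad response with a message about the HTML itself, rather than at `.group(1)`.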

DukeRupert commented 2 months ago

This pull request may fix your issue. #6

erikas-taroza commented 2 months ago

Hello.

Can you please send the raw html that you get? You can do this by adding print(raw) after line 81:

def download_page(page: int):
    path = Path(f"{PAGE_PATH}/{page}")

    if not os.path.exists(path):
        os.mkdir(path)

    raw = get_raw_html(page)
+   print(raw)
    base_url = get_base_url(raw)
    remote_urls = get_remote_urls(raw)
    download_remote_resources(page, base_url, remote_urls)
    create_html_file(page, raw)

DaystromInst commented 2 months ago

My friend has been having this issue and this is the raw html that he got: <html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" xmlns:ccms="www.cengage.com/SOA/ccms" lang="en" xml:lang="en">

DukeRupert commented 2 months ago

Yeah, I’m busy but should be able to get that to you in a few hours.

erikas-taroza commented 2 months ago

This isn't the full html file. If that is what is being printed, try saving the file instead.
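One way to save the response instead of printing it, as a rough sketch. `dump_raw_html` and the `debug` folder are made-up names, and `raw` is assumed to be the string returned by `get_raw_html(page)`.

```python
from pathlib import Path


def dump_raw_html(page: int, raw: str, out_dir: str = "debug") -> Path:
    """Write the raw response for one page to out_dir/raw_<page>.html."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"raw_{page}.html"
    # utf-8 is the encoding scrape.py uses when it reads page HTML back.
    target.write_text(raw, encoding="utf-8")
    return target
```

Calling `dump_raw_html(page, raw)` right after `raw = get_raw_html(page)` should leave a file that can be attached here in full.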

akoot commented 2 months ago

I am having the same issue and this is what is in the html:

<!doctype html><html lang="en"><head><meta charset="utf-8"/><meta name="viewport" content="width=device-width,initial-scale=1"/><meta name="theme-color" content="#000000"/><meta name="description" content="RedShelf Reader"/><title>RedShelf Reader</title><link rel="shortcut icon" href="/read/static/images/favicon.ico"/><link rel="icon" type="image/png" href="/read/static/images/icon.png"/><link rel="icon" sizes="192x192" href="/read/static/images/touch-icon-192x192.png"/><link rel="apple-touch-icon-precomposed" sizes="180x180" href="/read/static/images/apple-touch-icon-180x180-precomposed.png"/><link rel="apple-touch-icon-precomposed" sizes="152x152" href="/read/static/images/apple-touch-icon-152x152-precomposed.png"/><link rel="apple-touch-icon-precomposed" sizes="120x120" href="/read/static/images/apple-touch-icon-120x120-precomposed.png"/><link rel="apple-touch-icon-precomposed" sizes="76x76" href="/read/static/images/apple-touch-icon-76x76-precomposed.png"/><link rel="apple-touch-icon-precomposed" href="/read/static/images/apple-touch-icon-precomposed.png"/><script>window.MathJax={options:{renderActions:{addMenu:[],checkLoading:[]},ignoreHtmlClass:"tex2jax_ignore",processHtmlClass:"tex2jax_process"},tex:{autoload:{color:[],colorV2:["color"]},packages:{"[+]":["noerrors"]}},loader:{load:["input/asciimath","[tex]/noerrors"]}}</script><script src="/read/static/js/MathJax-3.2.0/es5/tex-mml-chtml.js" id="MathJax-script"></script><script>if("serviceWorker"in navigator){window.addEventListener("load",(async()=>{try{window.readerServiceWorker=(await navigator.serviceWorker.register("/read/sw.js"))?.active}catch(e){console.error("RedShelf eReader Service Worker registration failed.",e)}}));let e=!1;navigator.serviceWorker.addEventListener("controllerchange",(()=>{e||(e=!0,window.location.reload())}))}</script><style>body{overscroll-behavior-y:none;overflow:hidden;position:fixed}</style><script defer="defer" src="/read/static/js/main.8e1079a1.js"></script><link href="/read/static/css/main.4b2ac2d4.css" rel="stylesheet"></head><body class="reader-body"><noscript>You need to enable JavaScript to run this app.</noscript><div id="root"></div></body></html>

Comrade-Spood commented 1 month ago

Is this issue being worked on, or has it been fixed? I have a textbook I'd like to use this on, and I have a time limit to do it.

erikas-taroza commented 1 month ago

@Comrade-Spood Sorry, I am a little busy with other things. Do you get the same HTML file as @akoot did? If it says you need JavaScript enabled, then perhaps you are using an incorrect URL to get the page.
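A quick way to tell the two cases apart, sketched under assumptions: the reader shell is recognized by the `<noscript>` notice and the empty `root` div from the HTML posted above, and a real book page by a `<base href=` tag, which is what the script's regex looks for. `looks_like_reader_shell` is a hypothetical helper, not part of scrape.py.

```python
def looks_like_reader_shell(raw: str) -> bool:
    """Guess whether raw is the JavaScript reader app rather than a book page."""
    # A real page is assumed to carry the <base href=...> tag the script parses.
    has_base = "<base href=" in raw
    # Markers taken from the reader HTML posted earlier in this thread.
    shell_markers = (
        "You need to enable JavaScript to run this app.",
        '<div id="root"></div>',
    )
    return not has_base and any(marker in raw for marker in shell_markers)
```

If this returns `True` for the response, the URL being requested is probably the reader page itself and not the page content.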

Comrade-Spood commented 1 month ago

I got what DaystromInst got. For some reason I could not figure out how to get it to give me the right html file.