Open collegesucks opened 2 months ago
This pull request may fix your issue. #6
Hello.
Can you please send the raw html that you get? You can do this by adding print(raw)
after line 81:
      def download_page(page: int):
          path = Path(f"{PAGE_PATH}/{page}")
          if not os.path.exists(path):
              os.mkdir(path)
          raw = get_raw_html(page)
    +     print(raw)
          base_url = get_base_url(raw)
          remote_urls = get_remote_urls(raw)
          download_remote_resources(page, base_url, remote_urls)
          create_html_file(page, raw)
My friend has been having this issue and this is the raw html that he got:
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" xmlns:ccms="www.cengage.com/SOA/ccms" lang="en" xml:lang="en">
Yeah, I’m busy but should be able to get that to you in a few hours.
On Tue, Sep 3, 2024 at 18:13, Jared Marriner wrote: "Can you please send the raw html that you get?"
This isn't the full html file. If that is what is being printed, try saving the file instead.
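Saving the response to a file could look something like the sketch below. `dump_raw_html` and the `debug_raw.html` file name are made up for illustration and are not part of scrape.py; `raw` is assumed to be the string that `get_raw_html(page)` returns.

```python
from pathlib import Path


def dump_raw_html(raw: str, out_file: str = "debug_raw.html") -> Path:
    """Write the fetched HTML to disk so the whole document can be inspected or attached."""
    path = Path(out_file)
    # Writing with an explicit encoding keeps the full text, independent of
    # terminal scrollback or console encoding.
    path.write_text(raw, encoding="utf-8")
    return path


if __name__ == "__main__":
    # Stand-in for the value returned by get_raw_html(page).
    sample = "<html><head></head><body>...</body></html>"
    saved = dump_raw_html(sample)
    print(f"Saved {saved.stat().st_size} bytes to {saved}")
```

Calling `dump_raw_html(raw)` in place of `print(raw)` leaves a file that can be attached to the issue in full.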
I am having the same issue and this is what is in the html:
<!doctype html><html lang="en"><head><meta charset="utf-8"/><meta name="viewport" content="width=device-width,initial-scale=1"/><meta name="theme-color" content="#000000"/><meta name="description" content="RedShelf Reader"/><title>RedShelf Reader</title><link rel="shortcut icon" href="/read/static/images/favicon.ico"/><link rel="icon" type="image/png" href="/read/static/images/icon.png"/><link rel="icon" sizes="192x192" href="/read/static/images/touch-icon-192x192.png"/><link rel="apple-touch-icon-precomposed" sizes="180x180" href="/read/static/images/apple-touch-icon-180x180-precomposed.png"/><link rel="apple-touch-icon-precomposed" sizes="152x152" href="/read/static/images/apple-touch-icon-152x152-precomposed.png"/><link rel="apple-touch-icon-precomposed" sizes="120x120" href="/read/static/images/apple-touch-icon-120x120-precomposed.png"/><link rel="apple-touch-icon-precomposed" sizes="76x76" href="/read/static/images/apple-touch-icon-76x76-precomposed.png"/><link rel="apple-touch-icon-precomposed" href="/read/static/images/apple-touch-icon-precomposed.png"/><script>window.MathJax={options:{renderActions:{addMenu:[],checkLoading:[]},ignoreHtmlClass:"tex2jax_ignore",processHtmlClass:"tex2jax_process"},tex:{autoload:{color:[],colorV2:["color"]},packages:{"[+]":["noerrors"]}},loader:{load:["input/asciimath","[tex]/noerrors"]}}</script><script src="/read/static/js/MathJax-3.2.0/es5/tex-mml-chtml.js" id="MathJax-script"></script><script>if("serviceWorker"in navigator){window.addEventListener("load",(async()=>{try{window.readerServiceWorker=(await navigator.serviceWorker.register("/read/sw.js"))?.active}catch(e){console.error("RedShelf eReader Service Worker registration failed.",e)}}));let e=!1;navigator.serviceWorker.addEventListener("controllerchange",(()=>{e||(e=!0,window.location.reload())}))}</script><style>body{overscroll-behavior-y:none;overflow:hidden;position:fixed}</style><script defer="defer" src="/read/static/js/main.8e1079a1.js"></script><link 
href="/read/static/css/main.4b2ac2d4.css" rel="stylesheet"></head><body class="reader-body"><noscript>You need to enable JavaScript to run this app.</noscript><div id="root"></div></body></html>
Is this issue being worked on, or has it been fixed? I have a textbook I'd like to use this on, and I have a time limit to do it.
@Comrade-Spood Sorry I am a little busy with other things. Do you get the same html file as @akoot did? If it says you need javascript enabled then perhaps you are using an incorrect url to get the page.
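One way to tell the two responses apart before parsing is a quick heuristic check, sketched below. `looks_like_reader_shell` is a hypothetical helper, not something in scrape.py; the markers it looks for are taken from the HTML posted above (the `<noscript>` notice and the empty `<div id="root">`), and the assumption is that a real book page carries a `<base href=...>` tag while the reader shell does not.

```python
import re


def looks_like_reader_shell(raw: str) -> bool:
    """Heuristic: True if `raw` looks like the JavaScript reader shell, not a book page."""
    # Book pages are assumed to contain a <base href="..."> tag.
    has_base_tag = re.search(r"<base\s+href=", raw) is not None
    # Markers copied from the RedShelf Reader HTML posted in this thread.
    has_noscript_notice = "You need to enable JavaScript to run this app." in raw
    has_empty_root = '<div id="root"></div>' in raw
    return not has_base_tag and (has_noscript_notice or has_empty_root)


if __name__ == "__main__":
    shell = (
        '<body class="reader-body"><noscript>You need to enable JavaScript '
        'to run this app.</noscript><div id="root"></div></body>'
    )
    page = '<head><base href="https://example.com/book/OPS/"/></head>'
    print(looks_like_reader_shell(shell))  # True
    print(looks_like_reader_shell(page))   # False
```

If the check returns True, the URL being requested is probably the reader app rather than the page content, which matches the "incorrect url" suspicion.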
I got what DaystromInst got. For some reason I could not figure out how to get it to give me the right html file.
Like the title says, I keep getting this. Here is the full output:

    PS D:\redshelf_downloader-master\redshelf_downloader-master> python scrape.py
    [Thread-1 (download_thread)] Downloading page 1
    Exception in thread Thread-1 (download_thread):
    Traceback (most recent call last):
      File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.1520.0_x64__qbz5n2kfra8p0\Lib\threading.py", line 1075, in _bootstrap_inner
        self.run()
      File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.1520.0_x64__qbz5n2kfra8p0\Lib\threading.py", line 1012, in run
        self._target(*self._args, **self._kwargs)
      File "D:\redshelf_downloader-master\redshelf_downloader-master\scrape.py", line 115, in download_thread
        download_page(i)
      File "D:\redshelf_downloader-master\redshelf_downloader-master\scrape.py", line 82, in download_page
        base_url = get_base_url(raw)
                   ^^^^^^^^^^^^^^^^^
      File "D:\redshelf_downloader-master\redshelf_downloader-master\scrape.py", line 38, in get_base_url
        return re.search("<base href=\"(.*?/(OPS|OEBPS)).*\"/>", raw).group(1)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    AttributeError: 'NoneType' object has no attribute 'group'
    [Thread-2 (convert_thread)] Converting page 1 to PDF
    Exception in thread Thread-2 (convert_thread):
    Traceback (most recent call last):
      File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.1520.0_x64__qbz5n2kfra8p0\Lib\threading.py", line 1075, in _bootstrap_inner
        self.run()
      File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.1520.0_x64__qbz5n2kfra8p0\Lib\threading.py", line 1012, in run
        self._target(*self._args, **self._kwargs)
      File "D:\redshelf_downloader-master\redshelf_downloader-master\scrape.py", line 121, in convert_thread
        convert_html_to_pdf(i)
      File "D:\redshelf_downloader-master\redshelf_downloader-master\scrape.py", line 89, in convert_html_to_pdf
        html = Path(f"{PAGE_PATH}/{page}/html/{page}.html").read_text(encoding="utf-8")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.1520.0_x64__qbz5n2kfra8p0\Lib\pathlib.py", line 1027, in read_text
        with self.open(mode='r', encoding=encoding, errors=errors) as f:
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.1520.0_x64__qbz5n2kfra8p0\Lib\pathlib.py", line 1013, in open
        return io.open(self, mode, buffering, encoding, errors, newline)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    FileNotFoundError: [Errno 2] No such file or directory: 'pages\1\html\1.html'
    Merging PDF files
    Traceback (most recent call last):
      File "D:\redshelf_downloader-master\redshelf_downloader-master\scrape.py", line 152, in <module>
        merge_pdf_files()
      File "D:\redshelf_downloader-master\redshelf_downloader-master\scrape.py", line 104, in merge_pdf_files
        main_pdf = pymupdf.open(Path(f"{PAGE_PATH}/1/1.pdf"))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\mattt\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pymupdf\__init__.py", line 2763, in __init__
        raise FileNotFoundError(msg)
    pymupdf.FileNotFoundError: no such file: 'pages\1\1.pdf'
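Reading the output above, the first error looks like the root cause: `re.search` returns `None` when the page has no matching `<base href>` tag, so `.group(1)` fails, and the two `FileNotFoundError`s seem to follow only because the download thread died before it wrote `1.html`. A guarded lookup along the lines of the sketch below would surface that directly. `get_base_url_checked` is a hypothetical name, and the pattern is assumed to be `(.*?/(OPS|OEBPS)).*`, with the asterisks that the issue formatting appears to have swallowed restored.

```python
import re

# Assumed to match the pattern used at scrape.py line 38, written as a raw string.
BASE_HREF_PATTERN = re.compile(r'<base href="(.*?/(OPS|OEBPS)).*"/>')


def get_base_url_checked(raw: str) -> str:
    """Return the base URL up to /OPS or /OEBPS, or raise a descriptive error."""
    match = BASE_HREF_PATTERN.search(raw)
    if match is None:
        # re.search returns None on no match; fail with a message that points
        # at the likely cause instead of an AttributeError on None.
        raise ValueError(
            "No <base href> ending in /OPS or /OEBPS found; the response is "
            "probably not a book page (check the URL and cookies)."
        )
    return match.group(1)


if __name__ == "__main__":
    good = '<html><head><base href="https://example.com/book/OPS/"/></head></html>'
    print(get_base_url_checked(good))  # https://example.com/book/OPS
    try:
        get_base_url_checked("<html><body><div id='root'></div></body></html>")
    except ValueError as err:
        print(f"Skipping page: {err}")
```

This does not fix the download itself; it only turns the three stacked tracebacks into a single message about the unexpected HTML.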