danmou / onenote_export

This Python script exports all the OneNote notebooks linked to your Microsoft account to HTML files.
MIT License
178 stars, 43 forks

add check for attrs attribute in MyHTMLParser #13

Closed: thfrei closed this 2 years ago

thfrei commented 2 years ago

Add a check for parser.attrs

fixes #10

danmou commented 2 years ago

Thanks for the PR! However, I don't think this is the right solution. It seems like the HTML parser doesn't detect a start tag. Could you add a print for tag_match[0] and let me know what value it has in the case that fails?
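To illustrate the failure mode described in this comment, here is a minimal sketch of a parser that only sets `attrs` when it sees a start tag. `AttrCapturingParser` and `parse_img_attrs` are hypothetical stand-ins; the script's real `MyHTMLParser` may be written differently. The point is only that `html.parser.HTMLParser` does not call `handle_starttag` when the text fed to it is not a complete tag, so the attribute is never created.

```python
from html.parser import HTMLParser


class AttrCapturingParser(HTMLParser):
    """Hypothetical stand-in for MyHTMLParser: stores the last tag's attributes."""

    def handle_starttag(self, tag, attrs):
        # Only runs when the parser recognises a complete start tag.
        self.attrs = dict(attrs)


def parse_img_attrs(tag_text):
    """Return the attributes of the tag, or None if no start tag was detected."""
    parser = AttrCapturingParser()
    parser.feed(tag_text)
    # getattr with a default avoids the AttributeError without hiding the cause.
    return getattr(parser, "attrs", None)


complete = parse_img_attrs('<img alt="ok" src="a.png" />')
incomplete = parse_img_attrs('<img alt="unterminated value />')

print("complete tag:  ", complete)
print("incomplete tag:", incomplete)
```

Returning `None` here only makes the condition visible. It does not explain why the tag text is incomplete, which is what the requested print of tag_match[0] is meant to show.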

thfrei commented 2 years ago

Thank you, Danmou, for your incredibly fast response, and sorry for my delay. You are probably right; I just hacked something together so that it would work.

I think I have to abandon using the API, since I can't get any of my hand drawings out of it. I think I'll stick to something like an .mht export and then extract the base64 images.
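As a rough sketch of the .mht route mentioned above: an .mht file is a MIME multipart document, so Python's standard `email` package can walk its parts and decode the base64 image payloads. The function name `extract_images_from_mht` and the output file naming are assumptions made for illustration, and this has not been checked against an actual OneNote export.

```python
import email
from email import policy
from pathlib import Path
from typing import List


def extract_images_from_mht(mht_bytes: bytes, out_dir: Path) -> List[Path]:
    """Write every image/* MIME part of an .mht document into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    message = email.message_from_bytes(mht_bytes, policy=policy.default)
    saved = []
    for index, part in enumerate(message.walk()):
        if part.get_content_maintype() != "image":
            continue
        # decode=True undoes the Content-Transfer-Encoding (typically base64).
        data = part.get_payload(decode=True)
        if not data:
            continue
        target = out_dir / f"image_{index}.{part.get_content_subtype()}"
        target.write_bytes(data)
        saved.append(target)
    return saved
```

A call might look like `extract_images_from_mht(Path("notebook.mht").read_bytes(), Path("images"))`, where both paths are placeholders.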

maphew commented 2 years ago

Here are the results of my tag_match[0] print, at commit b098dc6e553d64725aa46b19ae352c20f5099d75

...snip...
    Opening page 7 Fossil CMS
      HTML file already exists; skipping this page
    Opening page 8 Fossil learning TH1
      Got content of length 19954
parser.feed tag_match: <img alt="THI:
global State flags
global State user
Run THI
&lt;Base href=&quot; $baseurl/$current_page
&lt;meta
&lt;meta http—egui•F-&quot;Content—security—policy&quot; csp&quot; />
127.0.0.1 - - [04/Jan/2022 07:48:47] "GET /getToken?code=M.R3_BAY.225081db-b1a8-f945-ac69-3ee570dcadd4&state=a6efcfda-9d5f-4e28-98dd-93b766dd6b55 HTTP/1.1" 500 -
Traceback (most recent call last):
  File "C:\ProgramData\scoop\apps\miniconda3\4.10.3\Lib\site-packages\flask\app.py", line 2091, in __call__
    return self.wsgi_app(environ, start_response)
  File "C:\ProgramData\scoop\apps\miniconda3\4.10.3\Lib\site-packages\flask\app.py", line 2076, in wsgi_app
    response = self.handle_exception(e)
  File "C:\ProgramData\scoop\apps\miniconda3\4.10.3\Lib\site-packages\flask\app.py", line 2073, in wsgi_app
    response = self.full_dispatch_request()
  File "C:\ProgramData\scoop\apps\miniconda3\4.10.3\Lib\site-packages\flask\app.py", line 1518, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "C:\ProgramData\scoop\apps\miniconda3\4.10.3\Lib\site-packages\flask\app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "C:\ProgramData\scoop\apps\miniconda3\4.10.3\Lib\site-packages\flask\app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "C:\Users\Matt\code\onenote_export\onenote_export.py", line 240, in main_logic
    download_notebooks(graph_client, app.config['output_path'], app.config['select_path'], indent=0)
  File "C:\Users\Matt\code\onenote_export\onenote_export.py", line 179, in download_notebooks
    download_sections(graph_client, sections, path / nb_name, select, indent=indent + 1)
  File "C:\Users\Matt\code\onenote_export\onenote_export.py", line 200, in download_sections
    download_pages(graph_client, pages, path / sec_name, select, indent=indent + 1)
  File "C:\Users\Matt\code\onenote_export\onenote_export.py", line 216, in download_pages
    download_page(graph_client, page['contentUrl'], page_dir, indent=indent + 1)
  File "C:\Users\Matt\code\onenote_export\onenote_export.py", line 229, in download_page
    content = download_attachments(graph_client, content, path, indent=indent)
  File "C:\Users\Matt\code\onenote_export\onenote_export.py", line 150, in download_attachments
    content = re.sub(r"<img .*?\/>", download_image, content, flags=re.DOTALL)
  File "C:\ProgramData\scoop\apps\miniconda3\4.10.3\Lib\re.py", line 210, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "C:\Users\Matt\code\onenote_export\onenote_export.py", line 112, in download_image
    props = parser.attrs
AttributeError: 'MyHTMLParser' object has no attribute 'attrs'

maphew commented 2 years ago

Here's a screenshot of the portion of the page it seems to be having trouble with. It's two images embedded in a table.

image

maphew commented 2 years ago

The complete tag match text is:

<img alt="THI:
global State flags
global State user
Run THI
&lt;Base href=&quot; $baseurl/$current_page
&lt;meta
&lt;meta http—egui•F-&quot;Content—security—policy&quot; csp&quot; />

(I updated the debug statement to if debug: print(f"parser.feed tag_match: '''{tag_match[0]}''' ") to make it easier to see.)
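One possible reading of this output, offered as an assumption rather than a confirmed diagnosis: the pattern `<img .*?\/>` from the traceback is non-greedy, so it stops at the first `/>` it finds, and here the alt text itself ends in `/>`. The match would then end inside the alt attribute, before its closing quote and before any src attribute. The sketch below reproduces that with a made-up sample string. `QUOTE_AWARE_IMG` is a hypothetical alternative pattern that skips over quoted attribute values; it is not the project's fix.

```python
import re

# Pattern as it appears in the traceback (download_attachments).
IMG_PATTERN = re.compile(r"<img .*?\/>", flags=re.DOTALL)

# Hypothetical alternative: treat quoted attribute values as opaque units,
# so a "/>" or ">" inside alt="..." cannot end the match early.
QUOTE_AWARE_IMG = re.compile(r"""<img\b(?:[^>"']|"[^"]*"|'[^']*')*>""")

# Made-up sample: the alt text contains "/>" before the attribute closes.
sample = (
    '<td><img alt="OCR text: &lt;meta csp&quot; /> more text" '
    'src="https://example.invalid/a.png" /></td>'
)

truncated_match = IMG_PATTERN.search(sample)[0]
full_match = QUOTE_AWARE_IMG.search(sample)[0]

print("non-greedy pattern:", truncated_match)
print("quote-aware pattern:", full_match)
```

If this reading is right, the truncated match contains an odd number of double quotes, which would explain why the HTML parser never reports a start tag for it.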