HairySpoon / htlfc

Hypertext Legacy File Converter
GNU Affero General Public License v3.0

Are metadata somehow preserved? #3

Closed mcepl closed 1 year ago

mcepl commented 1 year ago

With this archive ON WAR AGAINST THE TURK.zip (again, .maff renamed to .zip because of GitHub), I have two problems:

  1. conversion fails
stitny/tmp$ htlfc ON\ WAR\ AGAINST\ THE\ TURK.maff 
Traceback (most recent call last):
  File "/home/matej/.bin/htlfc", line 8, in <module>
    sys.exit(run_htlfc())
  File "/home/matej/.local/lib/python3.10/site-packages/htlfc/__init__.py", line 6, in run_htlfc
    main.main()
  File "/home/matej/.local/lib/python3.10/site-packages/htlfc/main.py", line 125, in main
    target = convert.convert(source) # conversion
  File "/home/matej/.local/lib/python3.10/site-packages/htlfc/merger/convert.py", line 14, in convert
    content = ET(indexhtml)
  File "/home/matej/.local/lib/python3.10/site-packages/htlfc/merger/xmltree.py", line 26, in __init__
    self.__new_tree(filepath)
  File "/home/matej/.local/lib/python3.10/site-packages/htlfc/merger/xmltree.py", line 35, in __new_tree
    raise RuntimeError(f"Unable to read: {filepath}")
RuntimeError: Unable to read: /tmp/tmp5d9w8tuo/1451729551337_119/index.html
stitny/tmp$ 
  2. When I unzip the archive, the issue is not that problematic: it is just a simple HTML file with no additional resources. However, a MAFF archive also contains metadata in its index.rdf file, and it would be a pity to lose it.
HairySpoon commented 1 year ago

About 1. The attached file appears to be a valid .maff; however, its index.html is empty.

You probably know this, but for the record: the maff can be unzipped, or you can run htlfc.py infile outfile --pause and examine index.html.
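For readers who want to check this without unpacking anything, here is a minimal sketch in Python. It relies only on a .maff being a ZIP archive, as noted above. The function names are hypothetical and are not part of HTLFC.

```python
import zipfile


def list_maff_members(path):
    """Return (name, uncompressed size in bytes) for every file in a MAFF (ZIP) archive."""
    with zipfile.ZipFile(path) as archive:
        return [
            (info.filename, info.file_size)
            for info in archive.infolist()
            if not info.is_dir()
        ]


def empty_index_pages(path):
    """Return the names of index.html members whose uncompressed size is zero."""
    return [
        name
        for name, size in list_maff_members(path)
        if name.rsplit("/", 1)[-1] == "index.html" and size == 0
    ]
```

A non-empty result from `empty_index_pages("some.maff")` would point to the situation described here: a structurally valid archive whose page body was never saved.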

More to follow.

HairySpoon commented 1 year ago

About 2. When the maff extension was installed in Firefox and a .maff file was opened, the extension would read index.rdf and display its contents in a banner. Without the extension, I reasoned, there was no point in saving the metadata.

Now I am using SingleFile and have observed its behaviour. Metadata (URL and save date) is embedded in the <!DOCTYPE declaration at the top of the file. That would be easy to replicate with HTLFC, but it would be pointless without a means to render it for the user.

The next part is not so easy (for me at least). It seems (I'm not sure) that SingleFile has embedded code to render the information as an info bar in the top right-hand corner of the page. That is as far as I got.

To replicate this feature, I need example code that creates a sticky element (easy), reads the !DOCTYPE metadata (hard), and uses JS to dynamically open the URL (also hard). It may be possible to get this from the SingleFile code, but I haven't looked at it.
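One way to sidestep the two hard parts is to bake the values into the bar at conversion time, so nothing has to be read back from the page and a plain link replaces the JS. The sketch below is in Python and assumes the URL and save date are already known to the converter. It is not SingleFile's approach and not HTLFC's actual code, and the function names and inline styling are hypothetical.

```python
import html
import re


def build_info_bar(url=None, saved=None):
    """Build a sticky <div> showing the source URL and/or save date; "" if neither is known."""
    parts = []
    if url:
        safe = html.escape(url)
        # Only http(s) URLs become clickable; anything else is shown as plain text.
        if url.lower().startswith(("http://", "https://")):
            parts.append(f'Saved from <a href="{safe}">{safe}</a>')
        else:
            parts.append(f"Saved from {safe}")
    if saved:
        label = "on" if url else "Saved on"
        parts.append(f"{label} {html.escape(saved)}")
    if not parts:
        return ""
    return (
        '<div style="position:sticky;top:0;z-index:9999;'
        'padding:4px 8px;background:#ffd;font:12px sans-serif">'
        + " ".join(parts)
        + "</div>"
    )


def insert_info_bar(page, bar):
    """Insert the bar right after the opening <body> tag, or prepend it if no tag is found."""
    if not bar:
        return page
    match = re.search(r"<body\b[^>]*>", page, flags=re.IGNORECASE)
    if match is None:
        return bar + page
    return page[: match.end()] + bar + page[match.end():]
```

Because the bar is ordinary HTML, it renders without any extension, and a file with no recorded URL still gets a date-only bar.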

mcepl commented 1 year ago

About 2. When the maff extension was installed in Firefox and a .maff file was opened, the extension would read index.rdf and display its contents in a banner. Without the extension, I reasoned, there was no point in saving the metadata.

You are right, and yes, that's pretty ugly. The main purpose of MAFF was to save web pages for offline (or archival) use. Anyway, this ticket could be closed.

Except for the discussion of metadata. I don't know how to display it either, but if it were at least preserved somehow, even <meta name="something" content="some content"> would be better than throwing that metadata away, in my opinion.
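A minimal sketch of that idea in Python: read index.rdf and turn each property into a <meta> tag. It assumes that each property in index.rdf is an element carrying its value in a `resource` attribute (as the originalurl resource does), and it matches on local names so that it does not depend on exact namespace URIs. The function names and the `maff:` name prefix are hypothetical, not something HTLFC or MAFF defines.

```python
import html
import xml.etree.ElementTree as ET


def read_maff_metadata(rdf_text):
    """Collect {property: value} from index.rdf text, matching on local names only."""
    metadata = {}
    root = ET.fromstring(rdf_text)
    for element in root.iter():
        # Tags look like "{namespace-uri}originalurl"; keep only the local part.
        name = element.tag.rsplit("}", 1)[-1].lower()
        for attr, value in element.attrib.items():
            if attr.rsplit("}", 1)[-1].lower() == "resource" and value:
                metadata[name] = value
    return metadata


def as_meta_tags(metadata, prefix="maff:"):
    """Render the metadata as one <meta> tag per property (the prefix is an assumption)."""
    return "\n".join(
        f'<meta name="{html.escape(prefix + key)}" content="{html.escape(value)}">'
        for key, value in sorted(metadata.items())
    )
```

The resulting tags could be placed in the <head> of the converted file, which keeps the information in the file even when nothing displays it.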

HairySpoon commented 1 year ago

Good news. I implemented a simple info bar (without the features of SingleFile). It works with metadata from maff and mht files. However, some old maff files in my collection lacked the originalurl resource, so for these the bar shows only the date.

Please test with your files.

mcepl commented 1 year ago

Waiting on https://github.com/HairySpoon/htlfc/issues/4