danny0838 / webscrapbook

A browser extension that captures web pages to local device or backend server for future retrieval, organization, annotation, and edit. This project inherits from legacy Firefox add-on ScrapBook X.
Mozilla Public License 2.0
903 stars, 120 forks

Ability to have the url/title displayed on .maff/.html archive opened by any browser #126

Closed sj365 closed 4 years ago

sj365 commented 5 years ago

This function is generally present in most page saver extensions allowing the saved title/url/date to be displayed on opening the previously saved webpage. If I have missed this functionality in the current version please advise where/how this can be implemented.

danny0838 commented 4 years ago

This feature requires injecting additional DOM elements into the original document and could potentially break faithfulness. A good design would be required.

sj365 commented 4 years ago

Danny, I don't understand. Save Page WE overlays a banner with the URL and date, and SingleFile overlays a little graphic that you click to see the page URL and date, both without altering the faithfulness of the page. Is this impossible in WebScrapBook?

danny0838 commented 4 years ago

WebScrapBook is focused on preserving faithful and organizable data from web pages and the like. Page metadata such as source URL, capture date, title, etc. are recorded in the captured file, and can be viewed from the sidebar or the index page generated by the site indexer. You can also see them by opening the source code of the captured page and looking at the <html> element.
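For readers who want to pull that metadata out of a captured file without the sidebar, here is a minimal Python sketch. It assumes the source URL and capture time are stored as `data-scrapbook-source` and `data-scrapbook-create` attributes on the root element; those attribute names are an assumption of this example and are not stated in the thread, so check a captured file before relying on them.

```python
from html.parser import HTMLParser

# Assumed attribute prefix; verify against an actual captured page.
PREFIX = "data-scrapbook-"


class CaptureMetadataParser(HTMLParser):
    """Collect data-scrapbook-* attributes of <html> plus the <title> text."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            for name, value in attrs:
                if name.startswith(PREFIX):
                    # Store "data-scrapbook-source" under the key "source", etc.
                    self.metadata[name[len(PREFIX):]] = value
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            # Title text may arrive in several chunks, so accumulate it.
            self.metadata["title"] = self.metadata.get("title", "") + data


def read_capture_metadata(html_text):
    """Return a dict of capture metadata found in the given HTML source."""
    parser = CaptureMetadataParser()
    parser.feed(html_text)
    parser.close()
    return parser.metadata
```

Calling `read_capture_metadata(open("page.html", encoding="utf-8").read())` would return a dictionary such as `{"source": ..., "create": ..., "title": ...}` when those attributes are present, and a smaller dictionary otherwise.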

On the other hand, an explicit bar is basically redundant information, and since it is permanently injected into the captured page, we need to design it carefully so that it is compatible with as many browsers as possible and won't mask the original page content because of a malfunction or an inability to close/hide it. It's not impossible, but it needs careful design, and it is not a high priority compared with other challenges we're facing.

You might consider why you need this feature so much. If you never use the sidebar or the site indexer, and simply want an easy, quick solution for a single-HTML copy of a web page, SingleFile and Save Page WE may be more suitable tools for you.

sj365 commented 4 years ago

Danny, thank you for replying. Yes, I simply need a 99%+ solution for a single-HTML copy of a web page. As a student who regularly uses webpage resources for projects and news archiving, it is essential to have accurate documentation about where the data is sourced. Since I am more than a casual user of these types of programs, I am keenly aware of issues that affect the saved page and am quick to notify the creators not to criticize them, but to help them help me.

Mozilla Archive Format was able to achieve the 99%+ accuracy I need, but it cannot be used in current Firefox versions, which is why I have searched so hard for a suitable replacement. I have personally checked every application in the Chrome Store and Firefox add-ons, and the top three are WebScrapBook, SingleFile, and Save Page WE. The issue has been that not one of these programs can reach 99%+; they each have their own issues with particular documents/webpage formatting. You can check on GitHub to see that I have contributed several crucial bug reports for all three programs; again, to help them help me.

I have found that WebScrapBook would likely be the top choice MAF replacement due to the highest page accuracy and ability to open .maff formats, except that it does not display source information. I understand what you have said about the sidebar or site indexer, and as the developer it is your option to make the program what you want. From my user perspective, I would always ask for what I need. Using the sidebar or indexer would be overkill if I just want to open the .html and know the source details along with the article. I'm not sure what you mean about a change to the page faithfulness due to an indicator of some sort since the other programs accomplish it somehow without a change. Also, I would think that most other programs provide this information due to them recognizing it is an important feature. Why don't I use them then? Yours is better for the most part and I would rather use just one application than having to rotate between three for any particular page which is what I do now since no program except MAF has saved all pages correctly.

url date.zip

In either case, thank you for your time and effort developing this product. Hopefully, you can see the merits of my case.

danny0838 commented 4 years ago

Thank you for the additional feedback. It is true that there are currently very few tools that can save and read a MAF file besides WSB...

One thing to clarify is that there are basically two different ways of implementing an info bar for an archive page:

  1. Show an info bar when reading an archive page.
  2. Save an explicit info bar in the archive page.

The key difference is that approach 1 only works when the browser has an extension installed, and only when opening an archive page that was initially captured by a supported tool (usually the extension itself). There's no info bar if the user opens the archive page with another browser.

OTOH, approach 2 should work for any browser that opens the archive page. But, as we've previously mentioned, faithfulness and cross-browser compatibility need careful consideration. A good design is also required, as the info bar is static and permanent. Additionally, it would be even more complicated for MAFF/HTZ than for HTML, as page JavaScript does not work in WSB's built-in archive page viewer. And we'd like to know: how do you usually view the archived MAFF page? The built-in archive page viewer? Or PyWebScrapBook or a supporting script?

The MAF add-on uses approach 1, Save Page WE uses approach 2, and SingleFile seems to provide both. What WSB currently does is something like approach 1, but the user has to explicitly open the page from the sidebar.
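To make the trade-off of approach 2 concrete, here is a minimal Python sketch of writing a static info bar into a saved HTML file. The function name, the `capture-info-bar` class, and the inline styling are all hypothetical choices for illustration; this is not how WSB, Save Page WE, or SingleFile actually implement it. Note that the bar becomes part of the document itself, which is exactly the faithfulness concern raised above.

```python
import html
import re

# Matches the opening <body> tag, with or without attributes.
BODY_OPEN = re.compile(r"<body\b[^>]*>", re.IGNORECASE)


def add_static_info_bar(page_html, source_url, captured_at):
    """Return page_html with a static info bar inserted right after <body>."""
    bar = (
        '<div class="capture-info-bar" '
        'style="background:#ffc;padding:4px;font:12px sans-serif">'
        f'Saved from <a href="{html.escape(source_url, quote=True)}">'
        f'{html.escape(source_url)}</a> on {html.escape(captured_at)}'
        '</div>'
    )
    match = BODY_OPEN.search(page_html)
    if match is None:
        # No <body> tag found: fall back to prepending the bar.
        return bar + page_html
    return page_html[:match.end()] + bar + page_html[match.end():]
```

Because the bar is plain markup, it shows up in any browser without an extension, but it also shifts the page layout and cannot be dismissed unless extra script or CSS is added.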

Which exact approach are you requesting, and which do you prefer?

danny0838 commented 4 years ago

Another thing I'm curious about is that you mentioned that the MAF add-on can achieve 99%+ accuracy when capturing a web page, while other extensions, including WSB, cannot. Can you be more specific? What can MAF do that WSB cannot when it comes to capturing a web page?

Anyway, web technology is evolving quickly these days. The MAF add-on may have done quite well in its day, but it probably can't work well with modern web pages. For example, capturing shadowRoot (which you have requested) is very difficult to deal with, and MAF simply doesn't support it. HTML5 elements such as audio, video, canvas, image srcset, and some CSS3/4 features are also things MAF cannot deal with. Unfortunately, we probably cannot go back to the MAF add-on if we want a good capture of a modern web page today...

sj365 commented 4 years ago

Again, thanks for spending time with this request. I will try to address your questions:

ways of implementing an info bar for an archive page:

saved with spw The-New-York-Times.zip

I've also tried to test whether the current Chrome 80 or Firefox 73.01 is able to view the info bar of .maff files created by MAF FF52. Chrome can view them, but doesn't show the info bar. Firefox cannot be tested because WSB is unable to save/open WSB or MAF .maff files at this time (see included error msgs).

webscrapbk options.zip

Normally, I would use the WSB built-in archive page viewer to open .maff files in Chrome and Firefox (post version 52). I don't use the sidebar at all because last year, when I first tried the other scrapbook functions, it didn't work for me, so I just stuck to saving as .html, which is simpler and universal.

If an info bar could be included on .maff files opened with Chrome or Firefox (only for browsers which install WSB), that would be great.

MAF accuracy:

I only mentioned MAF because for Firefox that was the gold standard. It makes sense it would work very well since it was designed by someone who was directly involved with Firefox. I agree with you that as time goes on it will not maintain accuracy due to changing web standards. It is currently having trouble with graphics for certain pages like Twitter. I only use it now as a backup in FFx52 in case the current page savers don't work on a page.

I hope this info helps.

danny0838 commented 4 years ago

@sj365 We finally found a way to deal with all the potential compatibility issues and added an option to insert an infobar in 0.82.0. Unfortunately, the infobar requires JavaScript, which is not supported by the built-in archive page viewer. If you use HTZ/MAFF and want the infobar, you have to view them with a technique that supports JavaScript, such as opening the unzipped page, the PyWSB viewer, or the PyWSB backend server.
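The "open unzipped page" route can be scripted. The following Python sketch unpacks an HTZ or MAFF file and returns the entry page(s) to open in a browser. It assumes the usual layouts: an HTZ archive with `index.html` at its root, and a MAFF archive with one sub-folder per page, each containing an `index.*` file. The function name is hypothetical, and this is not the PyWSB viewer; only extract archives you trust.

```python
import tempfile
import zipfile
from pathlib import Path

PAGE_SUFFIXES = {".html", ".htm", ".xhtml"}


def extract_entry_pages(archive_path, target_dir=None):
    """Unzip an HTZ/MAFF archive and return a list of entry page paths."""
    if target_dir:
        target = Path(target_dir)
    else:
        target = Path(tempfile.mkdtemp(prefix="wsb-"))
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(target)

    # HTZ (assumed layout): index.html sits at the archive root.
    root_index = target / "index.html"
    if root_index.is_file():
        return [root_index]

    # MAFF (assumed layout): one sub-folder per page, each with index.*
    pages = []
    for sub in sorted(p for p in target.iterdir() if p.is_dir()):
        for candidate in sorted(sub.glob("index.*")):
            if candidate.suffix.lower() in PAGE_SUFFIXES:
                pages.append(candidate)
                break
    return pages
```

A returned path can then be opened with the standard library, for example `webbrowser.open(pages[0].resolve().as_uri())`, so that the page's JavaScript (and therefore the infobar) runs in a regular browser tab.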