ArchiveBox / ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
https://archivebox.io
MIT License
21.04k stars 1.12k forks

Feature Request: Add SingleFile CLI option to fire scroll event and load deferred images to better support WeChat archiving #318

Open KagurazakaShirosatosu opened 4 years ago

KagurazakaShirosatosu commented 4 years ago

Type

What is the problem that your feature request solves

I archived a page from the WeChat Open Platform (such as https://mp.weixin.qq.com/s/ri4nDgPQo4OVWaIWG9EQZA), but none of the images on the page were archived. (https://archive.sager.wang/archive/1580637904/mp.weixin.qq.com/s/ri4nDgPQo4OVWaIWG9EQZA.html)

The title also shows as "Unable to detect page title" in the ArchiveBox index.

WeChat is the biggest IM in China and it has the strictest censorship there, so I am hoping ArchiveBox can archive the page with images.

I am sorry for my bad English :-)

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

I hope ArchiveBox can archive pages (with images) from the WeChat Open Platform.

What hacks or alternative solutions have you tried to solve the problem?

archive.is can archive pages from the WeChat Open Platform with images.

How badly do you want this new feature?


wych42 commented 4 years ago

Unable to detect page title

This happens because WeChat media platform (article) pages have an empty title field.</p> <p>Maybe ArchiveBox could try to parse the og:title and twitter:title fields to get the title.</p> <p>title screenshot <a href="https://www.dropbox.com/s/k90irdvcdicl986/Screenshot%202020-03-22%2023.04.16.png?dl=0">https://www.dropbox.com/s/k90irdvcdicl986/Screenshot%202020-03-22%2023.04.16.png?dl=0</a></p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/cdvv7788"><img src="https://avatars.githubusercontent.com/u/5531776?v=4" />cdvv7788</a> commented <strong> 4 years ago</strong> </div> <div class="markdown-body"> <p>@pirate the title is not present; is it OK to look in the <code>og:title</code> tag as a fallback? (The retrieval is stopping because of this.) As for the images, they are being lazy-loaded. I am not sure if wget can handle that, but that is something we can check after the title issue is fixed.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/pirate"><img src="https://avatars.githubusercontent.com/u/511499?v=4" />pirate</a> commented <strong> 4 years ago</strong> </div> <div class="markdown-body"> <p>Yeah, you can look in og:title, but let's not handle the image lazy loading right now; that's a very complex problem.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/hope1"><img src="https://avatars.githubusercontent.com/u/3541701?v=4" />hope1</a> commented <strong> 2 years ago</strong> </div> <div class="markdown-body"> <p>Greetings. I wonder if it is possible at this point to revisit this problem?</p> <p>Given the prominence of WeChat (and the most comprehensive censoring, as OP also mentioned), I'd wager that articles on mp.weixin.qq.com are probably the most common target for archiving for users in China. 
It certainly is the case for me, as most of the articles I have felt a need to archive are on there. It would be wonderful to have ArchiveBox available for this usage, especially as the archive.* sites now block all web host proxy users.</p> <p>Apologies for digging up an old thread and thank you all for your hard work.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/pirate"><img src="https://avatars.githubusercontent.com/u/511499?v=4" />pirate</a> commented <strong> 2 years ago</strong> </div> <div class="markdown-body"> <p>I recommend seeing if they can be archived with SingleFile, and if not, raising the issue on that repo. ArchiveBox does not itself do any archiving, it's just a collection of other utilities that do the actual archiving. If there are issues with archive fidelity, the general fix is to raise them with the sub-utilities or add a new extractor.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/KagurazakaShirosatosu"><img src="https://avatars.githubusercontent.com/u/13681639?v=4" />KagurazakaShirosatosu</a> commented <strong> 2 years ago</strong> </div> <div class="markdown-body"> <p>Hi, I found that mp.weixin.qq.com can be captured by SingleFile, including pictures, if I scroll down the entire webpage manually and wait for all pictures to finish loading. I think ArchiveBox could scroll to the bottom of the page, wait for networkidle0, and then call SingleFile to capture it.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/pirate"><img src="https://avatars.githubusercontent.com/u/511499?v=4" />pirate</a> commented <strong> 2 years ago</strong> </div> <div class="markdown-body"> <p>ArchiveBox does not have granular control over how the SingleFile capture is done; we only call the SingleFile CLI. 
Unless they provide a CLI option to scroll before capturing, we cannot do that.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/canoziia"><img src="https://avatars.githubusercontent.com/u/54797411?v=4" />canoziia</a> commented <strong> 1 year ago</strong> </div> <div class="markdown-body"> <p>Hello, I found that the SingleFile CLI has an option <code>--load-deferred-images-dispatch-scroll-event</code>. When it is enabled, lazy-loaded images can be saved perfectly (at least on WeChat pages). <a href="https://github.com/gildas-lormeau/single-file-cli/blob/master/args.js#L206">https://github.com/gildas-lormeau/single-file-cli/blob/master/args.js#L206</a></p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/pirate"><img src="https://avatars.githubusercontent.com/u/511499?v=4" />pirate</a> commented <strong> 1 year ago</strong> </div> <div class="markdown-body"> <p>That's a great find! 🥳 Thanks. Let's add that option to ArchiveBox by default then.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/melyux"><img src="https://avatars.githubusercontent.com/u/10296053?v=4" />melyux</a> commented <strong> 1 year ago</strong> </div> <div class="markdown-body"> <p>On a lot of my pages, turning <code>--load-deferred-images-dispatch-scroll-event</code> to true causes some "subscribe to my newsletter" popup to come up, while having it off prevents this and still loads deferred images. 
So it probably shouldn't be a default.</p> <p>This is all moot though because the bundled singlefile is outdated and doesn't support deferred images at all right now</p> </div> </div> </body> </html>