Open KagurazakaShirosatosu opened 4 years ago
Unable to detect page title
It's caused by WeChat media platform (article) pages having an empty <title> tag.
Maybe ArchiveBox could try parsing the og:title and twitter:title fields to get the title.
title screenshot https://www.dropbox.com/s/k90irdvcdicl986/Screenshot%202020-03-22%2023.04.16.png?dl=0
@pirate the title is not present, is it ok to look in the og:title tag as a fallback? (the retrieval is stopping because of this)
About the images, they are being lazy-loaded. I am not sure if wget can handle that, but that is something we can check after the title issue is fixed.
Yeah, you can look in og:title, but let's not handle the image lazy loading right now; that's a very complex problem.
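For reference, the fallback agreed on above could look roughly like the sketch below. It uses only Python's standard library and is not ArchiveBox's actual title extractor; the names `TitleParser` and `extract_title` are hypothetical, and checking og:title before twitter:title is an assumed order.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the <title> text plus og:title / twitter:title meta values."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_chunks = []
        self.meta_titles = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            # Open Graph tags normally use property=, Twitter cards use name=.
            key = attrs.get("property") or attrs.get("name")
            content = (attrs.get("content") or "").strip()
            if key in ("og:title", "twitter:title") and content:
                # Keep the first non-empty value seen for each key.
                self.meta_titles.setdefault(key, content)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_chunks.append(data)


def extract_title(html):
    """Return the <title> text, else og:title, else twitter:title, else None."""
    parser = TitleParser()
    parser.feed(html)
    title = "".join(parser.title_chunks).strip()
    return (
        title
        or parser.meta_titles.get("og:title")
        or parser.meta_titles.get("twitter:title")
    )
```

With a page whose <title> is empty but which carries an og:title meta tag, `extract_title` returns the og:title value instead of giving up.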
Greetings. I wonder if it is possible at this point to revisit this problem?
Given the prominence of Wechat (and the most comprehensive censoring, as OP also mentioned), I'd wager that articles on mp.weixin.qq.com are probably the most common target for archiving for users in China. It certainly is the case for me, as most of the articles I have felt a need to archive are on there. It would be wonderful to have ArchiveBox available for this usage, especially as the archive.* sites now block all web host proxy users.
Apologies for digging up an old thread and thank you all for your hard work.
I recommend seeing if they can be archived with SingleFile, and if not, raising the issue on that repo. ArchiveBox does not itself do any archiving, it's just a collection of other utilities that do the actual archiving. If there are issues with archive fidelity in general, the fix is to raise them with the sub-utilities or add a new extractor.
Hi, I found that mp.weixin.qq.com can be captured by SingleFile, including pictures, if I scroll down the entire webpage manually and wait for all pictures to finish loading. I think ArchiveBox could scroll to the bottom of the page, wait for networkidle0, and then call SingleFile to capture it.
ArchiveBox does not have granular control over how the singlefile capture is done, we only call the SingleFile CLI. Unless they provide a CLI option to scroll before capturing, we cannot do that.
Hello, I found that the SingleFile CLI has an option --load-deferred-images-dispatch-scroll-event. When it is enabled, lazy-loaded images can be saved perfectly (at least on WeChat's pages).
https://github.com/gildas-lormeau/single-file-cli/blob/master/args.js#L206
That's a great find! 🥳 Thanks. Let's add that option to ArchiveBox by default then.
On a lot of my pages, turning --load-deferred-images-dispatch-scroll-event to true causes some "subscribe to my newsletter" popup to come up, while having it off prevents this and still loads deferred images. So it probably shouldn't be a default.
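One way to square the two positions above is to keep the flag off by default and let users opt in. The sketch below only builds the argument list. The function `build_singlefile_cmd` and its `dispatch_scroll_event` toggle are hypothetical and not existing ArchiveBox settings, the `single-file` binary name is an assumption, and so is the `=true` value syntax, which may differ between SingleFile CLI versions.

```python
def build_singlefile_cmd(url, output_path, dispatch_scroll_event=False,
                         binary="single-file"):
    """Build the SingleFile CLI argument list for one capture.

    `dispatch_scroll_event` is a hypothetical opt-in toggle; it stays off by
    default because the scroll event can trigger popups on some pages.
    """
    cmd = [binary]
    if dispatch_scroll_event:
        # Flag name comes from the thread; the "=true" form is an assumption.
        cmd.append("--load-deferred-images-dispatch-scroll-event=true")
    # Positional arguments: the page URL, then the output file.
    cmd += [url, output_path]
    return cmd


if __name__ == "__main__":
    print(build_singlefile_cmd(
        "https://mp.weixin.qq.com/s/ri4nDgPQo4OVWaIWG9EQZA",
        "singlefile.html",
        dispatch_scroll_event=True,
    ))
```

The resulting list could be handed to `subprocess.run`; whether the bundled SingleFile version accepts the flag at all would still need checking.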
This is all moot though, because the bundled SingleFile is outdated and doesn't support deferred images at all right now.
Type
What is the problem that your feature request solves
I archived a page from the WeChat Open Platform (such as https://mp.weixin.qq.com/s/ri4nDgPQo4OVWaIWG9EQZA), but I found that none of the images on the page were archived (https://archive.sager.wang/archive/1580637904/mp.weixin.qq.com/s/ri4nDgPQo4OVWaIWG9EQZA.html), and the title also shows "Unable to detect page title" in the ArchiveBox index.
WeChat is the biggest IM in China and it has the strictest censorship there. So I am hoping ArchiveBox can archive the page with images.
I am sorry for my bad English :-)
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
I hope ArchiveBox can archive pages (with images) from the WeChat Open Platform.
What hacks or alternative solutions have you tried to solve the problem?
archive.is can archive pages from the WeChat Open Platform with images.
How badly do you want this new feature?