eloquence / freeyourstuff.cc

freeyourstuff.cc - universal content liberation
Creative Commons Zero v1.0 Universal

Page crashes after 4,500 Quora answers downloaded #97

Open vttoth opened 5 years ago

vttoth commented 5 years ago

I am trying to download my Quora answers (over 6,000). When the count gets to 4505, the page crashes (Chrome shows the "Aw, Snap!" message).

eloquence commented 5 years ago

Thanks for the report, trying to reproduce now.

eloquence commented 5 years ago

Hi Viktor, the browser crash was most likely due to running out of memory. I can see if I can improve memory management during the extension run, but in the meantime, ensuring sufficient free memory before attempting the download should fix the issue.

How much is sufficient? I was able to download 6,072 answers for you on a machine with 16GB RAM, but I did have to end all other applications before it would go all the way through (it did crash on the first run, with an out of memory error from the operating system).

I do have the 6,072 answers in JSON format if that would be helpful and would be happy to email them to you, just ping me at eloquence AT gmail DOT com.

vttoth commented 5 years ago

Thanks for the response. This machine (Windows 10 64-bit) actually has 32 GB of RAM, yet Chrome crashes (with plenty of free RAM remaining). But I was able to download my stuff just fine on a Linux machine (also 32 GB). In fact, I just came back to report that fact when I saw your message. Thanks for the quick support!

vttoth commented 5 years ago

I am now running into the same issue on Linux, too. After a little less than 4000 answers, aw, snap, says Chrome. Chrome is up to date and the machine has 32 GB of RAM. The same issue occurs on a machine with a lot less memory, and on Windows 10 and Windows Server 2016. Suggestion (forgive me if it is just incompatible with the plugin architecture): Would it be possible to break up the download into, say, 1000-answer chunks?

vttoth commented 5 years ago

I should have added, Chrome is up-to-date and all other plugins were disabled.

eloquence commented 5 years ago

Thanks for the report, and sorry you're now experiencing this issue on all machines. Unfortunately I don't see an obvious way to split the download into chunks. We're basically pretending to keep paging through the content the way a user would, and there does not appear to be any support for offsets in Quora's internal APIs, at least not in a way that I can determine from the highly obfuscated nature of the network requests.

There is one other technical avenue that could work: the https://www.quora.com/content set of pages, which is at least segmented by year. The downsides:

Since that's a possible dead end, and hard for me to test, I'm not going down that road yet, but I would encourage others to try that approach as well.

As far as I can tell, the problem with our current approach is that memory usage keeps growing with each request, even though elements are removed from the DOM as we go. I suspect there are standard optimization techniques we can use to make sure the process frees up more memory as it goes, which would then reduce the "Aw, Snap!" likelihood dramatically. That seems the most fruitful avenue to dig into further, but it would take a few hours of research, so it will take me a while to get to it.
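
One standard technique of that kind is to hand results off in fixed-size chunks rather than accumulating everything in one growing array. The sketch below is illustrative only: `Answer`, `createChunkedCollector`, and the `sink` callback are hypothetical names, not part of the extension. It also only bounds what the collecting code itself holds on to; it does nothing about memory held by the page's own scripts.

```typescript
// Hypothetical shape of one downloaded answer; not the extension's real schema.
interface Answer {
  id: string;
  text: string;
}

// Hypothetical collector interface for this sketch.
interface ChunkedCollector {
  add(answer: Answer): void;
  flush(): void;
  pending(): number;
}

// Buffers answers and passes them to `sink` in chunks of `chunkSize`, so the
// collector never references more than `chunkSize` answers at a time.
function createChunkedCollector(
  chunkSize: number,
  sink: (chunk: Answer[]) => void // assumed to persist the chunk somewhere
): ChunkedCollector {
  if (chunkSize < 1) {
    throw new Error("chunkSize must be at least 1");
  }
  let buffer: Answer[] = [];

  // Hands off the buffered answers and starts a fresh array, dropping the
  // collector's own references to everything that was flushed.
  function flush(): void {
    if (buffer.length === 0) {
      return;
    }
    const chunk = buffer;
    buffer = [];
    sink(chunk);
  }

  function add(answer: Answer): void {
    buffer.push(answer);
    if (buffer.length >= chunkSize) {
      flush();
    }
  }

  return { add, flush, pending: () => buffer.length };
}
```

A caller would `add` each answer as it is scraped and call `flush` once at the end to hand off the final partial chunk. Whether the flushed memory is actually reclaimed depends on the sink not keeping the chunks around.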

If you yourself are interested in poking at the extension, and would like a code walkthrough, please do let me know, and I'd be happy to assist with that.

eloquence commented 5 years ago

I did a bit more poking today to see if I can do anything in the extension itself to improve memory usage.

Unfortunately, my preliminary investigation suggests that the increased memory usage as we load more and more answers is caused by the code that Quora itself runs. Beyond just rendering the answers, it holds references to them in memory, which the extension cannot clear out.

Your best bet right now is to use Quark, which is a Firefox extension. It doesn't let you publish your answers to FreeYourStuff.cc, but it does let you download them: https://addons.mozilla.org/en-US/firefox/addon/quark/

Quark does what I'm suggesting above, which is to spider https://www.quora.com/content year-by-year and answer-by-answer. That approach is much less prone to memory leaks. I've taken a quick look at the code, and it doesn't look like it's doing anything evil. :)
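
The year-by-year, answer-by-answer traversal can be sketched roughly as follows. This is not Quark's code, and nothing here reflects the real structure of the https://www.quora.com/content pages: `ContentSource`, `listAnswerUrls`, and `fetchAnswer` are hypothetical stand-ins for whatever actually lists and loads answers, and they are synchronous only to keep the sketch short.

```typescript
// Hypothetical record for one saved answer.
interface ArchivedAnswer {
  url: string;
  body: string;
}

// Hypothetical abstraction over the per-user content pages. A real
// implementation would load pages asynchronously in the browser.
interface ContentSource {
  listAnswerUrls(year: number): string[];
  fetchAnswer(url: string): string;
}

// Walks the years in order and processes one answer at a time, passing each
// to `onAnswer` as soon as it is loaded instead of collecting them all first.
// Returns the number of answers processed.
function spiderByYear(
  source: ContentSource,
  firstYear: number,
  lastYear: number,
  onAnswer: (answer: ArchivedAnswer) => void
): number {
  let count = 0;
  for (let year = firstYear; year <= lastYear; year++) {
    const urls = source.listAnswerUrls(year);
    for (const url of urls) {
      onAnswer({ url: url, body: source.fetchAnswer(url) });
      count++;
    }
  }
  return count;
}
```

Because each answer is handed to `onAnswer` individually, the caller decides how much to keep in memory, for example by writing every answer out before the next one is loaded.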

The biggest problem with this approach is still that I can't easily test it with accounts other than my own, as https://www.quora.com/content is per-user, whereas the "answers" URL is public. Since I only have a few answers, I'm worried that if I switch to that approach, I'll lose the ability to test. That said, it may be worth offering it as an experimental option at least.

In any case, if you haven't already done so, it would be useful if you could give Quark a spin and let me know if it works on your account.