cpsdqs / cohost-dl

download cohost onto your computer
MIT License
47 stars, 5 forks

downloading >100k liked posts makes the program unhappy #14

Open mrkranston opened 2 weeks ago

mrkranston commented 2 weeks ago

When i tried to run cohost-dl, i eventually got this error:

GET https://cohost.org/rc/liked-posts?refTimestamp=1727056241374&skipPosts=145240
GET https://cohost.org/rc/liked-posts?refTimestamp=1727056241374&skipPosts=145260
error: Uncaught (in promise) RangeError: Invalid string length
  await ctx.write("liked.json", JSON.stringify(liked));
                                     ^
    at JSON.stringify (<anonymous>)
    at file:///F:/downloads/cohost-dl-main/main.ts:48:44
    at eventLoopTick (ext:core/01_core.js:175:7)
PS F:\downloads\cohost-dl-main>

I'm inclined to believe that i do actually have ~140k liked posts, since i just loaded one of the URLs it printed, and it shows some posts from 2022. I'm not sure if it matters for this particular case, but i am using Windows for this.

cpsdqs commented 2 weeks ago

wow… I didn’t know that error existed. I feel like I should warn you: as far as I can tell, 1000 posts seem to correspond to something around 1 GB of downloaded size, meaning your download would have a size on the order of 100 GB.
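
As a quick sanity check on that estimate: the last request in the log used skipPosts=145260, and at roughly 1 GB per 1000 posts that works out to about 145 GB. The snippet below only does this arithmetic. The helper name is made up for illustration, and the ratio is the ballpark figure from this comment, not a measurement.

```typescript
// Rough size estimate, assuming ~1 GB of downloaded data per 1000 posts
// (the ballpark quoted in this thread, not a measured value).
const GB_PER_1000_POSTS = 1;

// Hypothetical helper: estimated download size in GB for a given post count.
function estimateDownloadGb(postCount: number): number {
  return (postCount / 1000) * GB_PER_1000_POSTS;
}

// 145260 is the last skipPosts offset seen in the log before the crash.
console.log(estimateDownloadGb(145260).toFixed(0) + " GB"); // prints "145 GB"
```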

I am seeing several resolutions:

  1. A new option that stores likes more efficiently and allows this code to succeed. If you have the disk space…?
    • I could also add an option to skip downloading some or all images, which could help the disk space problem
  2. A new option that makes it stop loading liked-posts after reaching some maximum number
  3. Not downloading any of them. To do this, create a liked.json file inside the out directory and put [] in it
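
For option 3, the file can be created by hand in any text editor. As a minimal sketch, the same thing in TypeScript could look like the following. The helper name is hypothetical, and it assumes the script is run from the cohost-dl directory so that "out" is the output directory mentioned above. Note that it overwrites any existing liked.json.

```typescript
import { mkdirSync, readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical helper: create <outDir>/liked.json containing an empty JSON
// array, as described in option 3. Overwrites the file if it already exists.
function writeEmptyLikes(outDir: string): string {
  mkdirSync(outDir, { recursive: true }); // create the directory if missing
  const file = join(outDir, "liked.json");
  writeFileSync(file, "[]");
  return file;
}

const likedFile = writeEmptyLikes("out");
console.log(likedFile, "contains", readFileSync(likedFile, "utf8"));
```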

mrkranston commented 2 weeks ago

Option 1 or 2 sounds good :3 maybe something else to consider for option 1 would be some kind of rate limit for large downloads such as this. I don't want to crash any servers by making requests like "give me 100,000 posts ASAP please". Perhaps option 2 is more feasible, but i will leave that up to your judgement

cpsdqs commented 2 weeks ago

Alright, I’ve decided to just improve the file format that likes are stored in. It should theoretically work now with the latest commit (63d1070)… I can’t test this because I don’t have that many posts. I hope it doesn’t crash afterwards?
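
For context, the RangeError above comes from JSON.stringify trying to build the whole list as one string, which exceeds the JavaScript engine's maximum string length. One general way around that is to serialize each post separately and write one JSON value per line. The sketch below illustrates that idea only. It is not necessarily what commit 63d1070 does, and the function and file names are invented for the example.

```typescript
import { closeSync, openSync, readFileSync, writeSync } from "node:fs";

// Write one JSON document per line so that no single string ever has to hold
// the whole collection. Names here are illustrative, not from cohost-dl.
function writeJsonLines(path: string, items: Iterable<unknown>): number {
  const fd = openSync(path, "w");
  let count = 0;
  try {
    for (const item of items) {
      // Each item is stringified on its own, so strings stay small.
      writeSync(fd, JSON.stringify(item) + "\n");
      count++;
    }
  } finally {
    closeSync(fd);
  }
  return count;
}

// Read the file back line by line. This sketch loads the whole file at once
// for brevity; a very large file would need a streaming reader instead.
function readJsonLines(path: string): unknown[] {
  return readFileSync(path, "utf8")
    .split("\n")
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line));
}

const written = writeJsonLines("liked.example.jsonl", [{ postId: 1 }, { postId: 2 }]);
console.log(written, "items written,", readJsonLines("liked.example.jsonl").length, "read back");
```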

You can also slow down your requests with a new REQUEST_DELAY_SECS option.
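
To illustrate what a per-request delay does, here is a minimal sketch of a paginated download loop that pauses between requests. Everything in it is an assumption for illustration: fetchPage stands in for the real request, requestDelaySecs mirrors the idea of REQUEST_DELAY_SECS, and the loop is not taken from cohost-dl's source.

```typescript
// Resolve after the given number of seconds.
function sleep(secs: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, secs * 1000));
}

// Hypothetical paginated download loop. fetchPage returns the posts at a
// given offset; an empty page means there is nothing left to fetch.
async function fetchAllWithDelay<T>(
  fetchPage: (skipPosts: number) => Promise<T[]>,
  requestDelaySecs: number,
): Promise<T[]> {
  const all: T[] = [];
  let skipPosts = 0;
  while (true) {
    const page = await fetchPage(skipPosts);
    if (page.length === 0) break;
    all.push(...page);
    skipPosts += page.length;
    await sleep(requestDelaySecs); // pause before the next request
  }
  return all;
}

// Usage with a fake data source: 45 posts served in pages of 20.
const fakePosts = Array.from({ length: 45 }, (_, i) => i);
const fakeFetch = async (skip: number) => fakePosts.slice(skip, skip + 20);
fetchAllWithDelay(fakeFetch, 0.01).then((posts) => console.log(posts.length)); // prints 45
```

A longer delay makes the download slower but spreads the load on the server, which matters for a run of this size.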

Note: I am sort of expecting something to crash for sure if you reach “generating index for all posts,” because that’s going to try and create a full-text search index of every post you’ve downloaded, which would probably be like 1 GB in size if it succeeds at all, and kill every browser you open it in. If it crashes before that though, then I will consider that a fixable problem

mrkranston commented 2 weeks ago

I'll let you know how it works out when i get home :3 thank you for the fast response (and for making the tool in the first place)

mrkranston commented 2 weeks ago

Ran it overnight, and it managed to generate a ~1 GB file full of the liked posts. It ran out of memory in certain spots after that, but i was able to just restart it and it picked up where it left off. After a few tries it's clear that there's one particular point where it runs out of memory, but i'm not entirely sure if it's at the point it starts making the full-text search:

~~ cohost source version 3c7903d6

compiling Javascript: post-index
compiling Javascript: post-page

<--- Last few GCs --->

[16320:000001FE32A64000] 71201 ms: Mark-Compact (reduce) 1360.1 (1390.8) -> 1347.3 (1350.6) MB, pooled: 0 MB, 14.61 / 0.00 ms (+ 118.4 ms in 0 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 143 ms)
...

cohostdlerror3.txt

cpsdqs commented 2 weeks ago

No, it’s definitely not supposed to be crashing there. The simplest solution to that would be to just give it more memory, I think? Instead of deno run ..., use deno run --v8-flags=--max-old-space-size=8192 ... for an 8 GB memory limit, for example
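
The value passed to --max-old-space-size is in megabytes, so 8192 corresponds to 8 GB. The sketch below builds the flag for a given size and reports the heap limit the current process is running with, which is one way to check that the flag took effect. The helper names are made up, and it assumes the runtime's node:v8 compatibility module provides getHeapStatistics.

```typescript
import { getHeapStatistics } from "node:v8";

// Build the Deno flag for a heap limit given in GB. V8 expects the value of
// --max-old-space-size in megabytes.
function maxOldSpaceFlag(gb: number): string {
  return `--v8-flags=--max-old-space-size=${gb * 1024}`;
}

console.log(maxOldSpaceFlag(8)); // prints "--v8-flags=--max-old-space-size=8192"

// Report the heap limit of the current process, in MB (assumes node:v8
// compatibility is available in the runtime).
function heapLimitMb(): number {
  return Math.round(getHeapStatistics().heap_size_limit / (1024 * 1024));
}

console.log(`heap limit: ${heapLimitMb()} MB`);
```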

mrkranston commented 2 weeks ago

it seems to have gotten to the point where it's downloading posts, which is a good sign i think! we'll have to see if 12GB of memory is enough to get it to compile the index. i'll let you know how it turns out when it is finished

cpsdqs commented 2 weeks ago

good luck downloading (checks notes) 2% of cohost.org! lol

it might take… a few days