fake-name / xA-Scraper


TwitGet causing extremely high memory use #77

Closed: God-damnit-all closed this issue 4 years ago

God-damnit-all commented 4 years ago

@fake-name Right now I'm running TwitGet and I'm at 10GB memory being used by one python process.

Suspected issue: https://github.com/MechanicalSoup/MechanicalSoup/issues/253

fake-name commented 4 years ago

Ugh. I'll see about removing that dependency entirely.

fake-name commented 4 years ago

FWIW, if you restart it, it should resume fine where it left off.

God-damnit-all commented 4 years ago

The issue does say that a workaround is to reuse the object, so if you can figure out a way for the object to not be discarded and just keep getting reused, that would work too.
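(For illustration, a minimal sketch of that reuse workaround, assuming the per-fetch object in question is a MechanicalSoup StatefulBrowser:)

```python
import mechanicalsoup

# Create one browser up front and reuse it, instead of constructing a
# new StatefulBrowser per fetch. Each fresh instance builds new parser
# and session state, which is what the linked issue blames for the
# memory growth.
browser = mechanicalsoup.StatefulBrowser()

def fetch(url):
    # Reusing the same object keeps one session alive rather than
    # accumulating discarded ones faster than they get collected.
    return browser.open(url)
```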

fake-name commented 4 years ago

I mean, sure, but MechanicalSoup is kind of silly anyways. It's basically just a thingie that has some ease-of-access methods (which I already have in my own stuff), and a bit that remembers the referrer (which I don't, but which would take like 10 lines to implement).

The reason I have that dependency is because the random twitter scraper thingie I found used it, and I'm lazy and didn't want to have to rewrite the xpath stuff.
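(A hedged sketch of that roughly-ten-line referrer bit on top of requests; the class name here is made up, not from the codebase:)

```python
import requests

class RefererSession:
    """Thin wrapper around requests.Session that sends the previously
    fetched URL as the Referer header on each subsequent request."""

    def __init__(self):
        self.session = requests.Session()
        self.last_url = None

    def get(self, url, **kwargs):
        headers = kwargs.pop("headers", {})
        if self.last_url:
            headers["Referer"] = self.last_url
        resp = self.session.get(url, headers=headers, **kwargs)
        self.last_url = url
        return resp
```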

God-damnit-all commented 4 years ago

Ah, okay. Best to remove it then.

fake-name commented 4 years ago

I think that should fix it, assuming the issue isn't in the alternative HTML parser API requests-html uses.

I left that in, mostly because xpath is annoying.

God-damnit-all commented 4 years ago

Hmm. I think that helped? It's hard to tell without letting it run for a while. I will say that dAGet is using quite a lot of memory too, but because it doesn't have nearly as much work to do, I hadn't noticed until now. It bloats up to 1.7GB in just 25 minutes. TwitGet is only at 425MB right now, but I wonder if that's because of how often it sleeps.

Does the methodology used by dAGet and TwitGet have something in common?

God-damnit-all commented 4 years ago

I've read it may be helpful to use the decompose method. I see it used in SfGet but not anywhere else that's currently functional.

fake-name commented 4 years ago

As long as I don't keep the bs4 Soup objects around, their memory use isn't relevant, since they're disposed of as soon as they go out of scope.

The reason I'm doing decompose stuff in sfget is part of cleaning up the contents I extract to save as the image description, not because of memory reasons.
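(A minimal sketch of that kind of content cleanup; the HTML and tag choices are made up, and the point of decompose() here is clean output text, not reclaiming memory:)

```python
from bs4 import BeautifulSoup

html = "<div>Art of a fox <script>junk()</script><i>ad text</i></div>"
soup = BeautifulSoup(html, "html.parser")

# Strip tags we don't want in the saved description. decompose()
# detaches the tag from the tree and destroys it.
for tag in soup.find_all(["script", "i"]):
    tag.decompose()

description = soup.get_text(strip=True)
print(description)  # -> "Art of a fox"
# Once `soup` goes out of scope it is garbage collected anyway.
```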

How are you measuring memory usage? The instance of twitget I'm running is using just 117M of real memory, though it's allocated 462M of virtual memory, and that's after a number of hours. DA is using 157M (363M virtual).

The web process, interestingly, has 1862M of virtual memory allocated, but only 64M actually resident.

Certainly, the peak memory usage is generally going to be at minimum 2-3X the size of the largest HTTP response it receives, but that's generally only an issue when you have 500+ megabyte responses, assuming a reasonable memory limit (the VM I run this on has 8 GB of RAM allocated to it).
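(If that peak ever did matter, the usual mitigation, shown here as a sketch rather than anything the scraper currently does, is to stream large responses instead of buffering them:)

```python
import requests

# stream=True defers reading the body; iterating in chunks caps peak
# memory at roughly the chunk size instead of 2-3X the full response.
with requests.get("https://example.com/big-file", stream=True) as resp:
    with open("big-file.bin", "wb") as fh:
        for chunk in resp.iter_content(chunk_size=1024 * 256):
            fh.write(chunk)
```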

Now, these numbers are on Linux, and I'm not sure about Windows, aside from the fact that I really don't trust just about anything the new W10 task manager says. Process Explorer (https://docs.microsoft.com/en-us/sysinternals/downloads/process-explorer) is actually much more useful, and it can optionally replace the task manager completely on Windows.

God-damnit-all commented 4 years ago

I'm using something similar called Process Hacker v3.

[screenshot: Process Hacker v3]

As you can see, the virtual size is much larger.

Here, this is from Process Explorer, which I grabbed real quick.

[screenshot: Process Explorer]

fake-name commented 4 years ago

Huh. Weird. No idea what's going on, but if you're not seeing a steady increase in memory usage, it's probably just Python holding onto stuff. Python is fairly reluctant to release memory back to the OS (it has its own internal allocator, actually).
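(One way to separate Python-level allocations from what the OS reports, a sketch using only the stdlib tracemalloc module, not something the scraper ships with:)

```python
import tracemalloc

tracemalloc.start()

# ... run a few scrape cycles here ...

current, peak = tracemalloc.get_traced_memory()
print(f"python-level: {current / 2**20:.1f} MiB now, {peak / 2**20:.1f} MiB peak")
# If these stay flat while the task manager number climbs, the growth
# is the allocator/OS holding pages, not a leak in Python objects.
```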

God-damnit-all commented 4 years ago

When I go to the Handles tab in Process Explorer, I'm seeing it constantly populating the list with more of this:

https://streamable.com/8btyr

Very few handles are being closed but tons are being opened.

fake-name commented 4 years ago

That looks like it's just continuously re-reading the local HTTPS certificates. I'm not sure why it's doing that, but unless it's keeping the file handles open, it should be harmless.

God-damnit-all commented 4 years ago

File handles are being properly opened and closed, but I think this is where the memory issue is coming from. According to the quota charges, none of these handles are going to the virtual pool.

fake-name commented 4 years ago

If the handles are being closed, they'll no longer be relevant once they're closed, but I've not looked at Python on Windows under very long-running execution before.

God-damnit-all commented 4 years ago

Forgive me if this is ignorant, but I noticed that the TwitGet functions end in yield, and I've read that this essentially suspends the function so the process can come back to it later.

But I don't actually see any way for it to conclude...?

God-damnit-all commented 4 years ago

> If the handles are being closed, they'll no longer be relevant once they're closed, but I've not looked at Python on Windows under very long-running execution before.

I was referring specifically to the file handles. The key handles never close.

fake-name commented 4 years ago

> Forgive me if this is ignorant, but I noticed that the TwitGet functions end in yield, and I've read that this essentially suspends the function so the process can come back to it later.
>
> But I don't actually see any way for it to conclude...?

Nah, what it means is that the function returns an iterator.

yield effectively means "return this value, but resume here on the next iteration".

When gen_tweets_for_date_span() or gen_tweets() run out of tweets, they break, which raises a special exception, StopIteration, that terminates the containing loop.

Basically, gen_tweets_for_date_span(), get_recent_tweets() and get_all_tweets() return objects that look like lists, but don't generate all their contents ahead of time.

TL;DR iterators can be confusing and look weird.
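(A minimal runnable sketch of the pattern, simplified from the real functions:)

```python
def gen_items(pages):
    # Each yield means "return this value, but resume here on the
    # next iteration" instead of ending the function.
    for page in pages:
        if not page:
            break  # exhaustion raises StopIteration under the hood
        for item in page:
            yield item

# The for-loop drives the iterator and ends cleanly when the
# generator's internal StopIteration fires:
for tweet in gen_items([["a", "b"], ["c"], []]):
    print(tweet)  # prints a, b, c
```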

God-damnit-all commented 4 years ago

I see.

I wonder if it would help to use the REQUESTS_CA_BUNDLE env variable and point it at my cURL CA store.
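(For reference, a sketch of one way to set that, with a hypothetical bundle path:)

```python
import os
import requests

# Hypothetical path; point it at wherever the curl CA bundle lives.
os.environ["REQUESTS_CA_BUNDLE"] = r"C:\curl\curl-ca-bundle.crt"

# requests reads the env var (via trust_env) when each request is made,
# so it takes effect without touching the scraper's code.
print(requests.get("https://example.com").status_code)
```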

fake-name commented 4 years ago

I'm not sure why Python doesn't just read the cert bundles once at startup and cache them.

OTOH, fixing that is so far down the issue rabbit hole I'm not sure what to do. If this is indeed the issue, you've found an actual Python runtime bug.

Mind you, this wouldn't be the first actual python bug I've encountered, but still. Whooo!

God-damnit-all commented 4 years ago

Joy of joys. And setting REQUESTS_CA_BUNDLE doesn't seem to have any effect.

I should look into whether 3.8 is ready for the big leagues yet; maybe that'll fix the issue.

God-damnit-all commented 4 years ago

Also, I guess you got tricked into doing work you could've put off until later; my apologies.

fake-name commented 4 years ago

Eh, I'm super lazy, but easy to goad into doing things that should probably be done anyways.

fake-name commented 4 years ago

What did you use to notice the leaking handles?

God-damnit-all commented 4 years ago

Process Hacker v3 (v2 is the stable version that no one uses because it's ancient now).

God-damnit-all commented 4 years ago

I think it's been fixed in 3.7.6; memory usage is now at normal levels.

fake-name commented 4 years ago

Well that's good to know. I wonder what the bug was.