Closed: God-damnit-all closed this issue 4 years ago
Ugh. I'll see about removing that dependency entirely.
FWIW, if you restart it, it should resume fine where it left off.
The issue does say that a workaround is to reuse the object, so if you can figure out a way for the object to not be discarded and to just keep getting reused, that would work too.
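A minimal sketch of that reuse workaround: instead of constructing a fresh browser/session object per request and discarding it, cache one instance and hand it back every time. `FakeBrowser` and `get_browser()` are invented stand-ins for whatever constructor the scraper actually uses, just to show the pattern:

```python
import functools

class FakeBrowser:
    """Stand-in for the real browser object; counts how often it's built."""
    instances = 0

    def __init__(self):
        FakeBrowser.instances += 1

@functools.lru_cache(maxsize=1)
def get_browser():
    # lru_cache(maxsize=1) means the constructor runs exactly once;
    # every later call returns the same cached object instead of
    # building (and leaking) a new one.
    return FakeBrowser()

for _ in range(100):
    browser = get_browser()  # same object every time

print(FakeBrowser.instances)  # → 1
```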
I mean, sure, but MechanicalSoup is kind of silly anyways. It's basically just a thingie that has some ease-of-access methods (which I already have in my own stuff), and a bit that remembers the referrer (which I don't, but that would take like 10 lines to implement).
The reason I have that dependency is because the random twitter scraper thingie I found used it, and I'm lazy and didn't want to have to rewrite the xpath stuff.
Ah, okay. Best to remove it then.
I think that should fix it, assuming the issue isn't in the alternative HTML parser API that requests-html uses.
I left that in, mostly because xpath is annoying.
Hmm. I think that helped? It's hard to tell without letting it run for a while. I will say that dAGet is using up quite a lot of memory too, but because it doesn't have nearly as much work to do, I hadn't noticed until now. It bloats up to 1.7GB in just 25 minutes. TwitGet is only at 425MB right now, but I wonder if that's because of how often it sleeps.
Does the methodology used by dAGet and TwitGet have something in common?
I've read it may be helpful to use the decompose method. I see it used in SfGet but not anywhere else that's currently functional.
As long as I don't keep the bs4 Soup objects around, their memory use isn't relevant, since they're disposed of as soon as they go out of scope.
The reason I'm doing decompose stuff in sfget is part of cleaning up the contents I extract to save as the image description, not because of memory reasons.
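The "disposed of as soon as they go out of scope" claim can be demonstrated with a weakref. This is CPython-specific behavior (reference counting frees the object the moment the last reference disappears; other interpreters may free it later), and `Soup` here is just an invented stand-in for a bs4 soup object:

```python
import weakref

class Soup:
    """Stand-in for a bs4 BeautifulSoup object."""
    pass

def parse_page():
    soup = Soup()
    ref = weakref.ref(soup)
    return ref  # the local 'soup' goes out of scope when we return

ref = parse_page()
# On CPython, refcounting has already freed the object by this point,
# so the weak reference is dead -- no lingering Soup objects.
print(ref() is None)  # → True
```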
How are you measuring memory usage? The instance of twitget I'm running is using just 117M of real memory, though it's allocated 462M of virtual memory, and that's after a number of hours. DA is using 157M (363M virtual).
The web process, interestingly, has 1862M of virtual memory allocated, but only 64M actually resident.
Certainly, the peak memory usage is going to generally be at minimum 2-3X the size of the largest HTTP request response it receives, but that's generally only an issue when you have 500+ megabyte responses, assuming a reasonable memory limit (the VM I run for this has 8 GB of RAM allocated to it).
Now, these numbers are on linux, and I'm not sure about windows aside from the fact that I really don't trust just about anything the new W10 task manager says. Process Explorer (https://docs.microsoft.com/en-us/sysinternals/downloads/process-explorer) is actually much more useful, and it can optionally replace the task manager completely on windows.
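For completeness, Python's stdlib tracemalloc module can measure allocations from inside the interpreter, independent of what Task Manager or Process Hacker report (it only sees Python-level allocations, which is part of why its numbers won't match the OS tools). A minimal sketch using a fake 10 MB response body:

```python
import tracemalloc

tracemalloc.start()

# Fake 10 MB HTTP response body, plus one transient decoded copy --
# roughly the "2x the largest response" peak described above.
body = b"x" * (10 * 1024 * 1024)
text = body.decode("ascii")

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(peak >= 2 * len(body))  # peak is at least ~2x the response size
```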
I'm using something similar called Process Hacker v3
As you can see, the virtual size is much larger.
Here, this is from Process Explorer which I grabbed real quick.
Huh. Weird. No idea what's going on, but if you're not seeing a steady increase in memory usage, it's probably just python holding onto stuff. Python is fairly reluctant to release memory back to the OS (it has its own internal allocator, actually).
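You can watch that internal allocator at work with `sys.getallocatedblocks()`. After `del`, the blocks go back to Python's allocator (pymalloc), but the underlying arenas are not necessarily returned to the OS, which is why resident/virtual size as seen by OS tools can stay high even when Python has "freed" the memory:

```python
import sys

before = sys.getallocatedblocks()
data = [object() for _ in range(100_000)]   # allocate a pile of objects
during = sys.getallocatedblocks()
del data                                    # release them again
after = sys.getallocatedblocks()

print(during - before > 90_000)  # block count grew while the list was alive
print(after < during)            # blocks returned to pymalloc after del --
                                 # but not necessarily to the OS
```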
When I go to the Handles tab in Process Explorer, I'm seeing it constantly populating the list with more of this:
Very few handles are being closed but tons are being opened.
that looks like it's just continuously re-reading the local HTTPS certificates. I'm not sure why it's doing that, but unless it's keeping the file-handles open, it should be harmless.
File handles are being properly opened and closed, but I think this is where the memory issue is coming from. According to the quota charges, none of these handles are going to the virtual pool.
if the handles are being closed, they'll no longer be relevant once they're closed, but I've not looked at python on windows under super long running execution before.
Forgive me if this is ignorant, but I noticed that the TwitGet functions end in yield, and I've read that this essentially suspends the function so the process can come back to it later.
But I don't actually see any way for it to actually conclude...?
> if the handles are being closed, they'll no longer be relevant once they're closed, but I've not looked at python on windows under super long running execution before.
I was referring specifically to the file handles. The key handles never close.
> Forgive me if this is ignorant, but I noticed that the TwitGet functions end in yield, and I've read that this essentially suspends the function so the process can come back to it later.
> But I don't actually see any way for it to actually conclude...?
Nah, what it means is that it returns an iterator.
What it means is effectively "return this value, but resume here on next iteration".
When gen_tweets_for_date_span() or gen_tweets() run out of tweets, they break out of their loops, which raises a special exception, StopIteration, and that causes the containing loop to terminate.
Basically, what happens is that gen_tweets_for_date_span(), get_recent_tweets() and get_all_tweets() return objects that look like lists, but don't generate all their contents ahead of time.
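A toy illustration of that pattern (the function name mirrors the scraper's gen_tweets(), but this body is invented for the example):

```python
def gen_tweets():
    """Toy generator: yields a few fake tweets, then simply ends."""
    tweets = ["tweet-1", "tweet-2", "tweet-3"]
    for t in tweets:
        yield t  # "return this value, but resume here on next iteration"
    # Falling off the end (or hitting `break`/`return`) raises StopIteration.

gen = gen_tweets()   # nothing runs yet; we just get an iterator
print(next(gen))     # → tweet-1
print(next(gen))     # → tweet-2
print(next(gen))     # → tweet-3

try:
    next(gen)        # the generator is exhausted
except StopIteration:
    print("a containing for-loop would terminate here")

# A for-loop (or list()) catches StopIteration for you:
print(list(gen_tweets()))  # → ['tweet-1', 'tweet-2', 'tweet-3']
```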
TL;DR iterators can be confusing and look weird.
I see.
I wonder if it would help to use the REQUESTS_CA_BUNDLE env variable and point it to my cURL's ca store.
I'm not sure why python doesn't just read the cert bundles at the start, and just cache it.
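The "read once and cache it" fix could be as simple as memoizing the bundle read. To be clear, `load_ca_bundle()` below is a hypothetical sketch of the idea, not what the ssl module actually does internally; the point is just that the second call never touches the filesystem, so no new handles get opened:

```python
import functools
import os
import tempfile

@functools.lru_cache(maxsize=None)
def load_ca_bundle(path):
    # Hypothetical: read the cert bundle from disk once; every later
    # call for the same path is served from the cache, opening no
    # new file handle.
    with open(path, "rb") as f:
        return f.read()

# Demo with a throwaway file standing in for the CA store.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"-----BEGIN CERTIFICATE-----\n")
    path = tmp.name

first = load_ca_bundle(path)
second = load_ca_bundle(path)        # cache hit, no second read
print(first is second)               # → True
print(load_ca_bundle.cache_info().hits)  # → 1
os.unlink(path)
```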
OTOH, fixing that is so far down the issue rabbit hole I'm not sure what to do. If this is indeed the issue, you've found an actual python runtime bug.
Mind you, this wouldn't be the first actual python bug I've encountered, but still. Whooo!
Joy of joys. And setting REQUESTS_CA_BUNDLE doesn't seem to have any effect.
I should look into whether 3.8 is ready for the big leagues yet; maybe that'll fix the issue.
Also I guess you got tricked into doing work you could've put off until later, my apologies.
Eh, I'm super lazy, but easy to goad into doing things that should probably be done anyways.
What did you use to notice the leaking handles?
Process Hacker v3 (v2 is the stable version that no one uses because it's ancient now)
I think it's been fixed in 3.7.6, memory usage is now at normal levels.
Well that's good to know. I wonder what the bug was.
@fake-name Right now I'm running TwitGet and I'm at 10GB memory being used by one python process.
Suspected issue: https://github.com/MechanicalSoup/MechanicalSoup/issues/253