fake-name / xA-Scraper

TwitGet takes a very, very, very, very long time for one run #78

Closed: God-damnit-all closed this issue 4 years ago

God-damnit-all commented 4 years ago

I'm scraping 154 people on Twitter. According to Process Hacker, my latest run started 13 hours and 30 minutes ago. It's still going strong; I'm running it in a console I can easily monitor.

But I think the sleeping it has to do, combined with going through everyone's entire Twitter history every single run (even though it's impossible to update old tweets on Twitter), just makes it take forever. Since it starts from the past and works its way toward the present, would it be reasonable to request that it track where it left off for each user (probably writing it during the sleep)?

My biggest concern is missing new submissions. Sometimes tweets get taken down within a few hours (the most common reason being a commissioner throwing a fit). If TwitGet were faster, I could schedule it to run one hour after the previous run finished, through the script I made (instead of scheduling it via system time).

Also, having up to 10% of my CPU being taken up all day kinda sucks.

fake-name commented 4 years ago

Yeah, this is definitely on my todo list.

fake-name commented 4 years ago

> Since it starts from the past and works its way toward the present,

Uh, it doesn't? It starts with the current date, and works back towards the date the user joined twitter, or at least it should.

God-damnit-all commented 4 years ago

> Uh, it doesn't? It starts with the current date, and works back towards the date the user joined twitter, or at least it should.

Err... I just assumed it worked that way because it doesn't seem as efficient to do it the other way around.

If you work toward the present you only have to store where it left off, but if you work toward the past you have to store where it started and where it left off.
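
To illustrate with a hypothetical checkpoint record (the names here are made up for the example, not the scraper's actual storage):

```python
# Hypothetical checkpoint shapes -- illustrative only, not xA-Scraper's schema.

# Walking oldest-to-newest: a single value fully describes progress,
# and it stays valid across runs.
forward_state = {"newest_tweet_seen": "2019-12-01T00:00:00Z"}

# Walking newest-to-oldest: one value is ambiguous, so both ends of the
# current walk have to be tracked until the walk completes.
backward_state = {
    "walk_started_at": "2019-12-01T00:00:00Z",
    "walked_back_to": "2017-03-14T00:00:00Z",
}
```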

fake-name commented 4 years ago

My general intention was that on first run, it would work from the current date back to the time the targeted user joined twitter (I pull that out).

After that, each subsequent run would just cover from the latest date back to whenever that user was last scraped (perhaps with some overlap).

Until a complete walk of a user has been accomplished once, halting will restart the process.
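
A rough sketch of that intended flow; the helper names (`get_last_scrape_time`, `get_join_date`, `fetch_range`, `set_last_scrape_time`) are hypothetical stand-ins, not the scraper's real internals:

```python
import datetime

OVERLAP = datetime.timedelta(hours=12)  # re-walk a little recent history, just in case

def scrape_user(user):
    now = datetime.datetime.utcnow()
    last_scraped = get_last_scrape_time(user)  # hypothetical; None before first full walk

    if last_scraped is None:
        # First run: walk from the present all the way back to the join date.
        fetch_range(user, newest=now, oldest=get_join_date(user))
    else:
        # Later runs: only the window since the last scrape, plus some overlap.
        fetch_range(user, newest=now, oldest=last_scraped - OVERLAP)

    set_last_scrape_time(user, now)
```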

God-damnit-all commented 4 years ago

Yeah, I know it pulls the joined date. I don't really see an advantage to doing it the current way other than getting the more recent stuff first. You have to start over for a user (or keep a date range saved) instead of just storing the most recent tweet downloaded, which can easily be resumed from regardless of whether a full crawl has been done yet.

fake-name commented 4 years ago

Uh, that's a valid argument.

Mostly, I already track the last time I scraped each artist, so this can bolt onto that without much work.

OTOH, I think I've thought about the approach more in this discussion than I did in the actual implementation of the thing.

God-damnit-all commented 4 years ago

Holy shit, it's still going. Is it looping? I don't understand.

fake-name commented 4 years ago

Depending on how many tweets people have, it wouldn't surprise me.

It's running a single thread, and fetches 50 tweets every 5 seconds. If someone has 20K tweets, that's 2000 seconds, or 33 minutes for a single target. If you have 154 people you're scraping, that's 85 hours.
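
Spelling that arithmetic out:

```python
tweets = 20_000   # tweets for one heavy target
per_fetch = 50    # tweets fetched per request
delay_s = 5       # pause between fetches
users = 154

per_user_s = tweets / per_fetch * delay_s  # 2,000 s, about 33 minutes
total_h = per_user_s * users / 3600        # about 85.5 hours
print(f"{per_user_s / 60:.0f} min per user, {total_h:.1f} h for {users} users")
```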

Now, not everyone has 20K tweets, but there are other limiting factors in terms of rate as well.

Frankly, for the volume of people I scrape, it wouldn't surprise me if it took weeks for the first run.

God-damnit-all commented 4 years ago

I feel like 154 people should not be taking this long. It has done a full run in the past, so I'm not entirely sure why it's taking this long this time.

Then again, I think that was before you implemented the new code that makes it not miss any tweets, but still. I had plenty of this stuff scraped prior.

fake-name commented 4 years ago

Yeah, right now it doesn't help much if stuff is already retrieved.

You can speed stuff up by turning up the active thread count, but I don't know how sensitive, if at all, twitter is about scraping their stuff. A lot of the timeouts are fairly conservative because I don't know what twitter's rate limits are (if any).
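
The trade-off, in generic terms (a sketch only, not the project's actual worker code):

```python
import time
from concurrent.futures import ThreadPoolExecutor

FETCH_DELAY = 5  # per-thread pause; kept conservative since Twitter's limits are unknown

def scrape_user(user):
    # Stand-in for the real per-user fetch loop.
    print("scraping", user)
    time.sleep(FETCH_DELAY)

users = ["user_a", "user_b", "user_c"]  # stand-in target list

# More workers finish the list sooner, but the aggregate request rate against
# Twitter rises proportionally, which is what risks tripping any rate limits.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(scrape_user, users))
```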

God-damnit-all commented 4 years ago

I've emailed you an error I've seen popping up lately; I'm concerned it might be damaging the overall retrieval process.

God-damnit-all commented 4 years ago

It finally finished this morning, thank goodness. I was getting worried. I'm still wondering about that error I emailed you, though; I'm concerned it will lead to tweets being skipped over.

God-damnit-all commented 4 years ago

> After that, each subsequent run would just cover from the latest date back to whenever that user was last scraped (perhaps with some overlap).

Sadly, this doesn't seem to have worked. I started the scraper up again right after my last comment and it's been running ever since. I'm seeing it re-scrape the accounts of users it already covered when the last run completed successfully.

fake-name commented 4 years ago

Oh, I haven't actually done that part yet. That's my intended design, not the current implementation (which is maximally dumb).

Fortunately, I have some time off for the holidays, so I should be able to get that done soon.

The error you e-mailed me about is probably harmless. I suspect it's not handling deleted tweets properly, but I haven't had time to investigate.

God-damnit-all commented 4 years ago

> You can speed stuff up by turning up the active thread count,

This doesn't seem to work. I checked the running processes and the thread counts with only the twitter scraper running. It doesn't seem like it's going any faster, and there's only one thread running for it.

fake-name commented 4 years ago

You are correct. I had forgotten I had overridden the threading stuff, because that lets me access the artist runtime stuff. Gah.

<Insert wet noises>

fake-name commented 4 years ago

Ok, it should now remember where it left off. I also made the fetch run the other direction, mostly because it makes things neater (I can continuously update the fetch progress as it runs, so if things get interrupted it'll resume properly).
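
Presumably something along these lines, with the checkpoint persisted every batch so an interruption loses at most one batch (helper names hypothetical again):

```python
def walk_user(user, db):
    # Resume from the saved cursor, or start at the join date on a fresh walk.
    cursor = db.get_progress(user) or get_join_date(user)  # hypothetical helpers

    while True:
        batch, cursor = fetch_tweets_after(user, cursor)  # oldest-to-newest page
        if not batch:
            break
        store_tweets(batch)
        db.set_progress(user, cursor)  # written every batch, so a crash resumes here
```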

God-damnit-all commented 4 years ago

@fake-name If I may make a suggestion: if a username hits an error, the scraper should move on to the next username, so it can retry the part that errored on a later run (useful for debugging, like with that error I emailed you).
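
i.e. wrap each user in its own try/except so one failure doesn't stall the rest of the run (sketch only; `walk_user` and the target list are stand-ins):

```python
import logging

log = logging.getLogger("twit_get")

def scrape_all(users):
    failed = []
    for user in users:
        try:
            walk_user(user)  # hypothetical per-user walk, as sketched above
        except Exception as exc:
            # Log and move on; the saved checkpoint means the failed
            # span gets retried automatically on the next run.
            log.warning("Scrape failed for %s: %s", user, exc)
            failed.append(user)
    if failed:
        log.info("%d users to retry next run: %s", len(failed), failed)
```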

God-damnit-all commented 4 years ago

Other than errors, the net blipping out might also cause a chunk of tweets to get skipped.