Closed: God-damnit-all closed this issue 4 years ago
Yeah, this is definitely on my todo list.
> Since it starts from the past and works its way toward the present,
Uh, it doesn't? It starts with the current date, and works back towards the date the user joined twitter, or at least it should.
> Uh, it doesn't? It starts with the current date, and works back towards the date the user joined twitter, or at least it should.
Err... I just assumed it worked that way because it doesn't seem as efficient to do it the other way around.
If you work toward the present you only have to store where it left off, but if you work toward the past you have to store where it started and where it left off.
My general intention was that on first run, it would work from the current date back to the time the targeted user joined twitter (I pull that out).
After that, each subsequent run would just cover from the current date back to whenever that user was last scraped (perhaps with some overlap).
Until a complete walk of a user is accomplished once, halting will restart the process.
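The incremental design described above might look roughly like this. This is a hedged sketch, not the project's actual code: `get_join_date`, `fetch_tweets_between`, and the in-memory `last_scraped` dict are all illustrative placeholders.

```python
# Illustrative sketch of the windowed-scrape design described above.
# All function names here are placeholders, not TwitGet's real API.
import datetime

last_scraped = {}  # username -> datetime when the last full walk completed

OVERLAP = datetime.timedelta(hours=1)  # re-fetch a little to avoid gaps

def scrape_window(user, get_join_date, fetch_tweets_between):
    now = datetime.datetime.utcnow()
    if user in last_scraped:
        # Subsequent run: only cover the gap since the last scrape,
        # with a small overlap in case tweets landed mid-run.
        start = last_scraped[user] - OVERLAP
    else:
        # First run: walk all the way back to the account creation date.
        start = get_join_date(user)
    tweets = fetch_tweets_between(user, start, now)
    last_scraped[user] = now  # only update once the walk completes
    return tweets
```

The important property is that a user's window collapses to "since last run" once a complete walk has happened, so steady-state runs are cheap.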
Yeah, I know it pulls the joined date. I don't really see an advantage to doing it the current way, other than that you get the more recent stuff first. You have to start over for a user (or keep a date range saved) instead of just storing the most recent tweet downloaded, which can easily be resumed from regardless of whether a full crawl has been done yet.
Uh, that's a valid argument.
Mostly, I already track the last time I scraped each artist, so this can bolt onto that without much work.
OTOH, I think I've thought about the approach more in this discussion than I did in the actual implementation of the thing.
Holy shit, it's still going. Is it looping? I don't understand.
Depending on how many tweets people have, it wouldn't surprise me.
It's running a single thread, and fetches 50 tweets every 5 seconds. If someone has 20K tweets, that's 2000 seconds, or 33 minutes for a single target. If you have 154 people you're scraping, that's 85 hours.
Now, not everyone has 20K tweets, but there are other limiting factors in terms of rate as well.
Frankly, for the volume of people I scrape, it wouldn't surprise me if it took weeks for the first run.
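The arithmetic in the estimate above works out as follows (a quick back-of-the-envelope calculation using the numbers quoted in this thread):

```python
# Back-of-the-envelope runtime estimate from the numbers above:
# 50 tweets per fetch, one fetch every 5 seconds, single-threaded.
tweets_per_fetch = 50
seconds_per_fetch = 5
tweets_per_user = 20_000   # worst-case account size from the example
users = 154

seconds_per_user = tweets_per_user / tweets_per_fetch * seconds_per_fetch
minutes_per_user = seconds_per_user / 60        # about 33 minutes
total_hours = seconds_per_user * users / 3600   # about 85 hours
```

So the "weeks for the first run" guess is plausible once you add in the other rate limits on top of the per-fetch sleep.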
I feel like 154 people should not be taking this long. It has done a full run in the past, so I'm not entirely sure why it's taking this long this time.
Then again, I think that was before you implemented the new code that makes it not miss any tweets, but still. I had plenty of this stuff scraped prior.
Yeah, right now it doesn't help much if stuff is already retrieved.
You can speed stuff up by turning up the active thread count, but I don't know how sensitive, if at all, twitter is about scraping their stuff. A lot of the timeouts are fairly conservative because I don't know what twitter's rate limits are (if any).
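A configurable worker count with a conservative per-request pause, as described above, might be sketched like this. `THREAD_COUNT` and `FETCH_DELAY` are illustrative knobs, not the scraper's real settings:

```python
# Hedged sketch: a worker pool where the thread count and the pause
# between fetches are both tunable. Values here are illustrative.
import queue
import threading
import time

THREAD_COUNT = 1   # turn this up to scrape several users in parallel
FETCH_DELAY = 5.0  # conservative pause per fetch (twitter's limits unknown)

def worker(jobs, fetch_page, delay):
    while True:
        try:
            user = jobs.get_nowait()
        except queue.Empty:
            return  # no more work
        fetch_page(user)
        time.sleep(delay)  # stay well under any rate limit

def run(users, fetch_page, threads=THREAD_COUNT, delay=FETCH_DELAY):
    jobs = queue.Queue()
    for u in users:
        jobs.put(u)
    pool = [threading.Thread(target=worker, args=(jobs, fetch_page, delay))
            for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
```

With unknown rate limits, keeping the delay generous and the default thread count at 1 is the safe choice; raising `threads` trades politeness for speed.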
I've emailed you an error I've seen popping up lately, I'm concerned it might be damaging the overall retrieval process.
It finally finished this morning, thank goodness. I was getting worried. I'm still wondering about that error I emailed you though, I'm concerned it will lead to tweets being skipped over.
> After that, each subsequent run would just cover from the current date back to whenever that user was last scraped (perhaps with some overlap).
Sadly, this doesn't seem to have worked. I started the scraper up again right after my last comment and it's been running since. I'm seeing it scrape the accounts of users it scraped when it successfully completed last run.
Oh, I haven't actually done that part yet. That's my intended design, not the current implementation (which is maximally dumb).
Fortunately, I have some time off for the holidays, so I should be able to get that done soon.
The error you e-mailed me about is probably harmless. I suspect it's not handling deleted tweets properly, but I haven't had time to investigate.
> You can speed stuff up by turning up the active thread count,
This doesn't seem to work. I checked the running processes and the thread counts with only the twitter scraper running. It doesn't seem like it's going any faster and there's only one thread for it running.
You are correct. I had forgotten I had overridden the threading stuff because that will allow me to access the artist run time stuff. Gah.
\<Insert wet noises>
Ok, it should now remember where it left off. I also ran the fetch the other direction, mostly because it makes things neater (I can continuously update the fetch progress as it runs, so if things get interrupted it'll resume properly).
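Persisting the cursor after every page, so an interrupted run resumes cleanly, could look something like this. The JSON state file and function names are illustrative, not the actual implementation:

```python
# Sketch of the resume behaviour described above: write the fetch
# cursor to disk after every page so an interrupted run picks up
# where it left off. File layout and names are illustrative.
import json
import os

def load_progress(path):
    if os.path.exists(path):
        with open(path) as fp:
            return json.load(fp)
    return {}

def save_progress(progress, path):
    with open(path, "w") as fp:
        json.dump(progress, fp)

def fetch_user(user, total_pages, path, fetch_page):
    progress = load_progress(path)
    start = progress.get(user, 0)       # resume cursor, 0 for a new user
    for page in range(start, total_pages):
        fetch_page(user, page)
        progress[user] = page + 1       # advance the cursor...
        save_progress(progress, path)   # ...and persist it immediately
```

Since fetching runs in the old-to-new direction here, the cursor only ever moves forward, which is what makes the resume logic this simple.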
@fake-name If I may make a suggestion: if a username hits an error, the scraper should move on to the next username, so it can retry the part that errored later (useful for debugging, like with that error I emailed you).
Other than errors, the net blipping out might also cause a chunk of tweets to get skipped.
I'm scraping 154 people on Twitter. According to Process Hacker, my latest run started 13 hours and 30 minutes ago. It's still going strong, I'm running it in a console I can easily monitor.
But I think the sleeping it has to do, combined with going through everyone's entire twitter history every single run (even though it's impossible to update old tweets on Twitter), just makes it take forever. Since it starts from the past and works its way toward the present, would it be reasonable to request that it track where it left off for each user (probably writing it during the sleep)?
My biggest concern is missing new submissions. Sometimes tweets get taken down within a few hours (the most common reason being a commissioner throwing a fit). If TwitGet were faster, I could schedule it to run 1 hour after it last finished via the script I made (instead of scheduling it by system time).
Also, having up to 10% of my CPU being taken up all day kinda sucks.