f4bD3v / humanitas

A price prediction toolset for developing countries
BSD 3-Clause "New" or "Revised" License

Getting historical tweets by user #3

Closed. mstefanro closed this issue 10 years ago.

mstefanro commented 10 years ago

Stuff found by Fabian:

http://gnip.com/pages/twitter-data/
http://topsy.com/s?q=can%27t%20afford%20food&window=a&type=tweet
http://snapbird.org/
https://github.com/remy/snapbird
http://topsy.com/tweets

mstefanro commented 10 years ago

@radhaus Can you get the user ID of someone for whom the Twitter API returns "permission denied" when you request their tweets, and then try accessing their profile directly through the web? Let me know if it works.

radhaus commented 10 years ago

I actually already tried that. The answer is no: if the response says the data is private, it's private through a browser too. The effect I showed you yesterday was just the rate limit being exceeded after the first ~280 requests.

mstefanro commented 10 years ago

But we can still get a large number of userids whose tweets are public, right?

radhaus commented 10 years ago

Yes. The (overwhelming) bottleneck is extracting the tweets themselves.

grill commented 10 years ago

With this site you can do a more sophisticated search (better than Twitter's own) of the last 1000 days on Twitter (+analytics) and download everything as CSV (tweets [I think only the IDs??], word counts, other analytics). Looks cool so far, but direct API access seems to cost $500: http://gr.peoplebrowsr.com/

Fabian said he will also ask at http://datasift.com/platform/historics/

We could also try querying through Google (date range; operators: and, or, except, ...), since Twitter's own search only goes back about 10 days, but filtering by location is probably not possible: https://encrypted.google.com/search?q=site%3Atwitter.com+%22looking+for%22+OR+%22anyone+help%22+OR+%22%28would+OR+can%29+you+recommend%22+OR+%22anyone+%28recommend+OR+know%29%22+OR+%22anyone+recommend%22+OR+%22anyone+recommend%22+-RT&hl=en&prmdo=1&biw=1920&bih=943&sa=X&ei=4R88U9e0H4m54ASB34CABw&ved=0CBkQpwUoBg&source=lnt&tbs=sbd%3A1%2Ccdr%3A1%2Ccd_min%3A4%2F9%2F2013%2Ccd_max%3A8%2F9%2F2013&tbm=#hl=en&q=site%3Atwitter.com+%22meal%22+OR+%22food%22+-RT&tbs=sbd:1%2Ccdr:1%2Ccd_min:4%2F9%2F2013%2Ccd_max:8%2F9%2F2013

mstefanro commented 10 years ago

@grill

Restricting by location is the problem. Otherwise there are many historical datasets of tweets:

https://archive.org/details/twitterstream
http://snap.stanford.edu/data/
http://www.infochimps.com/datasets/twitter-census-hashtags-urls-smileys-by-day
http://topsy.com/tweets

And as far as I know Google Search doesn't have an API anyway; we'd have to scrape it.

mstefanro commented 10 years ago

@radhaus

Can you please push your userid extraction code?

f4bD3v commented 10 years ago

Aleks just sent this link: https://archive.org/details/twitterstream

mstefanro commented 10 years ago

Yes, I mentioned it in my previous post as well, along with a few other sources. The problem with all of them is that the data doesn't come only from India, and I don't know whether we can filter it by location.

mstefanro commented 10 years ago

@radhaus I guess this technique has a chance of working. Can you finish your implementation?

Some notes:

We get ~300 queries per 15-minute window and each query returns ~200 tweets, so in principle ~60k tweets per window. Even if we only sustain half that rate, that'd ideally be ~2.8 million tweets/day.
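For concreteness, a sketch of the collection loop those numbers assume, using tweepy (the credentials are placeholders and the pagination details are my assumption, not the final script):

```python
import tweepy

# Placeholder credentials -- substitute real app keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
# wait_on_rate_limit makes tweepy sleep through each 15-minute window.
api = tweepy.API(auth, wait_on_rate_limit=True)

def collect_timeline(screen_name, max_tweets=3200):
    """Fetch up to ~3200 of a user's most recent tweets, 200 per request."""
    tweets = []
    max_id = None
    while len(tweets) < max_tweets:
        kwargs = {"screen_name": screen_name, "count": 200}
        if max_id is not None:
            kwargs["max_id"] = max_id
        page = api.user_timeline(**kwargs)
        if not page:
            break
        tweets.extend(page)
        max_id = page[-1].id - 1  # continue below the oldest tweet seen
    return tweets
```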

radhaus commented 10 years ago

Well, I agree with your plan, but I think your estimates are a bit optimistic.

On the one hand, I did manage to harvest ~30K tweets from a small-time account, @pythoncentral. I think because it's a specialty account, a high proportion of its followers are active and quite prolific (often thousands of tweets each).

However, the problem with more popular accounts is that you get a lot of inactive followers, so you burn requests with no data returned, and this drags the collection rate down.

I tried collecting from the most recent followers of @KareenaOnline and then comparing this with skipping the ~5000 most recent followers and collecting from there, but there didn't seem to be much difference (~7K tweets in both cases). Maybe going even further back through the followers list would eventually give more active users.

As for historic tweets, you can collect up to 3200 of a user's most recent tweets (in pages of 200). The user's total tweet count (along with all their other profile data) is embedded in every tweet, so you can check in advance whether they have >200 tweets and decide whether it's worth spending an extra request on them.

What would be ideal is some costless way of telling if a user is very active or not, because then you could skip the duds and maximise your collection rate.

I still find that most users are geo-anonymous. Sometimes you can instead find the account time zone (e.g. "Chennai", "New Delhi") embedded in tweets, but this too seems to be omitted more often than not. For this reason we should identify purely local celebrities (e.g. television stars, regional-language musicians) and rely on the heuristic that the majority of their followers will also be local.
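Something like this pre-check is what I mean (a sketch assuming tweepy status objects; the hint list is illustrative, not exhaustive):

```python
# Illustrative hints -- a real list would cover far more cities/regions.
INDIA_HINTS = ("india", "chennai", "new delhi", "mumbai", "kolkata")

def worth_fetching(tweet, min_tweets=200):
    """Each tweet embeds its author's full profile, so we can decide
    whether to spend a timeline request without any extra API call."""
    user = tweet.user
    loc = (user.location or "").lower()
    tz = (user.time_zone or "").lower()
    looks_local = any(h in loc or h in tz for h in INDIA_HINTS)
    return looks_local and not user.protected and user.statuses_count >= min_tweets
```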

f4bD3v commented 10 years ago

Have you found any statistics on how many tweets an ordinary Twitter user publishes on average?

You can find the Indian celebrities to follow, including rankings, on the following websites:

http://www.indiancelebsontwitter.com/ (numbers not up to date)

http://www.shoutingblogger.com/2012/10/top-10-most-followed-bollywood.html

http://timesofindia.indiatimes.com/tech/slideshow/twittercelebs/Hrithik-Roshan/itslideshow/22184410.cms

mstefanro commented 10 years ago

@radhaus Looking through the API, I think you should use followers/list rather than followers/ids. followers/list gives you 200 users per request and you are allowed 30 queries per window, but it returns each user's location and tweet count (statuses_count), so you can easily filter out everyone whose location does not match a city/region in India or who has fewer than 200 tweets.

A strategy would be to first use followers/list until you reach the limit, then pick the best users and fetch their tweets, as sketched below.
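A sketch of what that filter could look like with tweepy (there, followers/list is exposed as api.followers; the hint list and thresholds are illustrative):

```python
import tweepy

INDIA_HINTS = ("india", "delhi", "mumbai", "chennai", "bangalore")  # illustrative

def filter_followers(api, root_screen_name, min_tweets=200, max_pages=30):
    """Walk followers/list (200 full profiles per page) and keep users
    whose profile location and tweet count pass both checks."""
    kept = []
    cursor = tweepy.Cursor(api.followers, screen_name=root_screen_name, count=200)
    for page in cursor.pages(max_pages):
        for user in page:
            loc = (user.location or "").lower()
            if user.statuses_count >= min_tweets and any(h in loc for h in INDIA_HINTS):
                kept.append((user.screen_name, user.statuses_count, user.location))
    return kept
```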

radhaus commented 10 years ago

Whoa man, nice going, that's exactly what we need. As far as I can tell this should let us fully optimise the data collection rate (in principle 300 x 200 tweets / 15 mins). I'll work on this tomorrow.

mstefanro commented 10 years ago

I've written an example script now; it should be ready in about 10 minutes. You can copy-paste the functions from it into your code.

mstefanro commented 10 years ago

Okay, check example_user_filtering.py. I've run it on the first 20 pages of followers of BeingSalmanKhan. Results:

Stats:

Bad location: 48 (1.20%)
No location: 38 (0.95%)
Too few tweets: 3888 (97.20%)
Accepted: 26 (0.65%)

Found "GusainSanjay" with 1852 tweets from New Delhi
Found "DHANBAD_ASHISH" with 271 tweets from dhanbad,jharkhand
Found "60mlLove" with 46439 tweets from Delhi
Found "BigBlueChief" with 1654 tweets from Around The World
Found "Coteire" with 293 tweets from New Delhi
Found "tendol_forever" with 287 tweets from New Delhi
Found "Imsathi14" with 269 tweets from chennai
Found "PinkiDave3" with 807 tweets from mumbai
Found "DEepank2310" with 481 tweets from KanpuR
Found "Sriyansa" with 143 tweets from Bhubaneswar,Odisha,India
Found "mahipat03311260" with 415 tweets from india
Found "ShuvamSinha5" with 452 tweets from Bally,Howrah,West Bengal
Found "saurabh2688" with 152 tweets from Mumbai
Found "SandeepMelwan" with 236 tweets from india
Found "naveenjn" with 437 tweets from Trivandrum
Found "khwleel" with 515 tweets from Bangalore, India
Found "rchtjn" with 537 tweets from Delhi
Found "ASHU416" with 115 tweets from india
Found "jaskiranbaidwan" with 110 tweets from india
Found "IDOCPRASH" with 338 tweets from Karnataka
Found "nenjack4u" with 288 tweets from india
Found "SahilDhall91" with 107 tweets from New Delhi
Found "WStandardHumanS" with 567 tweets from India
Found "mr_riju" with 190 tweets from Howrah Dasnagar (westbengal)
Found "RajibChakrabo16" with 390 tweets from Guwahati
Found "pratik171090" with 165 tweets from Mumbai

It seems there are very few users matching all criteria. Also, the location check needs to be improved a bit.

mstefanro commented 10 years ago

I've tried Twitter's web search to see if there are any good tweets from these users, but it seems not:

The query was:

expensive (from:DHANBAD_ASHISH OR from:60mlLove OR from:BigBlueChief OR from:Coteire OR from:tendol_forever OR from:Imsathi14 OR from:PinkiDave3 OR from:DEepank2310 OR from:Sriyansa OR from:mahipat03311260 OR from:SandeepMelwan OR from:naveenjn OR from:khwleel OR from:rchtjn OR from:ASHU416 OR from:jaskiranbaidwan OR from:nenjack4u OR from:SahilDhall91 OR from:WStandardHumanS OR from:mr_riju OR from:RajibChakrabo16 OR from:pratik171090)

Link to query:

https://twitter.com/search?q=expensive%20OR%20food%20from%3ADHANBAD_ASHISH%20OR%20from%3A60mlLove%20OR%20from%3ABigBlueChief%20OR%20from%3ACoteire%20OR%20from%3Atendol_forever%20OR%20from%3AImsathi14%20OR%20from%3APinkiDave3%20OR%20from%3ADEepank2310%20OR%20from%3ASriyansa%20OR%20from%3Amahipat03311260%20OR%20from%3ASandeepMelwan%20OR%20from%3Anaveenjn%20OR%20from%3Akhwleel%20OR%20from%3Archtjn%20OR%20from%3AASHU416%20OR%20from%3Ajaskiranbaidwan%20OR%20from%3Anenjack4u%20OR%20from%3ASahilDhall91%20OR%20from%3AWStandardHumanS%20OR%20from%3Amr_riju%20OR%20from%3ARajibChakrabo16%20OR%20from%3Apratik171090&src=typd&f=realtime

f4bD3v commented 10 years ago

What is the criterion for "too few tweets"? From a geographic perspective this approach works: the users are from all over India. Of course, if there are any relevant tweets at all they will be very sparse, so we will have to collect far more users before we can conclude that our idea won't work.

mstefanro commented 10 years ago

@radhaus We can have a 10-minute meeting if you're at the Atrium.

f4bD3v commented 10 years ago

How fast can we build a data structure of most Indian users? Which one are we going to use, and how are we going to store it?

mstefanro commented 10 years ago

I don't see why we need any special data structure. We just need a list of their screen names or IDs so we can grab their tweets.

f4bD3v commented 10 years ago

A list is a data structure. So you want to put the IDs of hopefully several million users in a file?

mstefanro commented 10 years ago

Why not? I don't see why everyone thinks database inserts are somehow more efficient than appending to a file. We can load the IDs into a database afterwards if anything other than traversal is needed. And don't get your hopes up: we're probably not going to have more than 20k users fitting the criteria. In any case, do we even have a remote MongoDB server that we can all access?
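To be concrete, the flat-file version is just this (file name illustrative); even a few million numeric IDs is only tens of megabytes of text:

```python
def append_user_ids(user_ids, path="indian_user_ids.txt"):
    """One numeric ID per line; appends are cheap and trivially resumable."""
    with open(path, "a") as f:
        for uid in user_ids:
            f.write("%d\n" % uid)

def load_user_ids(path="indian_user_ids.txt"):
    with open(path) as f:
        return [int(line) for line in f if line.strip()]
```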

mstefanro commented 10 years ago

I guess we can use Windows Azure to host our MongoDB, and store some stuff there.

duynguyen commented 10 years ago

@mstefanro That's the thing I'm working on. I'll go to the TA today to get the server credentials, then play around to see whether we can install MongoDB on it or must wait until the M$ crab is provided.

f4bD3v commented 10 years ago

@radhaus Can you post an update here once you're done for the day?

f4bD3v commented 10 years ago

@grill , @radhaus What is the current state of things?

radhaus commented 10 years ago

We made a few changes to the script and now it's ready to go. I've also picked out a set of root accounts for collection. I'd like to start ASAP.

f4bD3v commented 10 years ago

Do you store the whole tweet JSON object now?

radhaus commented 10 years ago

Yeah, rolled back to the pickle implementation.
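Roughly like this (a sketch of the pickle storage, not the exact code; names are illustrative):

```python
import pickle

def dump_tweets(tweets, path):
    """One pickled list of tweet objects per collection run."""
    with open(path, "wb") as f:
        pickle.dump(tweets, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_tweets(path):
    with open(path, "rb") as f:
        return pickle.load(f)
```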

grill commented 10 years ago

@distribution script: I will send a "how-to" to everybody today, explaining how to upload your public keys quickly and how to send commands to the different machines.

@filtering of twitter archives: Filtering one archive takes up to 13h, which is why I think we should filter at most about 24 months. That would take about 2 days.

@quality of data: A lot of tweets match the criteria, but most of them (as expected) have unknown location... But there is some data. I also found people tweeting actual prices. I had to remove a couple of prediction & food words because they produced many useless tweets. I am still not really sure how good the data is.

I've found no relevant geo-located tweet in India in December 2013, but in Indonesia there are actually a couple, and Malaysia also doesn't seem to be bad (the coordinate check for this first filtering pass is really simple and also matches part of Malaysia).

But looking for mentioned regions and looking up the user location (this is part of the tweet object) actually works okayish. All other filtered tweets are stored in a separate file, which can then be filtered again by user ID.
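Roughly, the categorization step looks like this (a sketch, not the actual filter script; the bounding box and the region-name list are illustrative):

```python
import json

def in_rough_india_bbox(lon, lat):
    # Deliberately crude box; as noted above, it also catches part of Malaysia.
    return 68.0 <= lon <= 97.5 and 6.0 <= lat <= 36.0

REGION_NAMES = ("india", "delhi", "mumbai", "chennai", "kolkata")  # illustrative

def categorize(line):
    """Return the category of one archived tweet (one JSON object per line)."""
    t = json.loads(line)
    coords = t.get("coordinates")
    if coords and coords.get("type") == "Point":
        lon, lat = coords["coordinates"]
        if in_rough_india_bbox(lon, lat):
            return "geo-location"
    if t.get("place"):
        return "place"
    text = (t.get("text") or "").lower()
    if any(name in text for name in REGION_NAMES):
        return "region-mentioned"
    user_loc = ((t.get("user") or {}).get("location") or "").lower()
    if any(name in user_loc for name in REGION_NAMES):
        return "user-location"
    return "rest"
```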

f4bD3v commented 10 years ago

Thank you for your efforts!

So if we run the "user collection script" on 8 machines to get all 'useful' Twitter users from India, how long would that take?

Given the time constraints, we should leave filtering the archives until we have collected enough data from the API with the user approach. Ideally we should try to get tweets dating back to 01.01.2009, so we have five years' worth of tweets to work with.

The geolocated tweets you mention from Indonesia, are they in English?

Let's group the tweets from India into three clear categories:

1) geolocated tweets (a small collection, might be nice later for visualizing examples on a map)

2) tweets with a "specific" user location (also allows groupings by state, region, city etc., although biased of course)

3) general tweets from India (user location "India", Indian user name?); we can use these when computing indicators for the whole country, for example

So the next challenge is to actually get these 5 years of tweets from India as fast as possible, allowing some of us to start building indicators. Once we have the tweets, we can run filtering of the archives on a subset of machines for a set of months and see if there is any complementary information. What do you think?

Duy will build a map where we can visualize the results from the twitter analysis, so your efforts will not be in vain.

grill commented 10 years ago

> The geolocated tweets you mention from Indonesia, are they in English?

Some are English, some are Indonesian.

> Let's group the tweets from India into three clear categories:

The archive script already does filtering by keywords and categorization, and can just be reused. The categories are: geo-location, place (one can specify a place with a tweet), region/city of India or India mentioned in the tweet, user location, rest. By looking up user IDs through a common root (like a celebrity) we get a new category: follows celeb (and is therefore probably from India). Geo-located tweets are really, really rare. I've looked through an archive of December 2013 and didn't find one containing relevant keywords... So currently I do not expect a lot of tweets from the first category you mentioned, but we'll see :-) I think it's a good idea to separate tweets into specific and general India during our first processing step. The first category is already present.

> What do you think?

I think we can run both approaches on all machines in parallel, since they are not computationally expensive and run in separate processes anyway. But finding out how long it would take to get tweets for the last 5 years has priority.

grill commented 10 years ago

Sorry, I think I was a bit confused... What I meant was that the tweets loaded from the accounts could be filtered and categorized on the fly, the way it has been done with the archives. The keywords used for filtering are sort of a first set of indicators (food words, predictive words (e.g. increase, starve, ...)) and could be used to retrieve a first set of potentially relevant tweets. The other tweets could still be stored somewhere and later used to find other indicator words, roughly as sketched below.
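A sketch of that on-the-fly split (the two keyword sets are illustrative placeholders for the real indicator lists):

```python
FOOD_WORDS = {"food", "rice", "wheat", "onion", "meal"}         # illustrative
PREDICTIVE_WORDS = {"increase", "rise", "expensive", "starve"}  # illustrative

def route_tweet(text, relevant_out, rest_out):
    """Send keyword hits to the 'relevant' file; keep everything else
    for mining new indicator words later."""
    words = set(text.lower().split())
    hit = words & FOOD_WORDS and words & PREDICTIVE_WORDS
    (relevant_out if hit else rest_out).write(text.replace("\n", " ") + "\n")

# Usage:
# with open("relevant.txt", "a") as rel, open("rest.txt", "a") as rest:
#     for text in tweet_texts:
#         route_tweet(text, rel, rest)
```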

f4bD3v commented 10 years ago

done.