euagendas / m3inference

A deep learning system for demographic inference (gender, age, and whether an account belongs to an individual person or an organization), trained on a massive Twitter dataset using profile images, screen names, names, and biographies
http://www.euagendas.org
GNU Affero General Public License v3.0

Efficient collection of large list of screen-names/ids via Twitter API #7

Open computermacgyver opened 4 years ago

computermacgyver commented 4 years ago

Currently the infer_screen_name and infer_id methods in M3Twitter accept a single screen name or id and call the Twitter API to get information for that one user. This is inefficient, since the endpoint can return up to 100 users at a time.

New methods should be included in the M3Twitter class to handle a long list of users. These methods should break the list into chunks of 100, respect the rate limit, and gracefully handle any API errors.
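The chunking step described above could be sketched as follows (the helper name `chunked` is hypothetical and not part of the M3Twitter class):

```python
# Hypothetical sketch: split a long list of user ids into chunks of at most
# 100, the maximum the Twitter users/lookup endpoint accepts per request.
def chunked(items, size=100):
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# 250 ids become chunks of 100, 100, and 50
ids = list(range(250))
sizes = [len(chunk) for chunk in chunked(ids)]
```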

(This was previously not needed, as the class scraped profiles from HTML and was designed simply as a demonstration rather than something to be used at scale. The recent change to use the API opens up this opportunity, which would make the library even more user-friendly.)

JanaLasser commented 2 years ago

Any updates on this enhancement, or ideas for a workaround? I am very interested in getting this to work. If you could give me a pointer on where to start, I could potentially implement it.

computermacgyver commented 2 years ago

Hi @JanaLasser . We haven't done this work.

We would first need to write a function that takes a list of user ids or screen names and looks them up via the Twitter API using the /1.1/users/lookup.json endpoint. This is documented here: https://developer.twitter.com/en/docs/twitter-api/v1/accounts-and-users/follow-search-get-users/api-reference/get-users-lookup

It accepts up to 100 users at a time.

After that we would download the profile images and then transform the data to be ready for processing. Functions for these exist but are single-threaded, so they may be slow. I would leave them for now, however, and focus on the first step of using the /users/lookup.json endpoint.
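The first step could look roughly like the sketch below. This is an illustration only, not the library's actual code: the function names, the bearer-token authentication style, and the POST-with-form-body call shape are all assumptions.

```python
# Hypothetical sketch: batch lookups against the v1.1 users/lookup endpoint,
# which accepts up to 100 users per request.
import json
import urllib.parse
import urllib.request

LOOKUP_URL = "https://api.twitter.com/1.1/users/lookup.json"

def lookup_params(chunk):
    """Build the url-encoded form body for one users/lookup call
    (comma-separated list of up to 100 numeric user ids)."""
    return urllib.parse.urlencode({"user_id": ",".join(str(u) for u in chunk)})

def lookup_users(user_ids, bearer_token):
    """Fetch profile JSON for a list of numeric user ids, 100 at a time."""
    profiles = []
    for start in range(0, len(user_ids), 100):
        body = lookup_params(user_ids[start:start + 100]).encode()
        req = urllib.request.Request(
            LOOKUP_URL,
            data=body,
            headers={"Authorization": f"Bearer {bearer_token}"},
        )
        with urllib.request.urlopen(req) as resp:  # network call, needs valid keys
            profiles.extend(json.load(resp))
    return profiles
```

The returned profile dicts would then be fed to the existing image-download and transform functions.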

JanaLasser commented 2 years ago

I created a pull request (https://github.com/euagendas/m3inference/pull/30) where I implemented the changes. I hope this is the right approach (my first ever pull request...).

So far there is only code for user ID lists (not screen name lists). The code does handle lists with >100 IDs by splitting them into chunks of 100 IDs each.

It also does not explicitly respect the API rate limit and will fail with an "Invalid response from Twitter" error if the rate limit is exceeded (similar to the single-user lookup).
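One possible way to handle the rate limit, sketched under the assumption that a rate-limited response carries HTTP status 429 and an `x-rate-limit-reset` epoch timestamp header (as Twitter's v1.1 API documents); the function name and callable interface are hypothetical, not part of the PR:

```python
# Hypothetical sketch: retry a request after sleeping until the rate-limit
# window resets, instead of failing immediately on a 429 response.
import time

def call_with_backoff(do_request, max_retries=3):
    """Run `do_request`, a zero-arg callable returning (status, headers, body).
    On a 429, sleep until the x-rate-limit-reset time and retry."""
    for _ in range(max_retries):
        status, headers, body = do_request()
        if status != 429:
            return body
        reset_at = int(headers.get("x-rate-limit-reset", time.time() + 60))
        time.sleep(max(0, reset_at - time.time()))
    raise RuntimeError("rate limit retries exhausted")
```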

zijwang commented 2 years ago

Thank you @JanaLasser for the PR. It looks nice, and I left a few comments there. I do not have a set of API keys handy -- it would be fantastic if @computermacgyver could help test once these comments are resolved.