UkraineNow-Intel / autoSA-backend

Django backend for autoSA

Implement a class / method to collect Twitter data and store in database #17

Closed by j-bennet 2 years ago

j-bennet commented 2 years ago

You will probably be using the Search API:

https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/api-reference/get-search-tweets

implemented in Tweepy:

https://docs.tweepy.org/en/stable/api.html#tweepy.API.search_tweets

We already have the database model for this data, called api.models.Source:

https://github.com/UkraineNow-Intel/autoSA-backend/blob/911490be666460bdfb7ac29bd5cd8194d08c4e6c/api/models.py#L28

It's designed to represent a single event coming from scraping a website or from some external API, such as Twitter or Fractal. In this case, we're going to store tweets.

A Source has language and text, and possibly locations and translations.

Something like this would work to start with:

from typing import List, Optional

def collect_tweets(account_names: List[str], hashtags: List[str], date_from=None, date_to=None, keywords: Optional[List[str]] = None) -> List[dict]:
    # collect tweets
    # convert them to dicts that have the same fields as Source
    # return a list of those
    raise NotImplementedError
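
A minimal sketch of the conversion step, assuming each tweet arrives as a dict shaped like a Twitter API v2 payload (`id`, `text`, `lang`, `created_at`). Only `text` and `language` are known Source fields from the description above; `interface`, `external_id` and `timestamp` are hypothetical stand-ins, and the real names should be taken from api.models.Source.

```python
from typing import Any, Dict, List


def tweet_to_source_dict(tweet: Dict[str, Any]) -> Dict[str, Any]:
    """Map one tweet payload to a dict with Source-like fields.

    Only "text" and "language" are known Source fields; the other keys
    are hypothetical and should be renamed to match api.models.Source.
    """
    return {
        "interface": "twitter",                # hypothetical: origin of the event
        "external_id": str(tweet["id"]),       # hypothetical: the tweet ID
        "text": tweet["text"],
        "language": tweet.get("lang", ""),     # v2 payloads call this "lang"
        "timestamp": tweet.get("created_at"),  # hypothetical
    }


def tweets_to_source_dicts(tweets: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert a batch of tweet payloads, skipping any without text."""
    return [tweet_to_source_dict(t) for t in tweets if t.get("text")]
```

Returning plain dicts keeps this step independent of Django, so it can be run and tested from an ordinary Python script.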
mgavish commented 2 years ago

@j-bennet I've pushed a new branch, twitter-data, with an initial commit. It needs more work, but I'm still having trouble running the models file and wanted to get a conversation started.

I first tried this on Windows with a Conda environment; installing requirements.txt gave me errors about fasttext. I then tried on Ubuntu with a Python venv; installing requirements.txt gave me errors about crontab. I don't think either of these is contributing to the errors when running models.py.

I have been getting the same error when running models.py in both of the environments above:

  File "/home/matan/autoSA-backend/UkrNow_venv/lib/python3.8/site-packages/django/conf/__init__.py", line 63, in _setup
    raise ImproperlyConfigured(
django.core.exceptions.ImproperlyConfigured: Requested setting INSTALLED_APPS, but settings are not configured. You must either define the environment variable DJANGO_SETTINGS_MODULE or call settings.configure() before accessing settings.

I found this potential solution, but I don't know what our equivalent of the author's settings.py is.
https://alokkumar-17171.medium.com/django-error-django-core-exceptions-improperlyconfigured-ed0b9023cfc9

j-bennet commented 2 years ago

This is because you can't import a Django model into a plain Python console script without configuring Django first. Models are only available once settings are loaded, as they are in a running Django app. Solution: return dicts that will be stored as Sources by a scheduled task that will live under website.
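
One way to picture that split, as a sketch only: the collector stays plain Python and returns dicts, and the Django side passes in whatever callable does the saving. `store_sources` and its `save` parameter are hypothetical names. Inside the scheduled task, `save` might be something like `lambda d: Source.objects.create(**d)`, assuming the dict keys match the model fields.

```python
from typing import Any, Callable, Dict, Iterable


def store_sources(
    source_dicts: Iterable[Dict[str, Any]],
    save: Callable[[Dict[str, Any]], Any],
) -> int:
    """Pass each collected dict to `save` and return how many were stored.

    The collector never imports Django; only the caller (the scheduled
    task running inside the Django app) knows about the Source model.
    """
    stored = 0
    for source_dict in source_dicts:
        save(source_dict)
        stored += 1
    return stored


# Outside Django, any callable works, which keeps the collector testable:
saved = []
count = store_sources([{"text": "example", "language": "en"}], saved.append)
```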

This is unrelated to that particular error, but some of your other problems may come from packages not being properly installed.

  1. Make sure you create the venv using python 3.8 or 3.9. Check with python --version.
  2. Activate the venv.
  3. Install reqs with pip install -r requirements.txt -r requirements-dev.txt -U --upgrade-strategy only-if-needed. You need both regular and dev requirements.
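
Steps 1 and 2 can also be checked from inside Python. This is a small sketch using only the standard library; the function names are made up for illustration.

```python
import sys
from typing import Tuple

SUPPORTED = {(3, 8), (3, 9)}


def is_supported_python(version: Tuple[int, int]) -> bool:
    """Return True for the Python versions recommended above (3.8 or 3.9)."""
    return tuple(version[:2]) in SUPPORTED


def in_virtualenv() -> bool:
    """Return True inside a venv/virtualenv, where prefix differs from base.

    Conda environments are not detected by this check.
    """
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("python ok:", is_supported_python(sys.version_info[:2]))
    print("venv active:", in_virtualenv())
```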
mgavish commented 2 years ago

Re. MVP:

TL;DR: We can get tweets by user from our list, but hashtags and keywords are going to take more time and research, and possibly access to the Academic Research product track. The question is: can we deliver an MVP without hashtags and keywords?

Searching for hashtags is not available under API v1.

We have been using client v2, and in this version the (probably) required endpoint, search_all_tweets, is accessible only with Academic Research access, which appears to have a waiting list.

The work-around (as I see it) would be to scrape massive amounts of data and filter it with regex. However, I'm not yet convinced that search_all_tweets is what we need: this post from the Twitter developer community seems promising and needs to be explored, which I plan on doing tomorrow. Additionally, while hashtags and keywords are part of the final scope, I feel we can still deliver an MVP with only tweets from the procured list while attacking hashtags and keywords from two angles.
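
For reference, the regex work-around could look roughly like the sketch below: pull hashtags out of already-collected tweet texts and keep only the tweets that carry one we care about. The function names are hypothetical, and the pattern is a simplification of Twitter's own hashtag rules.

```python
import re
from typing import Iterable, List

# \w is Unicode-aware for str patterns, so Cyrillic hashtags match too.
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(text: str) -> List[str]:
    """Return hashtags found in the text, without '#', case-folded."""
    return [tag.casefold() for tag in HASHTAG_RE.findall(text)]


def filter_by_hashtags(texts: Iterable[str], wanted: Iterable[str]) -> List[str]:
    """Keep only texts that contain at least one of the wanted hashtags."""
    wanted_set = {w.lstrip("#").casefold() for w in wanted}
    return [t for t in texts if wanted_set.intersection(extract_hashtags(t))]
```

Case-folding both sides means `#маріуполь` and `#Маріуполь` are treated as the same tag.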

j-bennet commented 2 years ago

@mgavish

Filtered stream is one way to go, yes.

Another option is to use Search API v2. There's a Python package for that:

https://pypi.org/project/searchtweets-v2/

And the GitHub repo:

https://github.com/twitterdev/search-tweets-python/tree/v2

After installing the package, you can test it with the included console script, search_tweets.py. Here is my query, for example:

search_tweets.py \
  --credential-file .creds \
  --max-tweets 10 \
  --results-per-call 10 \
  --tweet-fields id,created_at,text,author_id,geo,possibly_sensitive,source,lang \
  --user-fields id,created_at,name,username,verified,location,url \
  --place-fields id,name,country_code,place_type,full_name,country,contained_within,geo \
  --expansions author_id \
  --start-time 2022-04-12T00:00 \
  --query "(Mariupol OR #Маріуполь OR @Reuters) -is:retweet" \
  --print-stream

(Note that the console app is not what you will be using in the end. It's just a handy testing tool.)

As you can see, the query can include keywords, hashtags, and usernames.

Oh, and for hashtags, we don't need the has:hashtags operator. We need the # operator, which is core. For usernames, it's the @ operator:

https://developer.twitter.com/en/docs/twitter-api/tweets/search/migrate#:~:text=Core-,%23,-Available

For the full list of operators, see this:

https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query#list
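
The query string from the example above can be assembled from the three input lists. This is a sketch, not a settled design: `build_query` and its parameters are hypothetical names, and it only uses the operators already shown in this thread (`#`, `@`, `OR`, `-is:retweet`) plus quoted phrases.

```python
from typing import Iterable, Optional


def build_query(
    keywords: Optional[Iterable[str]] = None,
    hashtags: Optional[Iterable[str]] = None,
    account_names: Optional[Iterable[str]] = None,
    exclude_retweets: bool = True,
) -> str:
    """Join keywords, #hashtags and @usernames with OR into one query."""
    terms = []
    for kw in keywords or []:
        # Quote multi-word keywords so they match as an exact phrase.
        terms.append(f'"{kw}"' if " " in kw else kw)
    terms += ["#" + h.lstrip("#") for h in hashtags or []]
    terms += ["@" + a.lstrip("@") for a in account_names or []]
    if not terms:
        raise ValueError("at least one keyword, hashtag or account is required")
    query = "(" + " OR ".join(terms) + ")"
    if exclude_retweets:
        query += " -is:retweet"
    return query


print(build_query(["Mariupol"], ["Маріуполь"], ["Reuters"]))
# prints: (Mariupol OR #Маріуполь OR @Reuters) -is:retweet
```

Note that `@Reuters` matches tweets that mention the account. If the goal is tweets written by an account, the `from:` operator may be the better fit.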

Currently, I have an API key with Elevated access, which covers the core operators. If we need more, we'd have to get a key with Academic Research enabled.

There is an extremely helpful API testing page. You can visually select criteria, and it will generate a curl command:

https://developer.twitter.com/apitools/api?endpoint=%2F2%2Ftweets%2Fsearch%2Frecent&method=get

Oh, and Tweepy also includes a wrapper for v2. I'm not set on using Tweepy though. It was Zach who used it, and I don't know why he chose it. We can use whichever package you prefer. Here are the Tweepy API docs for recent search v2:

https://docs.tweepy.org/en/stable/client.html#search-tweets
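
If we do go with Tweepy, a recent-search call might look roughly like this. It is a sketch under assumptions: `search_recent` is a hypothetical helper, the client is passed in rather than created here so the function can be tried without credentials, and the chosen tweet fields are just a subset of the ones in the console example above.

```python
from typing import Any, Dict, List


def search_recent(client: Any, query: str, max_results: int = 10) -> List[Dict[str, Any]]:
    """Run a v2 recent search and return plain dicts.

    `client` is expected to behave like tweepy.Client: it must provide
    search_recent_tweets() returning an object whose .data is a list of
    tweets (or None when nothing matched).
    """
    response = client.search_recent_tweets(
        query=query,
        tweet_fields=["id", "created_at", "text", "author_id", "lang"],
        max_results=max_results,  # the endpoint accepts 10 to 100 per call
    )
    return [
        {
            "id": tweet.id,
            "text": tweet.text,
            "lang": tweet.lang,
            "created_at": tweet.created_at,
        }
        for tweet in (response.data or [])
    ]


# With real credentials (not run here):
#   import tweepy
#   client = tweepy.Client(bearer_token="...")
#   tweets = search_recent(client, "(Mariupol OR #Маріуполь) -is:retweet")
```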

I hope this can be of help.

j-bennet commented 2 years ago

PR: https://github.com/UkraineNow-Intel/autoSA-backend/pull/33