Closed by jwyg 3 years ago
Though I'm still relatively new to 4CAT, I've been exploring temporary fixes for removing items from the worker queue, and I'm leaving this here in case it's useful for others...
If running 4CAT with Docker, one can explore the database using psql with the following:
docker exec -it db /bin/bash
psql fourcat fourcat
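The same session can also be opened in a single step; a minimal shortcut, assuming the database container, database and user are named db and fourcat as above:
docker exec -it db psql fourcat fourcat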
To list the tables and other relations in the database:
\d
Which shows the following:
List of relations
Schema | Name | Type | Owner
--------+------------------+----------+---------
public | access_tokens | table | fourcat
public | datasets | table | fourcat
public | datasets_id_seq | sequence | fourcat
public | jobs | table | fourcat
public | jobs_id_seq | sequence | fourcat
public | metrics | table | fourcat
public | users | table | fourcat
public | users_favourites | table | fourcat
(8 rows)
To show the items in the jobs table:
SELECT * from jobs;
Which lists jobs with the following structure:
fourcat=# SELECT * from jobs;
id | jobtype | remote_id | details | timestamp | timestamp_after | timestamp_lastclaimed | timestamp_claimed | status | attempts | interval
----+----------------------+----------------------------------+---------+------------+-----------------+-----------------------+-------------------+--------+----------+----------
Then to remove a specific job:
DELETE FROM jobs
WHERE id = [the id of the job that you would like to delete];
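Before deleting anything, it can help to first list only the jobs you intend to remove and note their ids. A minimal sketch run from the host, assuming the container, database and user names used above, and that the stuck jobs are the twitterv2-search workers discussed below (adjust the jobtype to your own case):
# list only the Twitter search jobs, with enough columns to identify them
docker exec db psql -U fourcat -d fourcat -c "SELECT id, jobtype, remote_id, timestamp, attempts FROM jobs WHERE jobtype = 'twitterv2-search';"
# then delete a single job by its id once you are sure (replace 123 with an id from the SELECT above)
docker exec db psql -U fourcat -d fourcat -c "DELETE FROM jobs WHERE id = 123;"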
This succeeds in removing jobs, but I'm not sure whether this is advisable or whether it may lead to problems or unexpected behaviours later on. Any advice on best practices for removing jobs would be much appreciated!
Thanks for letting us know about your issue. Could you check the logs for one or more of these queries?
There is a general log that should show whenever a dataset has an error:
docker exec -it 4cat_backend /bin/bash
cat 4cat.log
or, to see only the last n lines (here 20):
tail -n 20 4cat.log
Each dataset also has its own log file in the data folder, named after its dataset key (the key that appears in the dataset's URL):
cd data
ls | grep dataset_key
cat dataset_name-dataset_key.log
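The same check can also be done from the host without opening a shell in the container; a minimal sketch, assuming the backend container is named 4cat_backend as above and dataset_key stands for the key taken from the dataset's URL:
# look for the dataset's result and log files by key
docker exec 4cat_backend sh -c "ls data | grep dataset_key"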
jwyg's approach will delete the job. There is also a helper script (found in the helper-scripts folder) to completely remove a dataset and delete all associated analyses, but you should not need that here, since you are only attempting to create the dataset in the first place. I will check, but I do not believe we have added anything to the control panel to remove a query while it is running (you can remove datasets from the interface once they have completed, however).
I am very curious to see the logs, because when a job fails it should stop completely and immediately move on to new queued jobs. It should not attempt the failed job again until a restart of 4CAT is performed (it will try again every time you restart 4CAT, and a restart occurs every time you stop and start the Docker container). I hope this helps some!
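For reference, a restart can be triggered from the host; a minimal sketch, assuming the backend container is named 4cat_backend as above and the instance was started with docker-compose:
# restart only the backend container
docker restart 4cat_backend
# or restart the whole stack, run from the directory containing docker-compose.yml
docker-compose restart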
Thanks so much for this swift and very helpful reply @dale-wahl! (Also, just for context: I'm both the original poster and the replier to my own post, as I was figuring out how to find and remove jobs with psql! 😆)
Here is a lightly redacted excerpt from the logs if you have time for a look!
16-10-2021 06:03:13 | INFO (api.py:885): Local API listening for requests at 4cat_backend:4444
16-10-2021 06:03:15 | INFO (processor.py:885): Running processor twitterv2-search on dataset [id]
16-10-2021 06:03:15 | WARNING (processor.py:885): Job twitterv2-search/[id] was queued for a dataset already marked as finished, deleting...
16-10-2021 06:11:45 | INFO (processor.py:885): Running processor twitterv2-search on dataset [id]
16-10-2021 06:11:45 | INFO (search.py:885): Querying: {'query': '[query] ', 'api_bearer_token': '[token]', 'api_type': 'all', 'min_date': None, 'max_date': None, 'amount': 10, 'user': '[user]', 'datasource': 'twitterv2', 'type': 'twitterv2-search', 'pseudonymise': True, 'label': '[label]'}
16-10-2021 06:13:13 | INFO (logger.py:249): Compiling logs into report
16-10-2021 06:29:03 | INFO (logger.py:885): Compiling logs into report
16-10-2021 06:41:23 | INFO (logger.py:885): Compiling logs into report
16-10-2021 06:52:02 | INFO (logger.py:885): Compiling logs into report
16-10-2021 07:02:30 | INFO (logger.py:885): Compiling logs into report
16-10-2021 07:16:32 | INFO (logger.py:885): Compiling logs into report
16-10-2021 07:27:02 | INFO (logger.py:885): Compiling logs into report
16-10-2021 07:40:34 | INFO (logger.py:885): Compiling logs into report
16-10-2021 07:54:54 | INFO (logger.py:885): Compiling logs into report
16-10-2021 17:26:50 | INFO (api.py:885): No input on API call from [address] - closing
16-10-2021 17:26:50 | INFO (logger.py:885): Compiling logs into report
16-10-2021 22:29:56 | INFO (api.py:885): No input on API call from [address] - closing
16-10-2021 22:29:56 | INFO (logger.py:885): Compiling logs into report
17-10-2021 00:23:33 | INFO (api.py:885): No input on API call from [address] - closing
17-10-2021 00:23:33 | INFO (logger.py:885): Compiling logs into report
17-10-2021 00:56:01 | INFO (api.py:885): No input on API call from [address] - closing
17-10-2021 00:56:01 | INFO (logger.py:885): Compiling logs into report
17-10-2021 01:52:53 | INFO (api.py:885): No input on API call from [address] - closing
17-10-2021 01:52:53 | INFO (logger.py:885): Compiling logs into report
17-10-2021 04:49:45 | INFO (api.py:885): No input on API call from [address] - closing
17-10-2021 04:49:45 | INFO (logger.py:885): Compiling logs into report
17-10-2021 04:49:45 | INFO (api.py:885): No input on API call from [address] - closing
17-10-2021 04:49:48 | INFO (api.py:885): No input on API call from [address] - closing
17-10-2021 04:49:50 | INFO (api.py:885): No input on API call from [address] - closing
17-10-2021 04:49:52 | INFO (api.py:885): No input on API call from [address] - closing
17-10-2021 04:49:52 | INFO (api.py:885): No input on API call from [address] - closing
17-10-2021 04:49:53 | INFO (api.py:885): No input on API call from [address] - closing
17-10-2021 04:49:55 | INFO (api.py:885): No input on API call from [address] - closing
17-10-2021 07:13:25 | INFO (api.py:885): No input on API call from [address] - closing
For datasets which are hanging (as per the one in the screenshot above), nothing comes up when I do ls | grep dataset_key
(where dataset_key is the part of the URL, as you mention above). Datasets which have completed show up as you indicate, with both a csv and a log file.
Any ideas what could be going wrong?
(Also, if this could be useful for other 4CAT users and hasn't been done already, perhaps I could assist with documentation on the wiki covering some of the things you suggest above?)
You removed and replaced the variables with [id], [address], etc., correct? For a moment, I thought something very odd was occurring.
The line Job twitterv2-search/[id] was queued for a dataset already marked as finished, deleting... occurs when a query is exactly identical to a query that was previously created. Do you already have a dataset that used the same query? We actually just modified that by adding a time-of-creation component in a recent [commit](https://github.com/digitalmethodsinitiative/4cat/commit/7e5d64a2cf21ecb6e0f569eb2decbd053da69f72); it was supposed to prevent, well, duplicate datasets, but was sometimes overly strict. You could pull the linked commit and see if it resolves your issue (if you are indeed trying to query using the same parameters).
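For reference, a minimal sketch of picking up that fix, assuming your 4CAT installation is a git checkout that tracks the main repository (if you installed from prebuilt Docker images you would update or rebuild the images instead):
# from the 4CAT source directory: fetch the latest changes
git fetch origin
# then either merge them in...
git pull
# ...or check out just the commit above to test it in isolation (detached HEAD)
git checkout 7e5d64a2cf21ecb6e0f569eb2decbd053da69f72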
Hey @jwyg , the only explanation I can think of based on this report is that the request to Twitter times out, but because no explicit timeout is set in the code it keeps waiting for a response from Twitter indefinitely.
Commit https://github.com/digitalmethodsinitiative/4cat/commit/8129438216ec2535181d562f42334fc71bd74a70 adds an explicit timeout which should at least trigger an error if this occurs, rather than the query 'hanging' as it does now.
Can you try it with this commit and see if the issue occurs in the same way? If this is indeed the issue, the question remains why it would time out, but at least then we know what to look for...
The issue turned out to be that the 4CAT backend daemon was not running inside the Docker container. While that is potentially an issue in itself, it is not specific to the Twitter datasource, so this issue can be closed for now.
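For anyone running into the same symptom, a quick way to check is to look inside the backend container; a minimal sketch, assuming the container is named 4cat_backend and that ps is available in the image (the exact process name may differ per installation):
# look for the backend process
docker exec 4cat_backend ps aux
# a log that has gone silent is another hint that the daemon is not running
docker exec 4cat_backend tail -n 5 4cat.log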
We have our 4CAT instance up and running thanks to the wonderful Docker container (thanks for the advice and support @stijn-uva) 🎈 🥳
When testing and setting up data collections, @lilianabounegru has found that API queries with a date range (i.e. beyond the default 30-day window) seem to hang.
This is the case even with relatively small queries (e.g. 10 items) and relatively small date ranges (e.g. 1 day). These queries remain uncompleted for days, with the little sand-timer icon on the dataset page and with "twitterv2-search" items remaining in the worker queue, and we've yet to successfully complete a dataset with a date range.
It also seems that, because these queries are not completing, other queries are stalled behind them.
We are set up to collect Twitter data using API.v2 (Academic Track) and have no issues getting data without date parameters.
The API tokens/keys definitely work in other contexts (e.g. with twarc), and there are no issues getting data within the window of the previous 30 days. We also checked the same bearer token on the OILab 4CAT instance, and datasets complete without issue.
So we were wondering: