digitalmethodsinitiative / dmi-tcat

Digital Methods Initiative - Twitter Capture and Analysis Toolset
Apache License 2.0
367 stars 114 forks source link

Long conjunctive query phrases make TCAT not start #334

Open xmacex opened 6 years ago

xmacex commented 6 years ago

Hi. The following, long conjunctive query phrase #Chinabigbrother #Citizenrating #Chinasurveillance #Socialcredit #Chinablackmirror #Digitaldictatorship #Digitalcensorship #Socialcreditsystem will make TCAT not restart after the routine reload which is performed shortly after a query bin is added, as observed in logs/controller.php

2018-09-30 19:38:02 enforcing reload of config for track 2018-09-30 19:38:02 sending a TERM signal to track for 10060 2018-09-30 19:38:02 waiting for graceful exit of script track with pid 10060

I am waiting for the process dmitcat_track.php to respawn via the normal cronjob by monitoring its appearance in ps afx |grep 'var/www/dmi-tcat' |grep "track" listing. It won't be there. Removing the bin from the usual web UI brings the dmitcat_track.php back, after the routine reload is done (in a dozen or so seconds).

screen shot 2018-09-30 at 19 52 08

The query phrase is long, 142 characters. If I arbitrarily split it into two by placing a disjuctive comma in the middle #Chinabigbrother #Citizenrating #Chinasurveillance #Socialcredit, #Chinablackmirror #Digitaldictatorship #Digitalcensorship #Socialcreditsystem the same issue still persists.

However, if I further split these into two disjunctions each, #Chinabigbrother #Citizenrating, #Chinasurveillance #Socialcredit, #Chinablackmirror #Digitaldictatorship, #Digitalcensorship #Socialcreditsystem, things work as usual. Of course any actual query design is out the window by this point (if it ever was there, but that's an another matter).

I observed the unabridged query phrase go into the database. The bin gets it's entries in tcat_query_bins_* tables, as usual.

I am not particularly expert in PHP / SQL webapp debugging, so if someone can please guide me how to get more information about the program flow, where to request debug messages, throw PHP exceptions and stack traces, or maybe even run the dmitcat_track.php under XDebug or another debugging tool, I would appreciate it.

I hope I will be able the send a pull request once I have a better idea where to get more information what goes wrong.

xmacex commented 6 years ago

Oops sorry I was neglecting the logs/track.error.log, silly me. With this TCAT query bin in effect, we get a HTTP/1.1 406 Not Acceptable response, and the dmitcat_track.php terminates silently terminates. I won't be able to paste in the whole thing from the log file for security reasons, sorry. I'm afraid there is more TCAT source code reading for me to do...

xmacex commented 6 years ago

Twitter Standard streaming API request parameters says about track

Each phrase must be between 1 and 60 bytes, inclusive.

Does it make any sense to document this issue like so, without writing this test against TCAT's codebase

<?php
/**
 * See https://github.com/digitalmethodsinitiative/dmi-tcat/issues/334 for the issue
 */
use PHPUnit\Framework\TestCase;

final class QueryPhraseTest extends TestCase
{
    /** 
     * Test that any phrase is within the Twitter API spec
     * https://developer.twitter.com/en/docs/tweets/filter-realtime/guides/basic-stream-parameters.html
     *
     * @dataProvider phraseProvider
     */
    public function testPhraseLengthIsWithinTwitterApiTrackSpecs($phrase)
    {
        $minLength = 0;
        $maxLength = 60;
        $this->assertGreaterThanOrEqual($minLength, strlen($phrase));
        $this->assertLessThanOrEqual($maxLength, strlen($phrase));
    }

    /**
     * Data providers
     */
    public function phraseProvider()
    {
        return [
            ["#Chinabigbrother #Citizenrating #Chinasurveillance #Socialcredit #Chinablackmirror #Digitaldictatorship #Digitalcensorship #Socialcreditsystem"],
            ["#Chinabigbrother #Citizenrating #Chinasurveillance #Socialcredit",
             "#Chinablackmirror #Digitaldictatorship #Digitalcensorship #Socialcreditsystem"],
            ["#Chinabigbrother #Citizenrating",
             "#Chinasurveillance #Socialcredit",
             "#Chinablackmirror #Digitaldictatorship",
             "#Digitalcensorship #Socialcreditsystem"],
        ];
    }
}

Which currently gives us

PHPUnit 7.3.5 by Sebastian Bergmann and contributors.

FF. 3 / 3 (100%)

Time: 122 ms, Memory: 10.00MB

There were 2 failures:

1) QueryPhraseTest::testPhraseLengthIsWithinTwitterApiTrackSpecs with data set #0 ('#Chinabigbrother #Citizenrati...system') Failed asserting that 142 is equal to 60 or is less than 60.

/Users/maco/tip/QueryPhraseTest.php:20

2) QueryPhraseTest::testPhraseLengthIsWithinTwitterApiTrackSpecs with data set #1 ('#Chinabigbrother #Citizenrati...credit', '#Chinablackmirror #Digitaldic...system') Failed asserting that 64 is equal to 60 or is less than 60.

/Users/maco/tip/QueryPhraseTest.php:20

FAILURES! Tests: 3, Assertions: 12, Failures: 2.

Anyway, towards a solution I would be inclined to fortify the web UI from https://github.com/digitalmethodsinitiative/dmi-tcat/blob/85504ded7fc9ad0274f8e977d23e80e2961d8ac3/capture/index.php#L488-L493

to something like

                if(params['type']=='track') {
                    var _nrOfPhrases = validateNumberOfPhrases(params['oldphrases'].split(",").length,_newphrases.split(",").length);
                    if(!_nrOfPhrases) {
                        alert("With this query you will exceed the number of allowed queries (400) to the Twitter API. Please reduce the number of phrases.");
                        return false;
                    }
                    var _lenOfPhrases = validateLengthOfPhrases(_newphrases);
                    if(!_lenOfPhrases) {
                        alert("One or more of the phrases exceeds the allowed length (60 characters) to the Twitter API. Please shorten your phrases.");
                        return false;

with validateLengthOfPhrases(phrases) being something like phrases.split(',').every(phrase => phrase.length <= 60). Some fortification further down the stack would be good too, I imagine, before phrases are accepted into the database.

xmacex commented 6 years ago

Or alternatively fortify the web UI in here https://github.com/digitalmethodsinitiative/dmi-tcat/blob/85504ded7fc9ad0274f8e977d23e80e2961d8ac3/capture/index.php#L750

ErikBorra commented 6 years ago

Hi @xmacex,

thanks for submitting this issue.

As you stated, the phrase does not make any sense as it would only track tweets which contain exactly "#Chinabigbrother #Citizenrating #Chinasurveillance #Socialcredit #Chinablackmirror #Digitaldictatorship #Digitalcensorship #Socialcreditsystem".

TCAT should indeed validate the length of phrases, although it should not insert arbitrary comma's thus modifying the phrases input by the user. Instead, the user could get a warning indicating that the phrase to track is too long.

Best,

Erik

xmacex commented 6 years ago

@ErikBorra I suggest let's keep the query design (Rogers 2017; and of course a whole body of literature in information retrieval) issues separate from input string validation. Of course as STS scholars this is a moment of reflection about how this divide exactly gets constantly made and re-made in the everyday practices 😆

xmacex commented 6 years ago

Anyway I'd be happy to work on a pull request. I imagine input validation at JS as suggested in https://github.com/digitalmethodsinitiative/dmi-tcat/issues/334#issuecomment-425757821, but if someone can nudge me towards what would be a good place to introduce some fortification further down the stack, in I'll do that too while I'm on it. The function create_new_bin https://github.com/digitalmethodsinitiative/dmi-tcat/blob/4613a1518afc9ef12b82f395a40d8b5871861404/capture/query_manager.php#L42 looks like a good candidate

xmacex commented 6 years ago

Working on it here https://github.com/xmacex/dmi-tcat/tree/phrases_must_not_exceed_60chrs_each. I really don't know how to provide tests for Javascript functions buried in PHP 😟

xmacex commented 6 years ago

A terminological observation about myself: I seem to have converged towars speaking of kitten,lizard,'large aardvark',anteater as one query containing four phrases.

dentoir commented 6 years ago

Hi @xmacex

Sorry for the late reply on this and thank you for your pull request. It looks sound in principle. Some quick remarks before I can merge this. Maybe we should add a maximum length to the @username in the follow role as well (which appears to be 15 characters). There is a third location where (individual) phrases are validated, which is validate_capture_phrases() in common/functions.php and is called by CLI scripts such as search.php. Finally, could you detach your unit test from the pull request? It would leave a script in the tree which has non-present dependencies.

xmacex commented 6 years ago

Thanks @dentoir. The Twitter API documentation for follow does not seem to state that the lengths of usernames are limited as API query time (of course on new user registration, yes). Exceeding the 60 characters limit for a track, the Twitter API responds with a HTTP/1.1 406 Not Acceptable. Does it give an error in the HTTP response, and freak out TCAT if asked to follow a user name too long? I am not sure if I can currently verify that myself.

I implemented validate_capture_phrases() hopefully as suggested in commit https://github.com/digitalmethodsinitiative/dmi-tcat/pull/335/commits/df582eebe1bbb34e8c0d441cd3133a83d70af5a6, sorry I missed that. Thanks for pointing it out. If someone can take a quick review of the PR #335, I'll detach the test suite and nudge the pull request.