Open xmacex opened 6 years ago
Oops sorry I was neglecting the logs/track.error.log
, silly me. With this TCAT query bin in effect, we get a HTTP/1.1 406 Not Acceptable
response, and the dmitcat_track.php
terminates silently terminates. I won't be able to paste in the whole thing from the log file for security reasons, sorry. I'm afraid there is more TCAT source code reading for me to do...
Twitter Standard streaming API request parameters says about track
Each phrase must be between 1 and 60 bytes, inclusive.
Does it make any sense to document this issue like so, without writing this test against TCAT's codebase
<?php
/**
* See https://github.com/digitalmethodsinitiative/dmi-tcat/issues/334 for the issue
*/
use PHPUnit\Framework\TestCase;
final class QueryPhraseTest extends TestCase
{
/**
* Test that any phrase is within the Twitter API spec
* https://developer.twitter.com/en/docs/tweets/filter-realtime/guides/basic-stream-parameters.html
*
* @dataProvider phraseProvider
*/
public function testPhraseLengthIsWithinTwitterApiTrackSpecs($phrase)
{
$minLength = 0;
$maxLength = 60;
$this->assertGreaterThanOrEqual($minLength, strlen($phrase));
$this->assertLessThanOrEqual($maxLength, strlen($phrase));
}
/**
* Data providers
*/
public function phraseProvider()
{
return [
["#Chinabigbrother #Citizenrating #Chinasurveillance #Socialcredit #Chinablackmirror #Digitaldictatorship #Digitalcensorship #Socialcreditsystem"],
["#Chinabigbrother #Citizenrating #Chinasurveillance #Socialcredit",
"#Chinablackmirror #Digitaldictatorship #Digitalcensorship #Socialcreditsystem"],
["#Chinabigbrother #Citizenrating",
"#Chinasurveillance #Socialcredit",
"#Chinablackmirror #Digitaldictatorship",
"#Digitalcensorship #Socialcreditsystem"],
];
}
}
Which currently gives us
PHPUnit 7.3.5 by Sebastian Bergmann and contributors.
FF. 3 / 3 (100%)
Time: 122 ms, Memory: 10.00MB
There were 2 failures:
1) QueryPhraseTest::testPhraseLengthIsWithinTwitterApiTrackSpecs with data set #0 ('#Chinabigbrother #Citizenrati...system') Failed asserting that 142 is equal to 60 or is less than 60.
/Users/maco/tip/QueryPhraseTest.php:20
2) QueryPhraseTest::testPhraseLengthIsWithinTwitterApiTrackSpecs with data set #1 ('#Chinabigbrother #Citizenrati...credit', '#Chinablackmirror #Digitaldic...system') Failed asserting that 64 is equal to 60 or is less than 60.
/Users/maco/tip/QueryPhraseTest.php:20
FAILURES! Tests: 3, Assertions: 12, Failures: 2.
Anyway, towards a solution I would be inclined to fortify the web UI from https://github.com/digitalmethodsinitiative/dmi-tcat/blob/85504ded7fc9ad0274f8e977d23e80e2961d8ac3/capture/index.php#L488-L493
to something like
if(params['type']=='track') {
var _nrOfPhrases = validateNumberOfPhrases(params['oldphrases'].split(",").length,_newphrases.split(",").length);
if(!_nrOfPhrases) {
alert("With this query you will exceed the number of allowed queries (400) to the Twitter API. Please reduce the number of phrases.");
return false;
}
var _lenOfPhrases = validateLengthOfPhrases(_newphrases);
if(!_lenOfPhrases) {
alert("One or more of the phrases exceeds the allowed length (60 characters) to the Twitter API. Please shorten your phrases.");
return false;
with validateLengthOfPhrases(phrases)
being something like phrases.split(',').every(phrase => phrase.length <= 60)
. Some fortification further down the stack would be good too, I imagine, before phrases are accepted into the database.
Or alternatively fortify the web UI in here https://github.com/digitalmethodsinitiative/dmi-tcat/blob/85504ded7fc9ad0274f8e977d23e80e2961d8ac3/capture/index.php#L750
Hi @xmacex,
thanks for submitting this issue.
As you stated, the phrase does not make any sense as it would only track tweets which contain exactly "#Chinabigbrother #Citizenrating #Chinasurveillance #Socialcredit #Chinablackmirror #Digitaldictatorship #Digitalcensorship #Socialcreditsystem".
TCAT should indeed validate the length of phrases, although it should not insert arbitrary comma's thus modifying the phrases input by the user. Instead, the user could get a warning indicating that the phrase to track is too long.
Best,
Erik
@ErikBorra I suggest let's keep the query design (Rogers 2017; and of course a whole body of literature in information retrieval) issues separate from input string validation. Of course as STS scholars this is a moment of reflection about how this divide exactly gets constantly made and re-made in the everyday practices 😆
Anyway I'd be happy to work on a pull request. I imagine input validation at JS as suggested in https://github.com/digitalmethodsinitiative/dmi-tcat/issues/334#issuecomment-425757821, but if someone can nudge me towards what would be a good place to introduce some fortification further down the stack, in I'll do that too while I'm on it. The function create_new_bin
https://github.com/digitalmethodsinitiative/dmi-tcat/blob/4613a1518afc9ef12b82f395a40d8b5871861404/capture/query_manager.php#L42 looks like a good candidate
Working on it here https://github.com/xmacex/dmi-tcat/tree/phrases_must_not_exceed_60chrs_each. I really don't know how to provide tests for Javascript functions buried in PHP 😟
A terminological observation about myself: I seem to have converged towars speaking of kitten,lizard,'large aardvark',anteater
as one query containing four phrases.
Hi @xmacex
Sorry for the late reply on this and thank you for your pull request. It looks sound in principle. Some quick remarks before I can merge this. Maybe we should add a maximum length to the @username
in the follow
role as well (which appears to be 15 characters). There is a third location where (individual) phrases are validated, which is validate_capture_phrases()
in common/functions.php
and is called by CLI scripts such as search.php
. Finally, could you detach your unit test from the pull request? It would leave a script in the tree which has non-present dependencies.
Thanks @dentoir. The Twitter API documentation for follow
does not seem to state that the lengths of usernames are limited as API query time (of course on new user registration, yes). Exceeding the 60 characters limit for a track, the Twitter API responds with a HTTP/1.1 406 Not Acceptable
. Does it give an error in the HTTP response, and freak out TCAT if asked to follow a user name too long? I am not sure if I can currently verify that myself.
I implemented validate_capture_phrases()
hopefully as suggested in commit https://github.com/digitalmethodsinitiative/dmi-tcat/pull/335/commits/df582eebe1bbb34e8c0d441cd3133a83d70af5a6, sorry I missed that. Thanks for pointing it out. If someone can take a quick review of the PR #335, I'll detach the test suite and nudge the pull request.
Hi. The following, long conjunctive query phrase
#Chinabigbrother #Citizenrating #Chinasurveillance #Socialcredit #Chinablackmirror #Digitaldictatorship #Digitalcensorship #Socialcreditsystem
will make TCAT not restart after the routine reload which is performed shortly after a query bin is added, as observed inlogs/controller.php
I am waiting for the process
dmitcat_track.php
to respawn via the normal cronjob by monitoring its appearance inps afx |grep 'var/www/dmi-tcat' |grep "track"
listing. It won't be there. Removing the bin from the usual web UI brings thedmitcat_track.php
back, after the routine reload is done (in a dozen or so seconds).The query phrase is long, 142 characters. If I arbitrarily split it into two by placing a disjuctive comma in the middle
#Chinabigbrother #Citizenrating #Chinasurveillance #Socialcredit, #Chinablackmirror #Digitaldictatorship #Digitalcensorship #Socialcreditsystem
the same issue still persists.However, if I further split these into two disjunctions each,
#Chinabigbrother #Citizenrating, #Chinasurveillance #Socialcredit, #Chinablackmirror #Digitaldictatorship, #Digitalcensorship #Socialcreditsystem
, things work as usual. Of course any actual query design is out the window by this point (if it ever was there, but that's an another matter).I observed the unabridged query phrase go into the database. The bin gets it's entries in
tcat_query_bins_*
tables, as usual.I am not particularly expert in PHP / SQL webapp debugging, so if someone can please guide me how to get more information about the program flow, where to request debug messages, throw PHP exceptions and stack traces, or maybe even run the
dmitcat_track.php
under XDebug or another debugging tool, I would appreciate it.I hope I will be able the send a pull request once I have a better idea where to get more information what goes wrong.