JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0
4.43k stars, 706 forks

"since" filter does not work with TwitterHashtagScraper #124

Closed shizia closed 3 years ago

shizia commented 3 years ago

I used "since:xxxx-xx-xx" to get tweets, but it doesn't seem to work. The tweets I got are all still from October 16th. This is my code:

import snscrape.modules.twitter
for i, tweet in enumerate(snscrape.modules.twitter.TwitterHashtagScraper('Hashtag since:2019-07-01').get_items()):
    print("NO", i + 1, tweet.user.username, "||", tweet.content, "||", tweet.date, "||", tweet.user.location)
    if i >= 10:
        break

What's wrong with my approach?

JustAnotherArchivist commented 3 years ago

While that should currently work, TwitterHashtagScraper is not intended to be used that way. Use TwitterSearchScraper('#hashtag since:2019-07-01') instead.

However, Twitter's search is broken at the moment (#123), so until they fix that, everything will behave oddly.

shizia commented 3 years ago

I used TwitterSearchScraper('#hashtag since:2019-09-01').get_items(), but the tweets I got are still from October 16th.

JustAnotherArchivist commented 3 years ago

Yes, that's what the since filter does: it returns tweets from that date on. So your search produces all tweets that were posted on or after 1 September 2019. Did you mean to use until instead?

shizia commented 3 years ago

Maybe I misunderstood it. I thought that starting from 2019-09-01 meant scraping from the earliest date, so I would get 10 tweets from that day. Yes, maybe I should use until.

JustAnotherArchivist commented 3 years ago

To make it clear: since:YYYY-MM-DD = only return tweets newer than this date (inclusive); until:YYYY-MM-DD = only return tweets older than this date (exclusive). So a search for word since:2020-01-01 until:2020-02-01 fetches tweets from January 2020. All scrapes are sorted in reverse-chronological order, so the results will begin with tweets on 2020-01-31 and end with those on 2020-01-01.
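
As a rough illustration of those semantics, here is a minimal sketch that needs neither snscrape nor network access. The helpers build_query and in_window are hypothetical names invented for this example, not part of snscrape, and the timestamps are made up. It also assumes the dates are interpreted as UTC days, which the thread does not state.

```python
from datetime import date, datetime, timedelta, timezone


def build_query(term, since=None, until=None):
    """Append since:/until: operators (YYYY-MM-DD) to a search term."""
    parts = [term]
    if since is not None:
        parts.append(f"since:{since.isoformat()}")
    if until is not None:
        parts.append(f"until:{until.isoformat()}")
    return " ".join(parts)


def in_window(posted, since=None, until=None):
    """Local model of the filters: since is inclusive, until is exclusive.

    Assumption: dates are compared as UTC days.
    """
    day = posted.astimezone(timezone.utc).date()
    if since is not None and day < since:
        return False
    if until is not None and day >= until:
        return False
    return True


# Made-up timestamps, for illustration only.
sample = [
    datetime(2019, 12, 31, 23, 59, tzinfo=timezone.utc),
    datetime(2020, 1, 1, 0, 0, tzinfo=timezone.utc),
    datetime(2020, 1, 31, 23, 59, tzinfo=timezone.utc),
    datetime(2020, 2, 1, 0, 0, tzinfo=timezone.utc),
]

since, until = date(2020, 1, 1), date(2020, 2, 1)
query = build_query("word", since, until)

# Newest first, mirroring the reverse-chronological order of scrape results.
matches = sorted((t for t in sample if in_window(t, since, until)), reverse=True)

# A one-day window: since is that day, until is the following day.
day = date(2019, 9, 1)
one_day_query = build_query("#hashtag", day, day + timedelta(days=1))

print(query)
print([t.isoformat() for t in matches])
print(one_day_query)
```

The resulting string can be passed to TwitterSearchScraper(query).get_items() as in the earlier comments. Because results arrive newest first, the earliest tweets of a window come last, so getting the first tweets of a given day means narrowing the window to that day and reading to the end of the results.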