JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0

Fix TwitterProfileScraper cycles - escape after duplicates threshold #952

Closed novucs closed 1 year ago

novucs commented 1 year ago

Attempts to solve: https://github.com/JustAnotherArchivist/snscrape/issues/938

JustAnotherArchivist commented 1 year ago

Thanks, and valid approach. I had a different idea: check whether the page of results returned by Twitter contains the same tweets as the previous page with results. That's simpler code and only requires keeping the tweet IDs of one page rather than accumulating a set of everything. I'm also not entirely sure whether the new profile page (#937) never returns tweets multiple times as part of conversation items, which could lead to false positives with this dupe checking approach.

novucs commented 1 year ago

> Thanks, and valid approach. I had a different idea: check whether the page of results returned by Twitter contains the same tweets as the previous page with results. That's simpler code and only requires keeping the tweet IDs of one page rather than accumulating a set of everything. I'm also not entirely sure whether the new profile page (#937) never returns tweets multiple times as part of conversation items, which could lead to false positives with this dupe checking approach.

Good call on this potentially firing false positives when checking just tweet IDs. I've adjusted this PR to look for duplicated pages rather than duplicate tweet IDs. Unfortunately we'll still need to keep the entire set of pages seen for this to work. However, in my experience this only requires storing at most ~3.5k integers, since the profile scraper has quite a limited history; it's not unbounded.
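The "duplicated pages" check described above could be sketched roughly like this (hypothetical names, not snscrape's actual code): fingerprint each non-empty page by its set of tweet IDs and stop as soon as a page repeats.

```python
def deduplicated_pages(pages):
    """Yield pages until a previously seen page recurs (cycle detected).

    `pages` is any iterable of pages, each page being a list of tweet IDs.
    Hypothetical sketch of the approach discussed in this PR.
    """
    seen = set()
    for page in pages:
        if not page:
            # Empty pages don't indicate a cycle by themselves; skip them.
            continue
        fingerprint = frozenset(page)
        if fingerprint in seen:
            # Same page seen before: Twitter is cycling, stop scraping.
            return
        seen.add(fingerprint)
        yield page
```

Storing a `frozenset` per page is what makes the memory cost "the entire set of pages" rather than just the last page, as discussed below.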

JustAnotherArchivist commented 1 year ago

Ah... So previously, I thought Twitter always returned a non-empty page, some positive number of empty pages, then the same non-empty page again. In that case, storing just the last page's tweet IDs would be sufficient. But I actually just came across an example where it looped on two pages, i.e. page1, page2, empty, page1, page2, empty, ..., and another one where it redid at least a dozen pages. It would still be possible to avoid storing all tweet IDs, but the logic for it would be much more complicated, so I guess that's not worth it. If RAM usage becomes an issue, we can limit the number of pages stored in the future.
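The failure mode described above can be illustrated with a small sketch (hypothetical helper names, not snscrape's code): a checker that only remembers the previous page's tweet IDs misses a cycle that spans two pages, while one that remembers all pages catches it.

```python
def detect_cycle_last_page_only(pages):
    """Return the index where a cycle is detected, or None.

    Only compares each non-empty page to the immediately preceding
    non-empty page - the simpler approach originally proposed.
    """
    prev = None
    for i, page in enumerate(pages):
        if not page:
            continue
        ids = frozenset(page)
        if ids == prev:
            return i
        prev = ids
    return None

def detect_cycle_all_pages(pages):
    """Return the index where a cycle is detected, or None.

    Remembers every non-empty page seen, so multi-page loops are caught.
    """
    seen = set()
    for i, page in enumerate(pages):
        if not page:
            continue
        ids = frozenset(page)
        if ids in seen:
            return i
        seen.add(ids)
    return None

# A two-page loop like the one observed: page1, page2, empty, repeat.
cycle = [[1], [2, 3], [], [1], [2, 3], []]
assert detect_cycle_last_page_only(cycle) is None  # misses the loop
assert detect_cycle_all_pages(cycle) == 3          # catches it
```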

I'll have some comments in an hour or two. In the meantime, could you resolve the conflict from the fix for #951 please?

novucs commented 1 year ago

> Ah... So previously, I thought Twitter was always returning a non-empty page, some positive number of empty pages, then the same non-empty page again. Then storing just the last page's tweet IDs would be sufficient. But I actually just came across an example where it looped on two pages, i.e. page1, page2, empty, page1, page2, empty, ..., and another one where it redid at least a dozen pages. It would still be possible to avoid storing all tweet IDs, but the logic for it would be much more complicated, so I guess that's not worth it. If it becomes an issue of too high RAM usage, can limit the number of pages stored in the future.
>
> I'll have some comments in an hour or two. In the meantime, could you resolve the conflict from the fix for #951 please?

Resolved conflicts

JustAnotherArchivist commented 1 year ago

Fixed the style issues myself as I'd like to get this merged and released. Thanks for the PR!