TumblThreeApp / TumblThree

A Tumblr and Twitter Blog Backup Application
https://TumblThreeApp.github.io
MIT License

Twitter API potentially locking age restricted content without the use of an account #221

Closed Technotron21 closed 2 years ago

Technotron21 commented 2 years ago

Greetings. I noticed a bizarre thing when I crawled through Twitter blogs today. Though the majority of what I added to the queue downloaded, a couple of blogs that I know are still active didn't get anything new since my last crawl. I removed and re-added a blog that was affected by this, and it proceeded to download not even a fraction of what the folder originally had (127 vs 2,833 images). I checked the blog on desktop Twitter itself, and I noticed a whole lot of posts that I used to be able to see fine (once I clicked on them) that now require me to be logged in to view them. Combine this with NSFW posts not embedding on Discord, and you get what seems like an attempt by the website to sweep such posts under the rug without actually scrubbing any of it. Could there be any solutions to bypass this with the crawler? Thanks.

thomas694 commented 2 years ago

Hello, I haven't noticed this yet, but I only have a few blogs. If you have a small one without too many posts that fails, and it's OK with you, you can email the URL directly to me (see profile; please don't post it here or reply to the notification mail). Then I could have a look at it in the next few days.

T-prog3 commented 2 years ago

I can confirm this issue. Twitter seems to have two kinds of NSFW filters. The first hides potentially sensitive profiles and posts behind a confirm button.

Profile: [screenshot: profile]

Post: [screenshot: post]

They have also implemented a stronger filter on confirmed sensitive content that requires a login:

[screenshot: age-restricted]

Technotron21 commented 2 years ago

Indeed. Since the introduction of the stronger filter, I haven't seen the previous filter be in use at all. Which almost makes the "view profile" thing useless, since it would mainly be there for those without an account (though likely works better when it's an enabled thing while logged in).

thomas694 commented 2 years ago

Thank you both for the example and the screenshots.

With the old "sensitive content" warnings the post's content was still publicly accessible, but with these new "age-restricted content" blockers the post's content isn't accessible at all; the only thing you can do is log in.

I assume there will be no quick and easy solution for downloading these kinds of blogs with TumblThree. Maybe we can find a way to do a login, as we currently do for Tumblr with the authentication window, and then use the cookies to download the blog like before. I'm not sure if that could work here. If someone has more in-depth knowledge, they are more than welcome to help us.
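To make the cookie idea a bit more concrete, here is a minimal sketch in Python (not TumblThree's actual C# code, and not tested against Twitter). It assumes the login cookies have already been exported to a Netscape-format cookie file; the function names, the `example.com` host, and the `session_id` cookie name are all hypothetical placeholders.

```python
import http.cookiejar
import urllib.request


def load_cookie_jar(path):
    """Load cookies from a Netscape-format cookie file (hypothetical export)."""
    jar = http.cookiejar.MozillaCookieJar()
    # Keep session cookies and already-expired entries; the server decides
    # whether they are still valid.
    jar.load(path, ignore_discard=True, ignore_expires=True)
    return jar


def build_opener_with_cookies(jar):
    """Return an opener that attaches matching cookies to every request."""
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def cookie_header_for(jar, url):
    """Show which Cookie header would be sent for `url`, without any network I/O."""
    request = urllib.request.Request(url)
    jar.add_cookie_header(request)
    return request.get_header("Cookie")
```

An opener built this way would be used for the normal crawl requests, e.g. `build_opener_with_cookies(jar).open(url)`. Whether the site accepts such a session for age-restricted content is exactly the open question here.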

Technotron21 commented 2 years ago

So I recently decided to check the program after a bit of non-use, and miraculously everything just started downloading once again. Not really sure what did it, but as far as I can tell, the Twitter crawler is now functioning like it did prior to the website age restriction changes. There still seem to be some quirks with the API on other apps though (one only downloading presumably non-flagged gifs, but another handling all I threw at it fine), so there might still be stuff to look out for.

Technotron21 commented 2 years ago

Okay, checking again now, and it's back to having exactly the same issues. One blog flat out refusing to download, and another very blatantly missing posts. I suspect Twitter is really messing around with the API further (the buyout likely contributing to that), hence the inconsistent results. Given the recent trends of checking for shadow bans on touchier accounts, perhaps that might also have an effect here.

I was gonna mention earlier that gallery-dl was my initial means of getting Twitter posts, so maybe they might be good folk to ask about this. It's certainly still accessible through gif/video downloading apps, so I trust there's some workaround.

Edit: BTW, the original "show sensitive content" disclaimer I mentioned earlier seems to be in effect again, but for posts that the user personally flags as containing restricted material. So that's probably something else to keep track of (in case it applies to this issue in some way).

Edit 2: So I decided to check gallery-dl just now, and that now refuses to download anything from key blogs at all (including some that this program could partially rip). This is seriously getting messy.

thomas694 commented 2 years ago

It's good you're watching it and giving us additional info.

Technotron21 commented 2 years ago

Okay, I think I might've figured out what the issue was: it seems to refuse to download the restricted posts if you're outside the US.

I've been using a VPN for most of the time I've had the app open, which didn't really affect anything for a while. So it did come off as strange to me that it managed to work one day, but then not on the next attempt. So I disabled it for a moment, and lo and behold: it's now working again. I guess what the API has an issue with is non-North American connections grabbing the material (at least as far as Canada goes; I haven't tried anywhere else).

While this does fix the issue for some, it probably opened a new can of worms by the possibility of geo-blocking being involved. Of course, it's very likely that the issue could just be that the Twitter crawler merely doesn't like VPNs, but it's hard to be certain without some proper tests being done. I don't know if this could be considered a case closed, so you be the judge of that.

thomas694 commented 2 years ago

It's good that you found this out and that the problem is more or less solved for you. From TumblThree's side, we are making the requests as always. The problem, I guess, is that Twitter recognizes that the requests are coming from a VPN provider's network.

It's not a problem if the issue stays open a few more days. I just think there won't be much new information, except someone telling us whether or not they have problems from another country or VPN provider.

But I think there's not much we can do, or need to do, as the normal download is working.

Technotron21 commented 2 years ago

Once more, it seems I might've posted a bit too soon. I noticed yesterday when restarting my Twitter collection that Jdownloader2 (which I had open) actually picked up on a couple of blogs as fully age restricted and couldn't grab anything as is. And sure enough: for some of the ones it listed as flagged, T3 grabbed only a fraction of the posts it was able to before. It seems that there's a fundamental difference between blogs that primarily have flagged posts, versus those that are flagged themselves. Not sure if I'm really explaining it well, but that's how I'm seeing it.

That being said: jd2 does actually allow you to download from these blogs once you log in through their cookies method, which does seem to confirm that particular method might be viable here. Otherwise: I'm not sure why else it seems to download so little from users that have many more posts that even the program itself is aware of. I think that's a different issue that was mentioned elsewhere, but I haven't checked it in a bit.

T-prog3 commented 2 years ago

As I mentioned before, Twitter has two types of NSFW filters. The first could be regarded as a warning message. You will recognize it by a View button that can be clicked to see the profile or content that was hidden.

Then you have the second type of filter, which doesn't provide a View button. These are stricter, and the only way to show these profiles or content is to log in and turn on the "Display media that may contain sensitive content" setting, which can be found in Settings -> Privacy and safety -> Content you see. After this is done you'll be able to see all of the hidden content when logged in.

At the moment TumblThree cannot download the second type at all because of the account setting flag that is required. Users that are marked as Type 2, and those that only allow followers to see their content, will not be downloaded when added. This also applies to any content users have that is marked with the Type 2 filter, and it will be the reason why lots of files are missing when you download.

The only solution to this problem is to implement authentication to Twitter and manually change the setting on your account.
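The rules above can be summed up in a small illustrative sketch. This is Python written only for this explanation, not TumblThree code; the `Filter` and `Item` types and both function names are hypothetical, and it assumes each profile or post has already been classified somehow.

```python
from dataclasses import dataclass
from enum import Enum


class Filter(Enum):
    NONE = "none"                # no NSFW filter
    WARNING = "warning"          # type 1: hidden behind a View button
    LOGIN_REQUIRED = "login"     # type 2: no View button, needs login + setting


@dataclass
class Item:
    name: str
    filter: Filter
    followers_only: bool = False


def downloadable_without_login(item):
    """True if an anonymous crawl can be expected to reach this item."""
    if item.followers_only:
        return False
    return item.filter is not Filter.LOGIN_REQUIRED


def split_by_access(items):
    """Return (reachable, skipped) item names for an anonymous crawl."""
    reachable = [i.name for i in items if downloadable_without_login(i)]
    skipped = [i.name for i in items if not downloadable_without_login(i)]
    return reachable, skipped
```

For example, a blog whose posts are mostly type 2 would end up almost entirely in the skipped list, which matches the large gaps reported earlier in this thread.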

thomas694 commented 2 years ago

Thanks to both of you for giving more details.

Technotron21 commented 2 years ago

I think the issue has now accelerated to a point where more drastic measures need to be taken.

I opened up the program just now, after it was functioning completely fine yesterday, and now I'm suddenly seeing a lot of blogs in my list being flagged as offline. I was about to open a new bug report, having assumed this applied to all things Twitter, but sure enough every affected account is one that has NSFW content as a major feature.

So this has now jumped from being restricted in certain territories to being completely blocked off without use of an account. Given what I previously mentioned about jdownloader2, I think it's becoming apparent that some form of sign in authentication is really needed at this point if the website is gonna alter the API this much. I don't know how practical it is currently with how the program is designed, but I figure we try while it's at this level.

(BTW, I realized right after my last post a different reason why some blogs didn't seem to grab as much as they used to. It mainly comes down to the retweet system, since there seems to be a limit on how many you can have before it resets and removes earlier ones. Obviously there's nothing one can do about that, since it's a regular function of how the site operates (if one that is very much an open secret of sorts). Also, given accounts can be deleted, that probably would explain the odd missing post here and there as well (giving further drive to want this issue addressed, cause I'm kind of anticipating Tumblr 2.0 at this rate).)

EDIT: Given there's a Cloudflare outage currently, I wonder if this is related to that (as unlikely as it is with Twitter itself still being accessible). Also, there's a solid possibility that users being shadowbanned (plus whatever else that one website detects) could be at play here.