JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0
4.5k stars, 712 forks

Weibo: "User does not exist" when using --name on certain accounts #444

Open TheTechRobo opened 2 years ago

TheTechRobo commented 2 years ago

Haven't tested with user IDs.

~/u/steam ❯❯❯ python3 parse_weibo.deduped >> weibo.jsonl
['snscrape', '--jsonl', '--progress', 'weibo-user', '--name', 'fangshengmeng']
2022-04-03 22:27:12.723  WARNING  snscrape.modules.weibo  User does not exist
Finished, 0 results
['snscrape', '--jsonl', '--progress', 'weibo-user', '--name', 'qukean']
2022-04-03 22:27:14.052  WARNING  snscrape.modules.weibo  User does not exist
Finished, 0 results
['snscrape', '--jsonl', '--progress', 'weibo-user', '--name', 'yetuavg']
2022-04-03 22:27:15.280  WARNING  snscrape.modules.weibo  User does not exist
Finished, 0 results
['snscrape', '--jsonl', '--progress', 'weibo-user', '--name', 'zhangfrank110']
2022-04-03 22:27:16.472  WARNING  snscrape.modules.weibo  User does not exist
Finished, 0 results

With verbose output (can't get locals because it didn't crash; you should add an option to dump them anyway):

~/u/steam ❯❯❯ snscrape -v --progress --jsonl weibo-user --name fangshengmeng  2
2022-04-03 22:29:29.953  INFO  snscrape.base  Retrieving https://m.weibo.cn/n/fangshengmeng
2022-04-03 22:29:31.017  INFO  snscrape.base  Retrieved https://m.weibo.cn/n/fangshengmeng: 200
2022-04-03 22:29:31.017  WARNING  snscrape.modules.weibo  User does not exist
2022-04-03 22:29:31.017  INFO  snscrape._cli  Done, found 0 results
Finished, 0 results

Also it seems really unintuitive to have to add --name as an option if it's not a user ID; could this be fixed like it was with the Twitter scraper, i.e. seeing if it's an int?

JustAnotherArchivist commented 2 years ago

Can reproduce that with those names, but as far as I can tell, none of them exist (or their profiles require logging in, perhaps?). Others work correctly. Random example: Angelinazhaoooo (though it crashes with a KeyError on the video extraction very quickly).

The Twitter scraper also has an explicit flag, --user-id. Automatic detection for that obviously breaks when someone has a username composed solely of digits.
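To illustrate the ambiguity, here is a minimal sketch of the naive "is it an int?" check. The function name and the all-digit username are hypothetical and not part of snscrape.

```python
def guess_is_user_id(identifier: str) -> bool:
    """Naive heuristic: treat an all-digit string as a numeric user ID."""
    return identifier.isdigit()


# An ordinary name is classified as a name, as expected.
print(guess_is_user_id("qukean"))       # False
# A numeric user ID is classified as an ID, as expected.
print(guess_is_user_id("1223717857"))   # True
# A (hypothetical) username made up only of digits is also classified as an ID,
# so the heuristic cannot tell these two cases apart.
print(guess_is_user_id("12345"))        # True
```

With an explicit flag, the caller states which kind of identifier is meant, so the last case is no longer ambiguous.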

JustAnotherArchivist commented 2 years ago

Also, to dump locals on every WARNING or higher, there is a global option: --dump-locals. (Yes, it should probably get a better name.)

TheTechRobo commented 2 years ago

> The Twitter scraper also has an explicit flag, --user-id. Automatic detection for that obviously breaks when someone has a username composed solely of digits.

Oh, I guess that's true.

TheTechRobo commented 2 years ago

I could have sworn they existed when I loaded them in a browser, but maybe I'm wrong. Sorry for opening this invalid issue, I guess.

Wait, https://weibo.com/qukean exists, I think.

JustAnotherArchivist commented 2 years ago

Maybe, but that's behind a login wall. The mobile site, which is publicly accessible and therefore used by snscrape, says it doesn't exist: https://m.weibo.cn/n/qukean
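For reference, the URL used for the name lookup can be sketched as follows. The helper name is hypothetical; only the URL pattern comes from the verbose log earlier in the thread, and the percent-encoding is an assumption of this sketch, not necessarily what snscrape does.

```python
from urllib.parse import quote

# URL pattern as it appears in the verbose log ("Retrieving https://m.weibo.cn/n/...").
MOBILE_NAME_URL = "https://m.weibo.cn/n/{}"


def mobile_profile_url(name: str) -> str:
    """Hypothetical helper: build the mobile-site lookup URL for a username."""
    # Percent-encode the name so non-ASCII or reserved characters stay valid in the path.
    return MOBILE_NAME_URL.format(quote(name, safe=""))


print(mobile_profile_url("qukean"))  # https://m.weibo.cn/n/qukean
```

Opening the printed URL in a browser shows what the mobile site returns for that name without going through weibo.com.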

TheTechRobo commented 2 years ago

Oh, I'm using weibo.com. Is that different? I don't have to log in for weibo.com/qukean:

image

JustAnotherArchivist commented 2 years ago

Yeah, it's not really a login, but it's an auth system of sorts with awful JS stuff to get cookies for accessing weibo.com (that I didn't want to reimplement). It is the same service though, so it's interesting that this profile is only accessible on weibo.com but not on m.weibo.cn.

TheTechRobo commented 2 years ago

Oh yeah, I noticed that redirect. Sounds very annoying to bypass or mimic.

JustAnotherArchivist commented 2 years ago

Yeah, the only way to fix this would be to reimplement that auth flow. Not something I'll tackle anytime soon, I think.

JustAnotherArchivist commented 2 years ago

It seems that only the name resolution is the problem here. qukean is user ID 1223717857, and that works fine on the mobile site (and consequently with snscrape). The name resolution on weibo.com is still behind the same auth flow though, so this insight doesn't really change anything, but at least you can manually work around it by observing the user ID in the network monitor when loading the profile page and then using that.
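A minimal sketch of that workaround, for anyone driving snscrape from a wrapper script like the one in the first comment. The helper is hypothetical and only builds the argument list; the flags are the ones already shown in this thread, and it assumes the positional argument is treated as a user ID when --name is omitted.

```python
import shlex
from typing import List


def build_weibo_command(user: str, is_name: bool) -> List[str]:
    """Hypothetical helper: build an snscrape weibo-user command line."""
    argv = ["snscrape", "--jsonl", "--progress", "weibo-user"]
    if is_name:
        # Without --name, the positional argument is taken to be a user ID.
        argv.append("--name")
    argv.append(user)
    return argv


# Lookup by name, which currently fails for qukean on the mobile site.
by_name = build_weibo_command("qukean", is_name=True)
# Lookup by the user ID observed in the browser's network monitor.
by_id = build_weibo_command("1223717857", is_name=False)

print(shlex.join(by_name))
print(shlex.join(by_id))
```

The resulting list can be passed to subprocess.run as-is; the user ID still has to be found manually for each affected account.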