geoffjentry / twitteR

R based twitter client
http://cran.r-project.org/web/packages/twitteR/index.html

ID exceeds R number precision #119

Closed joergsteinkamp closed 8 years ago

joergsteinkamp commented 8 years ago

There are a few very long user IDs that exceed the number precision of the standard R types. E.g. "@HungryTrees" should have ID 725300176525598721 but gets 725300176525598720 in my list. I checked it at https://tweeterid.com/. Here is an example script that shows the wrong IDs:

library(twitteR)

my.name <- "JoergSteinkamp"

keys <- c(consumer.key="", consumer.secret="", access.token="", access.token.secret="")

setup_twitter_oauth(keys[1], keys[2], keys[3], keys[4])

me <- getUser(my.name)
followers <- me$getFollowers()
followerIDs <- me$getFollowerIDs()

message(sprintf("followers and followerIDs differ in length: %d %d", length(followers), length(followerIDs)))
message(paste("Missing/wrong: '", paste(setdiff(followerIDs, twListToDF(followers)$id), collapse="', '"), "'", sep=""))

optykali commented 8 years ago

try

options(digits=20)

joergsteinkamp commented 8 years ago

That didn't help, I already tried it. options(digits=X) only increases the number of digits printed and not the precision of the datatype numeric.
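A minimal sketch of that distinction, using only base R. The ID literal is rounded to the nearest double as soon as it is parsed, and changing `digits` afterwards only affects how the stored value is displayed:

```r
# The literal below cannot be represented exactly as a double,
# so R stores the nearest representable value instead.
id <- 725300176525598721

old <- options(digits = 20)     # changes printing only, not the stored value
stored <- sprintf("%.0f", id)   # decimal expansion of the stored double
options(old)                    # restore the previous setting

# Both literals end up as the same double.
same_value <- id == 725300176525598720
```

Here `stored` holds the digits R actually kept, and `same_value` shows that the two different ID literals are indistinguishable once they are numeric.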

optykali commented 8 years ago

Quick question: I get a 403 response when using IDs that exceed double precision. When I hardcode the ID as such, I don't. Did you experience the same problem?

For replication purposes, I have a file with just the last ID I found:

max.id <- read.table(paste0(basefolder,"max_id.txt"))[1,1]

max.id in file: 732852547225047041
max.id after reading to R: 732852547225047040

searchTwitter(queryterm, n=1000, sinceID = max.id) => 403 error
searchTwitter(queryterm, n=1000, sinceID = as.numeric(max.id)) => 403 error
searchTwitter(queryterm, n=1000, sinceID = as.character(max.id)) => 403 error
searchTwitter(queryterm, n=1000, sinceID = "732852547225047040") => 403 error (!!!!)
searchTwitter(queryterm, n=1000, sinceID = "732852547225047041") => works

My temporary workaround:

max.id <- read.table(paste0(basefolder,"max_id.txt"),colClasses=c("character"))[1,1]
searchTwitter(queryterm, n=1000, sinceID = max.id) => works

So basically, treating the ID as character from the start (rather than calling as.character() afterwards) helps.
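The difference between the two ways of reading the file can be reproduced without touching the Twitter API. This is a sketch in base R; the temporary file stands in for max_id.txt and is not the original file:

```r
# Write a one-line file holding an ID that does not fit into a double.
id_file <- tempfile(fileext = ".txt")
writeLines("732852547225047041", id_file)

# Default behaviour: the column is converted to numeric, which rounds the ID.
as_number <- read.table(id_file)[1, 1]

# colClasses = "character" keeps the digits exactly as written in the file.
as_string <- read.table(id_file, colClasses = "character")[1, 1]

# Converting the numeric value back to text cannot recover the lost digit.
recovered <- sprintf("%.0f", as_number)
digits_lost <- recovered != as_string

unlink(id_file)
```

Once the value has passed through a numeric column, the original last digit is gone, which would explain why the later as.character() call still produced a 403.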

joergsteinkamp commented 8 years ago

Yes, that's it. However, the as.character() workaround doesn't work with getUser(): If you pass a character to getUser it interprets it as screenName and not ID even if it consists only of numbers.


optykali commented 8 years ago

Workaround: screenNames should be unique. So I would use the screenname for queries. BTDT. Works for me.

joergsteinkamp commented 8 years ago

Yes, if I get the screenName! E.g. getFollowers() omits the users with such long IDs, and getFollowerIDs() returns the wrong IDs:

me <- getUser("JoergSteinkamp")
followers <- me$getFollowers()
followers.df <- twListToDF(followers)
followerIDs <- me$getFollowerIDs()

print(paste("number of followers by ID: ", length(followerIDs), "; by twList: ", nrow(followers.df), sep=""))
## [1] "number of followers by ID: 134; by twList: 131"

print(setdiff(followerIDs, followers.df$id))
## [1] "725300176525598720" "710593895739072512" "714812250276671488"

The IDs returned by getFollowerIDs() are wrong. They should be:

725300176525598721 (HungryTrees)
710593895739072513 (BinaryOptions60)
714812250276671489 (CENunihh)

Each correct ID is exactly 1 more than the ID returned by getFollowerIDs().
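The off-by-one pattern is what rounding to double precision would produce for these particular values. The sketch below only illustrates that rounding in base R; that twitteR converts the IDs to numeric somewhere along the way is an assumption here, not something this check proves:

```r
# The three correct IDs, kept as character strings.
correct_ids <- c("725300176525598721",
                 "710593895739072513",
                 "714812250276671489")

# Convert to double and back to text, as a numeric code path would.
round_tripped <- sprintf("%.0f", as.numeric(correct_ids))

# Side-by-side comparison of the original and round-tripped values.
comparison <- data.frame(correct = correct_ids,
                         after_numeric = round_tripped,
                         stringsAsFactors = FALSE)
changed <- correct_ids != round_tripped
```

Printing `comparison` puts the correct IDs next to the values that survive the numeric round trip.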

Can you reproduce that?

chlorenz commented 8 years ago

As far as I know screenNames are unique but can change, which might potentially be an issue.

@joergsteinkamp It seems like getUser() calls parseUsers() which uses as.numeric() to differentiate whether a screenName or a uid has been given as an argument. My understanding is that this generally fails for uids bigger than 9007199254740992 (see http://stackoverflow.com/questions/1848700/biggest-integer-that-can-be-stored-in-a-double), because:

> 2^53
[1] 9007199254740992
> 2^53 + 1
[1] 9007199254740992
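The same boundary can be checked programmatically. This is a small base-R sketch; it formats the values with sprintf() because, depending on the `digits` and `scipen` options, R may print numbers of this size in scientific notation:

```r
# Doubles carry a 53-bit significand, so consecutive integers are
# exactly representable only up to 2^53.
significand_bits <- .Machine$double.digits

limit <- 2^53
limit_text <- sprintf("%.0f", limit)

collapses <- (limit + 1) == limit   # 2^53 + 1 rounds back onto 2^53
distinct  <- (limit + 2) != limit   # 2^53 + 2 is representable again

# R's native integer type is 32-bit and is far too small for these IDs.
max_int <- .Machine$integer.max
```

Twitter IDs with 18 digits lie well above `limit`, so neither R's integer type nor its double type can hold them exactly.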

Here's a hacky workaround that bypasses this and behaves similarly to getUser(). Please use at your own risk:

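# Like getUser(), but always treats its argument as a user ID given as a character string.
getUser2 <- function(userIdChar, ...) {
  # Pass the ID in the first (uid) position and NULL for screen names,
  # so the string is never converted with as.numeric().
  params <- buildUserList(userIdChar, NULL)
  buildUser(twInterfaceObj$doAPICall(paste('users', 'show', sep='/'), params=params, ...))
}

# Evaluate getUser2 inside the twitteR namespace so that the internal
# helpers used above (buildUserList, buildUser, twInterfaceObj) are found.
environment(getUser2) <- asNamespace('twitteR')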
> getUser2("725300176525598721")
[1] "HungryTrees"
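A more general way to avoid the numeric conversion would be to decide between ID and screen name by inspecting the string itself. The helper below is hypothetical and not part of twitteR; it is only a sketch of the idea, and it assumes that an all-digit string is meant as an ID:

```r
# Hypothetical helper, not part of twitteR: split user identifiers into
# numeric IDs and screen names without ever converting to numeric.
split_user_ids <- function(users) {
  users <- as.character(users)
  is_id <- grepl("^[0-9]+$", users)   # digits only: treat as a user ID
  list(uids = users[is_id], screen_names = users[!is_id])
}

parts <- split_user_ids(c("725300176525598721", "JoergSteinkamp"))
```

The IDs have to arrive as character strings for this to help. A value that was already numeric has lost its last digits before as.character() is applied, and a screen name made up only of digits would be misclassified.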