cjbarrie / academictwitteR

Repo for academictwitteR package to query the Twitter Academic Research Product Track v2 API endpoint.
Other
272 stars 59 forks

extraction of $places from jsons #175

Open jcs82 opened 3 years ago

jcs82 commented 3 years ago

When running get_all_tweets and specifying "has:geo" in the search string, place_ids are returned in the json, but not the fields grouped under "place.fields" ("contained_within,country,country_code,full_name,geo,id,name,place_type"), though these are included in get_all_tweets.R.

To Reproduce
Any standard request including "has:geo"

Expected behavior
To have "place.fields" included in the resulting .json, as requested in get_all_tweets.R (line 93)

Session Info: R version 4.0.5 (2021-03-31) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Big Sur 10.16

Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] devtools_2.4.2 usethis_2.0.1 data.table_1.14.0 jsonlite_1.7.2
[5] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 readr_1.4.0
[9] tidyr_1.1.3 tibble_3.1.2 ggplot2_3.3.4 tidyverse_1.3.1
[13] purrr_0.3.4 academictwitteR_0.2.0

loaded via a namespace (and not attached): [1] Rcpp_1.0.6 lubridate_1.7.10 prettyunits_1.1.1 ps_1.6.0 assertthat_0.2.1 rprojroot_2.0.2
[7] utf8_1.2.1 R6_2.5.0 cellranger_1.1.0 backports_1.2.1 reprex_2.0.0 httr_1.4.2
[13] pillar_1.6.1 rlang_0.4.11 curl_4.3.2 readxl_1.3.1 rstudioapi_0.13 callr_3.7.0
[19] desc_1.3.0 munsell_0.5.0 broom_0.7.7 compiler_4.0.5 modelr_0.1.8 pkgconfig_2.0.3
[25] pkgbuild_1.2.0 tidyselect_1.1.1 fansi_0.5.0 crayon_1.4.1 dbplyr_2.1.1 withr_2.4.2
[31] grid_4.0.5 gtable_0.3.0 lifecycle_1.0.0 DBI_1.1.1 magrittr_2.0.1 scales_1.1.1
[37] stringi_1.6.2 cli_2.5.0 cachem_1.0.5 fs_1.5.0 remotes_2.4.0 testthat_3.0.3
[43] xml2_1.3.2 ellipsis_0.3.2 generics_0.1.0 vctrs_0.3.8 tools_4.0.5 glue_1.4.2
[49] hms_1.1.0 processx_3.5.2 pkgload_1.2.1 fastmap_1.1.0 colorspace_2.0-1 sessioninfo_1.1.1
[55] rvest_1.0.0 memoise_2.0.0 haven_2.4.1

cjbarrie commented 3 years ago

Would be helpful to have a reprex here if we are to test this. That said, the position of the returned place_id depends on how the payload is returned by Twitter. If you want to flatten and tidy the data, you could try the bind_tweets tidy tibble formats in the dev version.

jcs82 commented 3 years ago

ac-tw_pdit_share.R.zip

jcs82 commented 3 years ago

This (above) is the script that I am running. It loops through a number of generated expressions. I am flattening the json with the bind_tweets function. I am getting place_id for all tweets, but not the place names!

cjbarrie commented 3 years ago

Please consult other Issues to see how to share code; developers will be reluctant to open a zipped file from an online source. Still not sure what the issue is here. But note that place names are recorded in the users_* json files. See https://github.com/cjbarrie/academictwitteR/discussions/45#discussioncomment-675541 for discussion of this point.

jcs82 commented 3 years ago

Ah, apologies - I tried to share the R file, but it is not a permitted file type. You can reproduce what I am doing with:

####################

# Build query
query <- paste0('(', '"stilton" OR "roquefort"', ')', ' has:geo')

# set path from type
path <- "enter path name"

get_all_tweets(query,
               start_tweets,
               end_tweets,
               bearer_token,
               file = NULL,
               data_path = path,
               export_query = TRUE,
               bind_tweets = FALSE,
               verbose = TRUE,
               n = 1000000,
               page_n = 500)

resume_collection(path, bearer_token)

####################

# bind jsons
data  <- bind_tweets(path, user = FALSE)
users <- bind_tweets(path, user = TRUE)

####################

The location field in the users json files corresponds to user-entered location data. I am looking for the place names that Twitter provides that correspond to the place_id field. See: https://developer.twitter.com/en/docs/twitter-api/v1/geo/place-information/api-reference/get-geo-id-place_id

This shows how to request the information using the place_id - but it should be included in the original json if requested (which it appears to be in the get_all_tweets.R code)...

jcs82 commented 3 years ago

Further example for clarification: in my dataset, I have a data point with the place_id 315b740b108481f6. The 'location' field that corresponds to this user is just "England", as it is simply what the user entered when they set up their account. However, if I run twurl to get location data from the place_id, I get all of this:

twurl /1.1/geo/id/315b740b108481f6.json {"id":"315b740b108481f6","name":"Manchester","full_name":"Manchester, England","country":"United Kingdom","country_code":"GB","url":"https://api.twitter.com/1.1/geo/id/315b740b108481f6.json","place_type":"city","attributes":{"geotagCount":"2049"},"bounding_box":{"type":"Polygon","coordinates":[[[-2.319934,53.343623],[-2.319934,53.5702824],[-2.147026,53.5702824],[-2.147026,53.343623],[-2.319934,53.343623]]]},"centroid":[-2.2071462643114526,53.4569527],"contained_within":[{"id":"0124f5325ea573e7","name":"Greater Manchester","full_name":"Greater Manchester","country":"United Kingdom","country_code":"GB","url":"https://api.twitter.com/1.1/geo/id/0124f5325ea573e7.json","place_type":"admin","attributes":{},"bounding_box":{"type":"Polygon","coordinates":[[[-2.730521,53.194762],[-2.730521,53.685734],[-1.653818,53.685734],[-1.653818,53.194762],[-2.730521,53.194762]]]},"centroid":[-2.1117764635900707,53.440248]}],"polylines":[],"geometry":null}

All of these fields should be available with the initial request (so long as expansions=geo.place_id and tweet.fields=geo, which appears to be the case already). So I am not sure why they are not being returned.

chainsawriot commented 3 years ago

Short answer: the data are there, but the mechanism to extract those geo data is yet to be developed. In the meantime, you can use the place_id to join with the places data (or you can help us with the development).

(I am running this with an IP from Germany, so the place names are in German: "Nizza, Frankreich" is Nice, France. But you get the idea.)

require(academictwitteR)
#> Loading required package: academictwitteR

tmpdir <- academictwitteR:::.gen_random_dir()

query <- paste0('(','"stilton" OR "roquefort"',')', ' has:geo')

get_all_tweets(query,
               start_tweets = "2020-01-01T00:00:00Z",
               end_tweets = "2021-06-30T00:00:00Z",
               file = NULL,
               data_path = tmpdir,
               export_query = TRUE,
               bind_tweets = FALSE,
               verbose = TRUE,
               n = 1000,
               page_n = 500)
#> query:  ("stilton" OR "roquefort") has:geo 
#> Total pages queried: 1 (tweets captured this page: 495).
#> Total pages queried: 2 (tweets captured this page: 498).
#> Total pages queried: 3 (tweets captured this page: 499).
#> Total tweets captured now reach 1000 : finishing collection.
#> Data stored as JSONs: use bind_tweets function to bundle into data.frame

files <- list.files(tmpdir, pattern = "^users")
user_content <- jsonlite::read_json(file.path(tmpdir, files[1]))
places_content <- user_content$places
places_content[[1]]
#> $country_code
#> [1] "FR"
#> 
#> $geo
#> $geo$type
#> [1] "Feature"
#> 
#> $geo$bbox
#> $geo$bbox[[1]]
#> [1] 7.1821
#> 
#> $geo$bbox[[2]]
#> [1] 43.6453
#> 
#> $geo$bbox[[3]]
#> [1] 7.324
#> 
#> $geo$bbox[[4]]
#> [1] 43.7608
#> 
#> 
#> $geo$properties
#> named list()
#> 
#> 
#> $full_name
#> [1] "Nizza, Frankreich"
#> 
#> $id
#> [1] "23f8a07383ac617e"
#> 
#> $place_type
#> [1] "city"
#> 
#> $country
#> [1] "Frankreich"
#> 
#> $name
#> [1] "Nizza"

tweets <- bind_tweets(tmpdir)
#> ================================================================================
#place_id
tweets$geo$place_id[1]
#> [1] "23f8a07383ac617e"

unlink(tmpdir, recursive = TRUE)

Created on 2021-06-30 by the reprex package (v2.0.0)
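The place_id join suggested above can be sketched as follows. This is a toy example, not package code: the data frames stand in for the bind_tweets output and the extracted $places list (in real output the id sits nested at tweets$geo$place_id; it is flattened here for brevity, and the ids/names are just the examples from this thread).

```r
library(dplyr)

# Toy stand-ins for the bind_tweets output and the $places data
# (hypothetical; real values come from the API)
tweets <- tibble(
  text     = c("stilton!", "roquefort!"),
  place_id = c("23f8a07383ac617e", "315b740b108481f6")
)
places <- tibble(
  id        = c("23f8a07383ac617e", "315b740b108481f6"),
  full_name = c("Nizza, Frankreich", "Manchester, England")
)

# Attach the place metadata to each tweet via its place_id
tweets_geo <- left_join(tweets, places, by = c("place_id" = "id"))
```

After the join, tweets_geo carries a full_name column alongside each tweet.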

jcs82 commented 3 years ago

Oh, Awesome. Thanks for this, I will definitely have a look at helping out with dev. Cheers, Jon

jcs82 commented 3 years ago

Hi, so the following code works for extracting places. It's taken from the bind_tweets.R code. The only weird thing was that it threw an error with simplifyVector = TRUE. Setting it to FALSE avoids that error, but means that it might be tricky to integrate into the function as it stands (which has simplifyVector = T).

df.all <- data.frame()
for (i in seq_along(files)) {
  file.name <- files[[i]]
  df <- read_json(file.name, simplifyVector = FALSE)
  df <- df$places
  df.all <- bind_rows(df.all, df)
}
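A tidier sketch of the same extraction with purrr, avoiding growing a data.frame inside the loop. A toy users_*.json is written to a temp dir here purely so the example is self-contained; real files come from the data_path used in get_all_tweets.

```r
library(jsonlite)
library(purrr)
library(dplyr)

# Write a toy users_*.json so the sketch runs without API access
tmp <- tempfile()
dir.create(tmp)
write_json(
  list(places = list(list(id = "a1", full_name = "Nice, France"),
                     list(id = "b2", full_name = "Manchester, England"))),
  file.path(tmp, "users_1.json"),
  auto_unbox = TRUE
)

files <- list.files(tmp, pattern = "^users", full.names = TRUE)

# One row per place, across all files
places_all <- map_dfr(files, function(f) {
  bind_rows(read_json(f, simplifyVector = FALSE)$places)
})
```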

JohMast commented 3 years ago

Since I am very interested in working with geolocated tweets, I have also encountered the need to add the place information.

I have adjusted the get_tweets function to unnest the places object (turning it into a dataframe, including the bounding box coordinates) and then left_join it to the main tweets dataframe (df$data) via the place_id contained there.

Can you see any issue with this approach? It seems to work for me so far, but I can imagine it may cause inconsistencies with other functionalities of the academictwitteR package - since the df object is expanded.
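That unnest step can be sketched like this, assuming a $places list shaped like the reprex earlier in the thread (the flat column names here are my own choice for illustration, not the package's):

```r
library(dplyr)

# Toy $places list shaped like the json payload shown earlier in the thread
places <- list(
  list(id = "23f8a07383ac617e", full_name = "Nizza, Frankreich",
       place_type = "city",
       geo = list(bbox = list(7.1821, 43.6453, 7.324, 43.7608)))
)

# Flatten each place, keeping the bounding box as four numeric columns
places_df <- bind_rows(lapply(places, function(p) {
  tibble(place_id   = p$id,
         full_name  = p$full_name,
         place_type = p$place_type,
         bbox_xmin  = p$geo$bbox[[1]],
         bbox_ymin  = p$geo$bbox[[2]],
         bbox_xmax  = p$geo$bbox[[3]],
         bbox_ymax  = p$geo$bbox[[4]])
}))

# The tweets dataframe could then be joined on its geo place_id, e.g.
# left_join(tweets, places_df, by = "place_id")
```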

jcs82 commented 3 years ago

Hi, yes - this sounds similar to what I ended up doing. I don’t think there should be any issue with what you describe here.

The only thing I did as well was to run the json extraction as a ‘foreach’ loop to use more cores. This sped things up considerably (roughly 2x increase just using 3 threads, would be faster still if you had more cores, of course.).
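A minimal sketch of that foreach version, assuming the doParallel backend. The toy users_*.json files written to a temp dir are hypothetical, only there so the sketch runs standalone; in practice files would point at the get_all_tweets data_path.

```r
library(foreach)
library(doParallel)
library(jsonlite)

# Toy users_*.json files standing in for the real data_path contents
tmp <- tempfile()
dir.create(tmp)
for (i in 1:2) {
  write_json(list(places = list(list(id = paste0("id", i),
                                     full_name = paste0("Place ", i)))),
             file.path(tmp, paste0("users_", i, ".json")),
             auto_unbox = TRUE)
}
files <- list.files(tmp, pattern = "^users", full.names = TRUE)

# Spread the json parsing over worker processes
cl <- makeCluster(2)
registerDoParallel(cl)
places_all <- foreach(f = files, .combine = dplyr::bind_rows,
                      .packages = c("jsonlite", "dplyr")) %dopar% {
  dplyr::bind_rows(jsonlite::read_json(f, simplifyVector = FALSE)$places)
}
stopCluster(cl)
```

With more files per worker the speed-up scales roughly with the number of cores, as noted above.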


JohMast commented 3 years ago

Thanks for the tip! I will try that. Currently, I am binding the places to the tweets before the json files are even written. That simplifies the use of place information going forward. But it is obviously a lot less storage efficient, since the size of the tweets json increases substantially. For my purposes, I think this is an acceptable trade-off only because I am using the place information a lot at the moment, but as a general solution it is far from optimal.

jcs82 commented 3 years ago

Interesting that you bind the places to the tweets before the jsons are written - that'd be better, actually. The places are there in the user jsons but nested, and have to be extracted separately. Either way it does feel a little inefficient at the moment...


UnknownAlienTechnologies commented 2 years ago

So has this been fixed? I need the Full_name location of tweets.

marisolmanfredi commented 2 years ago

HEEEELP !!

So I'm also using a query for a place, but as it didn't recognize the name of the city (it's in Argentina), I'm using the point_radius = c(-57.649074, -38.083938, 20) argument to build my query.

However, when I use this query with get_all_tweets, also for a specific time period in the past, I only get 100 tweets... when I'm sure there are more. R tells me: Total pages queried: 1 (tweets captured this page: 492). Total tweets captured now reach 100 : finishing collection.

Am I doing something wrong? How can I get more tweets? Could I have reached the limit as an academic researcher?

cjbarrie commented 2 years ago

Please provide code if you want a sensible response. My guess is that you have not changed the n = 100 default. Set n = Inf to capture all tweets.