DanielaSfregola / twitter4s

An asynchronous non-blocking Scala client for both the Twitter Rest and Streaming API
Apache License 2.0

Support for full-archive / academic research track endpoints #366

Open TheConner opened 3 years ago

TheConner commented 3 years ago

Hello :smiley: This library has been great to use; it's been incredibly useful for the research work I have been doing. If my understanding is correct, it currently supports only the "standard" search endpoint, search/tweets.json.

After doing some digging, I've found that this endpoint won't offer the amount of data I need for my research. I would need something like the full-archive search endpoint, /2/tweets/search/all, which seems to be offered only under v2 of the Twitter API (and only for the academic research product track, interestingly).

I'm not sure if that is something this project supports. If it's not supported yet, I'm assuming there's a fair amount of technical debt to address in order to make this library compatible?

Since most of my research work depends on this Scala library, I'd be happy to contribute if need be :)

DanielaSfregola commented 3 years ago

Hi @TheConner, I am glad this library has been useful!

Unfortunately, we do not support any v2 endpoints yet (simply due to lack of time), but this shouldn't be too hard to do if you want to give it a try! PRs are always welcome (and I'm happy to advise/support you in the process).

The authentication seems to be the same, although it seems like you need a token that is approved for Academic Research (which I do not have, but I am going to assume you do)...so that is something that you shouldn't worry about.

My suggestion is to create a new trait for the Rest client (similar to any of the ones listed here) in which you define the shape of your endpoint and the params to pass to it.

Cheers, Daniela

TheConner commented 3 years ago

Hi @DanielaSfregola Thanks for the helpful response! I've found some time to get started on this, luckily I only need to implement a few endpoints. I've done some work to build this, but I do have two questions:

  1. It looks like in order to make this work I need to add bearer token support (from what I read in http.clients.authentication, there does not seem to be bearer token support) which only requires one header (for curl users, -H "Authorization: Bearer $BEARER_TOKEN"). I can add this in and do the plumbing needed to make that work; however, it looks like everything in here is built around oauth, so I'm unsure how I should go about integrating this with my changes. Any advice on doing this would be appreciated!
  2. Once I'm done the plumbing work for (1), I can get to testing the endpoint I've implemented. Should I just add new tests for all of my changes? If I do implement tests they will be dependent on a bearer token that I have that was given to me by my university, and I don't think they would be too happy if I shared it (I notice some hard-coded keys in the unit tests). With this in mind, how should I go about testing this & integrating those tests with the rest of your codebase?
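As a rough illustration of the single-header point in (1), here is a minimal sketch that builds a request to /2/tweets/search/all with only a bearer token. It deliberately uses the JDK's `java.net.http` (Java 11+) rather than twitter4s internals; `BearerAuthSketch` and its method names are hypothetical and are not part of the library.

```scala
import java.net.{URI, URLEncoder}
import java.net.http.HttpRequest

// Hypothetical helper, not twitter4s API: shows that bearer auth is one header.
object BearerAuthSketch {

  // Equivalent of curl's -H "Authorization: Bearer $BEARER_TOKEN".
  def authorizationHeader(bearerToken: String): (String, String) =
    "Authorization" -> s"Bearer $bearerToken"

  // Builds (but does not send) a GET request for the full-archive search endpoint.
  def fullArchiveSearchRequest(bearerToken: String, query: String): HttpRequest = {
    val (name, value) = authorizationHeader(bearerToken)
    val encodedQuery  = URLEncoder.encode(query, "UTF-8")
    HttpRequest
      .newBuilder(URI.create(s"https://api.twitter.com/2/tweets/search/all?query=$encodedQuery"))
      .header(name, value)
      .GET()
      .build()
  }
}
```

Unlike OAuth 1.0a, nothing here depends on the HTTP method, URL, or a nonce, which is why the bearer path needs far less plumbing than the existing signing code.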

Thank you for the guidance! I'm new to the Scala way of doing things, so I may be missing out on some things that are just common sense to more seasoned Scala developers.

Regards, Conner

Edit: I think I found some answers to my bearer authentication issues in issue #237 - going to investigate the changes there :)

DanielaSfregola commented 3 years ago

  1. Issue #237 is definitely a good starting point. You can even create a new client that assumes the user supplies the bearer token at initialization.
  2. Tests must be there but, as you said, we cannot put real tokens in them. In the current library, we completely stub (simulate the behaviour of) the Twitter API by parsing JSON files -- there are plenty of examples of how to do that. But we can discuss this after you have a working-ish implementation.
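The stubbing idea in (2) can be sketched without any real token. This is not how twitter4s's own test helpers are written; `Transport`, `SearchClient`, and `StubTransport` are hypothetical names, and the inline JSON string stands in for a fixture file whose shape is only illustrative.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object StubbedSearchSketch {

  // Hypothetical seam: the only thing the client knows about HTTP.
  trait Transport {
    def get(url: String, headers: Map[String, String]): Future[String]
  }

  // Hypothetical client for the full-archive search endpoint.
  final class SearchClient(bearerToken: String, transport: Transport) {
    def searchAll(query: String): Future[String] =
      transport.get(
        s"https://api.twitter.com/2/tweets/search/all?query=$query",
        Map("Authorization" -> s"Bearer $bearerToken")
      )
  }

  // Stub that records the last request and answers with canned JSON,
  // so no network call is made and no real credential is needed.
  final class StubTransport(cannedJson: String) extends Transport {
    @volatile var lastRequest: Option[(String, Map[String, String])] = None

    def get(url: String, headers: Map[String, String]): Future[String] = {
      lastRequest = Some(url -> headers)
      Future.successful(cannedJson)
    }
  }

  // Runs the client against the stub; returns the body and the recorded request.
  def run(): (String, Option[(String, Map[String, String])]) = {
    val fixture = """{"data":[{"id":"1","text":"hello"}],"meta":{"result_count":1}}"""
    val stub    = new StubTransport(fixture)
    val client  = new SearchClient("fake-token", stub)
    val body    = Await.result(client.searchAll("scala"), 1.second)
    (body, stub.lastRequest)
  }
}
```

A test can then assert both on the parsed response and on the recorded request (URL and Authorization header), with "fake-token" in place of the university-issued token.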

Thanks for helping with the library! Do not worry if you are new to Scala, I will help you with that ;)

DanielaSfregola commented 3 years ago

Also, you do not need to have a PR ready to ask for help - if you get stuck just ping me a branch and I can advise/help :)

TheConner commented 3 years ago

Added a bearer token client in my fork - moving on to integrating it with full archive search. I looked through the tests and couldn't find any for http.clients.OAuthClient or http.clients.Client, so I'm unsure how I should test this new BearerTokenClient. Let me know if there are any changes I can make so it's up to snuff before a PR.

TheConner commented 3 years ago

A question: I think this new BearerTokenClient can be used in RestClient; however, it looks like the main REST client is based on Client, which is an OAuthClient... So I'm thinking I could decouple Client from OAuthClient, so that we can pass in some generic provider for auth (e.g., OAuth, bearer token, etc.), which would allow RestClient to dynamically pick a provider derived from configuration. But this will require a bit of plumbing, and there's some magic going on inside there that I'm not familiar with. So if you have any ideas as to how to do this, I'm all ears :smile:
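One possible shape for that decoupling, as a sketch only: the client depends on a small auth abstraction, and a factory picks the implementation from configuration. Every name here (`AuthProvider`, `BearerTokenAuth`, `OAuth1Auth`, `fromConfig`, and the config keys) is hypothetical and does not reflect twitter4s's actual classes; the OAuth 1.0a signing is passed in as a function rather than implemented.

```scala
object AuthProviderSketch {

  // Hypothetical abstraction: anything that can produce an Authorization header value.
  sealed trait AuthProvider {
    def authorizationHeader(method: String, url: String): String
  }

  // Bearer auth ignores the request details entirely.
  final case class BearerTokenAuth(token: String) extends AuthProvider {
    def authorizationHeader(method: String, url: String): String = s"Bearer $token"
  }

  // OAuth 1.0a must sign each request; the signer is supplied by the caller
  // (placeholder for the library's existing signing logic).
  final case class OAuth1Auth(sign: (String, String) => String) extends AuthProvider {
    def authorizationHeader(method: String, url: String): String =
      s"OAuth ${sign(method, url)}"
  }

  // The client no longer extends an OAuth-specific class; it is handed a provider.
  final class Client(auth: AuthProvider) {
    def requestHeaders(method: String, url: String): Map[String, String] =
      Map("Authorization" -> auth.authorizationHeader(method, url))
  }

  // Picks a provider from configuration. The key names are assumptions,
  // and failing on ambiguous credentials is just one possible policy.
  def fromConfig(
      config: Map[String, String],
      oauthSigner: (String, String) => String
  ): Either[String, AuthProvider] = {
    val bearer   = config.get("bearer-token")
    val hasOAuth = Seq("consumer-key", "consumer-secret", "access-key", "access-secret")
      .forall(config.contains)

    (bearer, hasOAuth) match {
      case (Some(_), true)      => Left("both bearer and OAuth credentials are configured")
      case (Some(token), false) => Right(BearerTokenAuth(token))
      case (None, true)         => Right(OAuth1Auth(oauthSigner))
      case (None, false)        => Left("no credentials configured")
    }
  }
}
```

Returning an `Either` keeps the "which credentials win" decision explicit at the call site instead of hiding it inside the client.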

DanielaSfregola commented 3 years ago

We do not have tests just for the OAuthClient as far as I can see (but we do have tests for OAuth1Provider!) - but I think we could add them.

Decoupling Client and OAuthClient I think is the right thing to do. Ideally, we could have people use:

TwitterRestClient(consumerToken, accessToken) // using current OAuth
TwitterRestClient(bearerToken) // using OAuth for bearer token
TwitterRestClient() // check the env variables and pick an OAuth strategy accordingly -- with a preference to the current OAuth? Or maybe we fail if we have env variables for both? Not sure yet, we can figure it out

TheConner commented 3 years ago

Initial decoupling done here with test case here. I'm having issues building twitter4s for use with my own application, so I'm having a hard time verifying that this works outside of the test cases... Lots of dependency issues when I add the twitter4s jar to my existing application.

TheConner commented 3 years ago

Update: there is a bug in my initial implementation, although figuring out why this is happening is a tad cryptic. Oddly enough all the tests pass, but when I use twitter4s within my application I get

Exception in thread "main" java.lang.AbstractMethodError: Receiver class com.danielasfregola.twitter4s.http.clients.rest.RestClient does not define or inherit an implementation of the resolved method 'abstract void de$heikoseeberger$akkahttpjson4s$Json4sSupport$_setter_$de$heikoseeberger$akkahttpjson4s$Json4sSupport$$jsonSourceStringMarshaller_$eq(akka.http.scaladsl.marshalling.Marshaller)' of interface de.heikoseeberger.akkahttpjson4s.Json4sSupport.
    at de.heikoseeberger.akkahttpjson4s.Json4sSupport.$init$(Json4sSupport.scala:96)
    at com.danielasfregola.twitter4s.http.clients.rest.RestClient.<init>(RestClient.scala:17)
    at com.danielasfregola.twitter4s.TwitterRestClient.<init>(TwitterRestClient.scala:41)
    at com.danielasfregola.twitter4s.TwitterRestClient$.apply(TwitterRestClient.scala:91)
    at com.danielasfregola.twitter4s.TwitterRestClient$.apply(TwitterRestClient.scala:75)
    at ca.advtech.ar2t.data.TweetIngest.<init>(TweetIngest.scala:28)
    at ca.advtech.ar2t.main$.main(main.scala:69)
    at ca.advtech.ar2t.TestRunMain$.main(TestRunMain.scala:5)
    at ca.advtech.ar2t.TestRunMain.main(TestRunMain.scala)

Continuing to investigate...

DanielaSfregola commented 3 years ago

Something to do with JSON support when initializing the RestClient.... (I had a quick look at the code -- just reading, I didn't check it out -- and didn't see anything obvious)

TheConner commented 3 years ago

It was due to a dependency issue: my application uses a different Scala version (Apache Spark is a few versions behind), so I had to build twitter4s for a different Scala version, and some of the dependency changes I made on my end didn't work. Also, building twitter4s as a jar and importing that jar meant that I had to manually include its dependencies.
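For anyone hitting the same AbstractMethodError: it is a typical symptom of mixing artifacts compiled for different Scala binary versions. A build.sbt fragment along these lines is one way to avoid hand-managing jars; the Scala version shown is only an assumed example of what a Spark-compatible project might use, and `<version>` is a placeholder to fill in, not a real release number.

```scala
// build.sbt (sketch)

// Assumed example: use whatever Scala version your Spark distribution targets.
scalaVersion := "2.12.15"

// %% appends the Scala binary version (e.g. _2.12) to the artifact name,
// so sbt resolves the build that matches scalaVersion, and it also pulls in
// transitive dependencies such as akka-http-json4s automatically.
libraryDependencies += "com.danielasfregola" %% "twitter4s" % "<version>"
```

When working from a fork, running `sbt publishLocal` in the fork (built with the matching Scala version) publishes it to the local Ivy repository with its dependency metadata, so the application can depend on it the same way instead of importing a bare jar.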

Anywho, after that and some other dependency pains I managed to fix the issues in my implementation, so now I'm successfully using twitter4s with bearer-token auth against the full-archive search endpoint!

PR is made, let me know if there are any changes you would like me to make to it :)