Twitter code needs some performance improvements

davidnixon commented 4 years ago

Clicking the button to see tweets works, but results are delayed by about 30 seconds in my local testing. I did a very, very light analysis, and it looks like NLU and Tone Analyzer each take a bit more than ½ second per tweet; with 25 tweets and two calls per tweet, that adds up to about the 30 seconds.
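As a rough sanity check on that estimate, here is the arithmetic as a small Python sketch. The 0.6 s per-call figure is an assumption standing in for "a bit more than ½ second"; it is not a measured value.

```python
# Back-of-the-envelope estimate of the sequential latency.
TWEETS = 25
SERVICES_PER_TWEET = 2   # NLU and Tone Analyzer
SECONDS_PER_CALL = 0.6   # assumed: "a bit more than half a second"


def sequential_seconds(tweets, services, per_call):
    """Total wall-clock time when every call waits for the previous one."""
    return tweets * services * per_call


total = sequential_seconds(TWEETS, SERVICES_PER_TWEET, SECONDS_PER_CALL)
print(f"Estimated sequential time: {total:.0f} s")
```

Because the calls run one after another, the total grows linearly with the number of tweets, which is why the options below focus on fewer or overlapping calls.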

Possible solutions:

Do not use Watson (Shrey originally did this work with TextBlob instead)
Combine the 25 tweets into one request to TA and NLU
Something else (opinions welcome here)

Maybe the Watson docs have guidelines on performance?

@drealuc Do you have an opinion?

drealuc commented 4 years ago

Marek will provide a recommendation to resolve the performance issue.

blumareks commented 4 years ago

@drealuc @davidnixon My suggestion would be to use a fan-out pattern (which might be implemented with threads) followed by a fan-in step to gather the results. Here is how I have done it:

I used a serverless action (IBM Cloud Functions) to fetch all the IDs (or URLs) of the news items/tweets.
Having a list of a couple dozen of those (~500 news items), I bulk-inserted them into the Cloudant DB to start the fan-out process (based on serverless; you might want to execute it in threads instead).
An IBM Cloud Functions trigger fires on each change made in the database (an insert of the ID in this case). Since I was doing this repeatedly, I also checked whether the given URL was already there, so I wouldn't analyze a tweet/news item more than once (Watson is costly in time and money).
The Cloudant trigger invokes a call to Watson NLU and inserts the analysis into another Cloudant DB; let's call it the Analysis DB.
Fan-in (gathering the results): when the Watson processing is done (assume 1-2 sec), you can read the respective records from the Analysis DB using the "primary keys" from the initial list of news/tweet IDs, and present the results of the analysis to the user.
This approach lets you analyze a given tweet/news item only once (calling Watson is expensive in time and money).
Using serverless lets you hand the event-based task over to the platform: the vendor (IBM Cloud) takes care of it and provides 99.99% availability regardless of scale.
Serverless has a sizable free tier (400,000 GB-seconds free, which represents a couple of million calls free of charge per month).
VERY IMPORTANT FOR PERFORMANCE: both the serverless action and the Watson service(s) were on the same platform (IBM Cloud), cutting down the connectivity latency (usually 200-300 ms is lost on each call to Watson from outside, which can accumulate over many calls).
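The steps above can be sketched in plain Python, using threads in place of serverless actions. This is only an illustration under stated assumptions: `ids_db` and `analysis_db` are hypothetical in-memory stand-ins for the two Cloudant databases, `analyze` is a placeholder for the Watson NLU call, and `on_insert` plays the role of the database trigger. None of these names come from the actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the two Cloudant databases.
ids_db = set()      # IDs/URLs already submitted for analysis
analysis_db = {}    # item ID -> analysis result (the "Analysis DB")


def analyze(item_id):
    """Placeholder for the Watson NLU call; returns a fake result."""
    return {"id": item_id, "sentiment": "neutral"}


def on_insert(item_id):
    """Plays the role of the database trigger: analyze and store one item."""
    analysis_db[item_id] = analyze(item_id)


def fan_out(item_ids, max_workers=8):
    """Insert only unseen IDs and fire one 'trigger' per new ID."""
    # dict.fromkeys drops duplicates within the batch while keeping order.
    new_ids = [i for i in dict.fromkeys(item_ids) if i not in ids_db]
    ids_db.update(new_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Consume the iterator so any exception in a worker is raised here.
        list(pool.map(on_insert, new_ids))
    return new_ids


def fan_in(item_ids):
    """Gather stored results by key, in the order the IDs were requested."""
    return [analysis_db[i] for i in item_ids]


first = fan_out(["t1", "t2", "t3"])
second = fan_out(["t2", "t3", "t4"])   # only "t4" has not been seen before
results = fan_in(["t1", "t2", "t3", "t4"])
```

The second `fan_out` call only analyzes the one unseen ID, which mirrors the "analyze each item only once" check described above.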

The details of my implementation with HackerNews are here for you to check out: https://github.com/serverless-swift/ch6-app
The video of the implementation: https://github.com/serverless-swift/ch6-app
Chapter 6 of my :-) book, which talks about it, is here (you might have access to it via O'Reilly): https://learning.oreilly.com/library/view/serverless-swift-apache/9781484258361/
And finally, the videos showing me implementing these steps are here:
step 1: https://youtu.be/0G3ji8RouKA
step 2: https://youtu.be/FYolLFvIsSc
We can have an additional call to go over it, or I could even help adjust my serverless backend to our needs if needed.

blumareks commented 4 years ago

@Shreyanand Shrey please have a look ^^ at the above explanations.

Shreyanand commented 4 years ago

@blumareks Thanks a lot for the detailed explanation. I went through the resources and, if IIUC, there needs to be a database into which the tweets are fetched, a trigger that calls Watson NLU, and a fan-in step to collect the results for all the tweets in another database. In addition to the parallel processing for tweets, if Watson and the tweet database are on the same cloud server it saves even more time...

While it seems really interesting, and I'd love to talk to you about it to understand this more, for the immediate goal I think this would be a little difficult to implement.

Having said that, your multi-threading comment was a winner. I'm not sure if I interpreted it right, but I realized that there isn't any CPU task here and it's just IO calls to the API. So, I just gave naive multi-threading a try and it got the time from ~38s to ~2s.
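For reference, a minimal sketch of what naive multi-threading for I/O-bound calls can look like with the standard library's `concurrent.futures`. This is not the code that was merged: `analyze_tweet` is a hypothetical stand-in for the per-tweet NLU and Tone Analyzer calls, and the 0.1 s sleep is a simulated latency, not a measurement.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def analyze_tweet(text):
    """Hypothetical stand-in for the NLU + Tone Analyzer calls on one tweet."""
    time.sleep(0.1)  # simulated network latency (assumed value)
    return {"text": text, "length": len(text)}


def analyze_all(tweets, max_workers=25):
    """Run the I/O-bound calls concurrently; map() preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze_tweet, tweets))


tweets = [f"tweet {n}" for n in range(25)]
start = time.perf_counter()
results = analyze_all(tweets)
elapsed = time.perf_counter() - start
print(f"{len(results)} tweets analyzed in {elapsed:.2f} s")
```

Threads help here because each worker spends nearly all its time waiting on the network, so the waits overlap instead of adding up. One thing worth checking with a real API is whether this many simultaneous requests runs into rate limits.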

Screenshot from 2020-10-07 16-30-17

Although it works, I want to confirm with you that this is acceptable and that it would not result in any other problems...

drealuc commented 4 years ago

Ready for unit testing and merge

sydrosa commented 4 years ago

This was merged into master -- I will close now.

Call-for-Code-for-Racial-Justice / Five-Fifths-Voter

Twitter code needs some performance improvements #31