ISG-ICS / cloudberry

Big Data Visualization
http://cloudberry.ics.uci.edu

Visualizing correlation of two keywords on Twittermap #548

Closed yangguah closed 6 years ago

yangguah commented 6 years ago

Spring quarter's work

First try--using the statistical correlation formula

The task I was given is to show the relation of two keywords at the state, county, and city level on the Twittermap. Using the correlation formula, I converted the relation of two keywords into a number in the range [-1, 1]. (formula screenshots: qq20180717-203131, qq20180717-203918) For example, I divide every state into an 8x8 grid. For each grid cell I get the count of tweets in which the first keyword appears and the count of tweets in which the second keyword appears, and then apply the correlation formula.
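The formula in the screenshots is presumably the Pearson correlation coefficient, r = Σ(x−x̄)(y−ȳ) / √(Σ(x−x̄)² Σ(y−ȳ)²). A minimal sketch of applying it to the per-cell counts, assuming `xs[i]` and `ys[i]` hold the tweet counts for the two keywords in grid cell `i` (illustration only, not the demo's actual code):

```javascript
// Pearson correlation between two equal-length arrays of grid-cell counts.
function pearson(xs, ys) {
  const n = xs.length;
  const mean = a => a.reduce((s, v) => s + v, 0) / n;
  const mx = mean(xs), my = mean(ys);
  let cov = 0, vx = 0, vy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - mx, dy = ys[i] - my;
    cov += dx * dy;   // covariance term
    vx += dx * dx;    // variance of xs
    vy += dy * dy;    // variance of ys
  }
  return cov / Math.sqrt(vx * vy);
}
```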

However, this method did not make sense, because it only captured the correlation between the population distributions of the two keywords across each region, not the correlation between the keywords themselves. The correlation we actually want here is the probability that the two keywords are mentioned in the same tweet.

yangguah commented 6 years ago

Spring quarter's work

Second try--using the Jaccard index

Next, I redefined the relation of two keywords A and B as the probability that when people mention A in a tweet they also mention B in the same tweet, and vice versa. I then applied the Jaccard index. (screenshots: qq20180717-211044, qq20180717-211059, qq20180717-211305) This method expresses the relationship of two keywords as a number in the range [0, 1], which makes more sense than the correlation formula, based on several results in my demo. For example, given the two keywords "los" and "angeles", California (home of Los Angeles) gives a very high relatedness number. (screenshots: qq20180718-151932, qq20180718-152433) From the demo, we can see that at the state level the count of tweets containing "los" is 4768 and the count of tweets containing "angeles" is 3898, and the relationship between them is a very high 0.8065.
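The Jaccard index in the screenshots is |A ∩ B| / |A ∪ B|. A minimal sketch, representing the tweets containing each keyword as sets of tweet ids (not the demo's actual code):

```javascript
// Jaccard index of two sets: size of intersection over size of union.
function jaccard(a, b) {
  const inter = [...a].filter(x => b.has(x)).length;
  const union = a.size + b.size - inter;
  return union === 0 ? 0 : inter / union;
}
```

For two disjoint sets this yields 0 (unrelated); for identical sets it yields 1 (perfectly related), matching the [0, 1] range described above.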

yangguah commented 6 years ago

Summer's work

Why treat all of a user's tweets in one place as one document when using the Jaccard function?

Since each tweet has very few words, a user may say "Kobe is great!" in one tweet and "I love Bryant" in another. Intuitively we want to regard these two tweets as one, because then the number of tweets containing both "Kobe" and "Bryant" is 1. If the tweets are kept separate, the intersection of "Kobe" and "Bryant" is 0, even though we consider "Kobe" and "Bryant" to co-occur. Besides, Qiushi helped check that the backup server has 3 million distinct users among 50 million tweets, which means each user has about 16 tweets on average. The good news is that there is no unknown user in the database. (screenshot: qq20180717-215232) Therefore, treating each user's tweets as one document is a good approach.
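The per-user grouping can be sketched as follows; the `tweets` shape (`userId`, `text`) is hypothetical, not the actual pipeline's schema:

```javascript
// Concatenate each user's tweets in a region into one "document",
// so keywords from different tweets by the same user co-occur.
function groupByUser(tweets) {
  const docs = new Map();
  for (const t of tweets) {
    const prev = docs.get(t.userId) || '';
    docs.set(t.userId, prev + ' ' + t.text);
  }
  return docs;
}
```

With the "Kobe"/"Bryant" example above, the two tweets collapse into one document that contains both keywords, so the intersection count becomes 1 instead of 0.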

yangguah commented 6 years ago

Summer's work

How I implemented the task

  1. Create six queries. I create six different queries; each is sent to Cloudberry based on one keyword, grouping tweets by state_id and user_id at the state level, by county_id and user_id at the county level, or by city_id and user_id at the city level. For example, for the keywords "Los" and "Angeles", the query for "Los" grouped by state_id and user_id looks like this: (screenshot: qq20180718-154756) I'd like to explain the "limit: 6000000" in my query: without this line my demo would not run, and I do not know why. Adding it means that if the number of results from Cloudberry exceeds 6,000,000, the results are cut off at 6,000,000. This does not matter, since my database has 4,289,787 tweets (fewer than 6,000,000); the line is there only to ensure that my demo can run. I also create a variable called "d_query" to store the six different results obtained from Cloudberry. (screenshot: qq20180718-165847)

Here we send the query to Cloudberry and get the results after grouping the tweets that contain "Los" by state_id and user_id. The result I get from Cloudberry is JSON, which I parse. (screenshot: qq20180718-161343)

The transformed results then become intelligible: (screenshot: qq20180718-160604) The result is an array of dictionaries. Each dictionary has a key "geo_tag.stateID" whose value identifies a state and a key "user.id" whose value identifies a user. At the state level, the second query sent to Cloudberry is based on the keyword "Angeles" and is likewise grouped by state_id and user_id. (screenshot: qq20180718-162558)
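For reference, a Cloudberry JSON request of this shape might look roughly as follows. This is a sketch reconstructed from the description above; the exact field names (`text`, `geo_tag.stateID`, `user.id`) and request structure are assumptions, not verified against Cloudberry's schema:

```javascript
// Hypothetical sketch of the request: tweets containing "los",
// grouped by state id and user id, with the limit discussed above.
const query = {
  dataset: 'twitter.ds_tweet',
  filter: [
    { field: 'text', relation: 'contains', values: ['los'] },
  ],
  group: {
    by: [
      { field: 'geo_tag.stateID', as: 'state' },
      { field: 'user.id', as: 'user' },
    ],
  },
  // Cut-off that lets the demo run; larger than the 4,289,787-tweet dataset.
  select: { limit: 6000000, offset: 0 },
};
```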

yangguah commented 6 years ago

Summer's work

How I implemented the task

  2. Process the data. After getting the results from Cloudberry, I load them into the state level and county level. For example, based on the first keyword "Los", I store, for every state, the users whose tweets contain "Los" in a variable named "state_counts". (screenshots: qq20180718-164907, qq20180718-164940) Here I load the data into the state layers and county layers. (I have not finished the city layers yet.)

For example, "state_counts[0]" contains, per state, the ids of users whose tweets contain the first keyword "Los".

qq20180718-171414

From the picture, you can see that the state whose ID is 1 has a set containing 7 different user_ids.
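Building these per-state user sets can be sketched as follows, assuming the parsed rows carry the `geo_tag.stateID` and `user.id` keys shown earlier (illustration only):

```javascript
// Build a Map from state id to the Set of user ids whose tweets
// contain a given keyword; duplicate rows collapse via the Set.
function usersByState(rows) {
  const byState = new Map();
  for (const r of rows) {
    const sid = r['geo_tag.stateID'];
    if (!byState.has(sid)) byState.set(sid, new Set());
    byState.get(sid).add(r['user.id']);
  }
  return byState;
}
```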

yangguah commented 6 years ago

Summer's work

How I implemented the task

  3. Calculate the intersection of users using the Jaccard function. I'd like to give a simple example of what the intersection of users is: it is the set of users whose tweets contain both the first keyword and the second keyword. Again, suppose the first keyword is "Los" and the second is "Angeles". Say that in California there are 4 users whose tweets contain "Los", with user_ids {1234, 2345, 3456, 4567}, and 3 users whose tweets contain "Angeles", with user_ids {1234, 7890, 2345}. Then the number of users whose tweets contain both "Los" and "Angeles" is 2, namely users 1234 and 2345. I wrote a function to calculate the number of users whose tweets contain both keywords in a given region (a state, county, or city). (screenshot: qq20180718-175055) For each region we thus get the number of users whose tweets contain both keywords, which we denote "the intersection of A and B". How, then, can we get the number of users whose tweets contain the first keyword or the second keyword (which we denote "the union of A and B")? Here is the formula: |A ∪ B| = |A| + |B| − |A ∩ B|. (screenshot: qq20180722-102750) Since the correlation of a region is defined as the number of users whose tweets contain both keywords divided by the number of users whose tweets contain either keyword, we can calculate the correlation in each region. For example: (screenshot: qq20180722-103341) "state_counts[2][x]" is the intersection of users in state x, "(state_counts[0][x].size + state_counts[1][x].size - state_counts[2][x].size)" is the union of users in state x, and "corr[x]" is the correlation in state x. We use this method to calculate the correlation at the state, county, and city levels.
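The whole step can be sketched as one function over two per-region user-set maps. This illustrates the formula above; it is not the demo's actual code:

```javascript
// Per-region Jaccard correlation from two {region -> Set(userId)} maps:
// |A ∩ B| / (|A| + |B| − |A ∩ B|) for each region present in the first map.
function regionCorrelation(aUsers, bUsers) {
  const corr = new Map();
  for (const [region, a] of aUsers) {
    const b = bUsers.get(region);
    if (!b) { corr.set(region, 0); continue; }          // no overlap possible
    const inter = [...a].filter(u => b.has(u)).length;  // |A ∩ B|
    corr.set(region, inter / (a.size + b.size - inter)); // divide by |A ∪ B|
  }
  return corr;
}
```

With the sets from the example above ({1234, 2345, 3456, 4567} and {1234, 7890, 2345}), the intersection is 2 and the union is 5, giving a correlation of 0.4.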
yangguah commented 6 years ago

Summer's work

How I implemented the task

  4. Load a larger dataset and measure the query time. I have loaded the dataset that contains all the tweets from January 24th, 2017 to January 3rd, 2018 (one year of tweets), 4,289,787 tweets in total. (screenshot: qq20180708-215542) Then, for each query sent to Cloudberry, I measured how long it took to get the results back. For example: (screenshot: qq20180722-110358) "finished_count=1" means that I have sent the first query to Cloudberry and received its results. "t2" is the time when I start sending the first query, and "T" is the time when its results arrive. For the keywords "los" and "angeles", the per-query times are: (screenshot: qq20180722-111154) The six queries take around 23 seconds in total, about 3.8 seconds each, which is expensive. Right now my focus is on checking whether the Jaccard function makes sense rather than on the query time. Next I will try different combinations of keywords and analyze whether the Jaccard function is appropriate for our correlation map.
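The timing amounts to bracketing the query call with two timestamps, named `t2` and `T` as above. A minimal synchronous sketch (the real query round-trip is asynchronous):

```javascript
// Time how long a call takes; fn stands in for the round-trip to Cloudberry.
function timeIt(fn) {
  const t2 = Date.now();  // moment the query is sent
  const result = fn();    // the round-trip itself
  const T = Date.now();   // moment the results arrive
  return { result, seconds: (T - t2) / 1000 };
}
```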
yangguah commented 6 years ago

Try different combinations of keywords

| Key1 | Key2 | Picture | Description | Analysis |
| --- | --- | --- | --- | --- |
| New | York | qq20180724-211745 | Only New York gives a very high correlation number of 0.5688. The correlation number in the other states is very small, around 0.025 in every state. | People in New York state like to talk about "New York" in their tweets; people like to talk about the place where they live. |
| Los | Angeles | qq20180724-214850 | California gives a correlation number of 0.7995. The correlation number in other states is low; some reach 0.25, and the number of users talking about "Los" or "Angeles" there is much lower than in California. In Nevada, 53 users talk about "Los" and 12 about "Angeles", and the correlation number is 0.2264. Following this trend, even if the number of users talking about "Los" or "Angeles" grew a lot, the correlation number would stay around 0.25, since the two keywords are not closely related there. | The correlation number in California is very high because, again, people like to talk about the place where they live. |
| New | Mexico | qq20180724-225904 | The correlation number in New Mexico is 0.5282. The correlation number in the other states is very low, almost all below 0.01. Although the count of users in California talking about "New" or "Mexico" is very large, the correlation number there is very low (0.0082). Specifically, 9488 users in California talk about "New" ("new" is a very common word in tweets) and 580 talk about "Mexico" (California has a Mexican culture); the count of users in California talking about "New" is much higher than in New Mexico (9488 vs 436), yet California's correlation number is much lower than New Mexico's (0.0082 vs 0.5282). | People in New Mexico mention these two keywords together much more often than people in California do. This is expected, since people like to talk about the place where they live; more importantly, it strengthens the case for the Jaccard function. |
| happy | birthday | qq20180801-101825 | "Happy" and "Birthday" work well: the two keywords are closely related in every state, but there is not much difference among states. | "Merry" and "Christmas" behave the same way as "Happy" and "Birthday", since both pairs are common everyday greetings. |
| super | bowl | qq20180801-102309 | This is a good one. The Super Bowl is a big American football game, and the correlation number varies across states with the game's popularity. | The correlation number is highest in Minnesota, where there was a hot topic in 2017: the Vikings won the NFC North for the second time in three years (big news). |

Given "Trump" and "Clinton", however, the correlation map did not show any meaningful message, nor did we get useful information from the map for the keywords "Apple" and "Android". So I decided to connect to our back-end server to see whether I could get a deeper meaning out of this correlation map.

yangguah commented 6 years ago

Summary

After connecting to our back-end server, I still cannot come up with a pair of keywords that yields real insight. Two keywords are either related or not related, and there is nothing more meaningful to extract. Whether the Jaccard function makes sense can be reduced to this question: "If two keywords are related, can we get a deeper meaning from that?" Unfortunately, the Jaccard function does not make sense unless we answer it. I also doubt we can make a meaningful correlation map unless we give it an exact definition and meaning. We might instead be able to create a map that shows the support rates of Trump and Hillary: for example, in a given state, if people talk more about Trump than about Hillary, the state is marked red, and otherwise blue. One shortcoming of this method is the situation where people talk about Trump only to excoriate him. Even if we create such a support map, we need more thought about how to build it properly. At least the Jaccard function is not what we want, since we cannot get much information or deeper meaning from knowing two objects are related.
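The proposed red/blue support map could be sketched like this, with hypothetical per-state mention counts; it inherits the shortcoming noted above, since it ignores sentiment entirely:

```javascript
// Color each state red if more users mention the first candidate than
// the second, else blue; the input maps go from state id to user count.
function supportColors(trumpByState, hillaryByState) {
  const colors = new Map();
  for (const [state, t] of trumpByState) {
    const h = hillaryByState.get(state) || 0;
    colors.set(state, t > h ? 'red' : 'blue');
  }
  return colors;
}
```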

chenlica commented 6 years ago

I am going to close this issue. The conclusion is: we haven't found a good meaning or use case for the correlation map.

@yangguah : Thank you.