yangguah closed this issue 6 years ago
Next, I redefined the relation of two keywords A and B as the likelihood that when people mention A in a tweet they also mention B in the same tweet, and vice versa. I then applied the Jaccard index:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where $A$ and $B$ here denote the sets of tweets containing each keyword. This method expresses the relationship of two keywords as a number in the range [0,1], which, based on several results in my demo, makes more sense than the method using the correlation formula. For example, given the two keywords "los" and "angeles", California, the state where Los Angeles is located, gives a very high relation number. In the demo at the state level, 4768 tweets contain "los" and 3898 tweets contain "angeles", and the relationship between them is very high at 0.8065.
Since each tweet has very few words, a user may say "Kobe is great!" in one tweet and "I love Bryant" in another. Intuitively, we want to treat these two tweets as one, so that the number of tweets containing both "Kobe" and "Bryant" is 1. If the tweets are kept separate, the intersection of "Kobe" and "Bryant" is 0, even though we consider both keywords to come from the same source. Besides, Qiushi helped check that the backup server has 3 million distinct users among 50 million tweets, which means each user has about 16 tweets on average. The good news is that there are no unknown users in the database. Therefore, it is a good idea to treat all of a user's tweets as a single document.
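The effect of merging can be seen in a minimal sketch. The `tweets` list, the user IDs, and the helper names below are made up for illustration and are not taken from the actual code:

```python
from collections import defaultdict

# Hypothetical (user_id, text) pairs; not real data.
tweets = [
    (1, "Kobe is great!"),
    (1, "I love Bryant"),
    (2, "Bryant park is nice"),
]

def words(text):
    """Lowercase the text and strip basic punctuation from each word."""
    return {w.strip("!?.,\"'").lower() for w in text.split()}

def tweet_level_overlap(tweets, a, b):
    """Count single tweets that mention both keywords."""
    return sum(1 for _, text in tweets if {a, b} <= words(text))

def user_level_overlap(tweets, a, b):
    """Merge each user's tweets into one document, then count users mentioning both."""
    docs = defaultdict(set)
    for user_id, text in tweets:
        docs[user_id] |= words(text)
    return sum(1 for vocab in docs.values() if {a, b} <= vocab)

tweet_overlap = tweet_level_overlap(tweets, "kobe", "bryant")
user_overlap = user_level_overlap(tweets, "kobe", "bryant")
print(tweet_overlap, user_overlap)
```

With separate tweets no single tweet contains both keywords, while after merging user 1 counts once for the intersection.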
Here we send the query to Cloudberry and get the results after grouping the tweets that contain "Los" by state_id and user_id. Cloudberry returns the results as JSON, which I parse.
After parsing, the results become readable: the result is an array of dictionaries. Each dictionary has a key "geo_tag.stateID", whose value identifies a state, and a key "user.id", whose value identifies a user. Still at the state level, the second query we send to Cloudberry uses the keyword "Angeles" and is likewise grouped by state_id and user_id.
For example, "state_counts[0]" holds, for each state, the IDs of the users whose tweets contain the first keyword "Los".
From the picture, you can see that the state whose ID is 1 has a set containing 7 different user_ids.
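The grouping and the per-state Jaccard computation could look roughly like the sketch below. The row keys "geo_tag.stateID" and "user.id" come from the parsed results described above; the sample rows, the function names, and the exact layout of `state_counts` (a list with one state-to-users dictionary per keyword) are assumptions made for illustration:

```python
from collections import defaultdict

def users_by_state(rows):
    """Map each state ID to the set of user IDs found in the parsed rows."""
    by_state = defaultdict(set)
    for row in rows:
        by_state[row["geo_tag.stateID"]].add(row["user.id"])
    return dict(by_state)

def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Made-up rows standing in for the two parsed Cloudberry responses.
los_rows = [
    {"geo_tag.stateID": 6, "user.id": 101},
    {"geo_tag.stateID": 6, "user.id": 102},
    {"geo_tag.stateID": 6, "user.id": 103},
    {"geo_tag.stateID": 32, "user.id": 201},
]
angeles_rows = [
    {"geo_tag.stateID": 6, "user.id": 102},
    {"geo_tag.stateID": 6, "user.id": 103},
    {"geo_tag.stateID": 6, "user.id": 104},
]

# state_counts[0] is for the first keyword, state_counts[1] for the second.
state_counts = [users_by_state(los_rows), users_by_state(angeles_rows)]

# A state missing from one response contributes an empty set for that keyword.
relation = {
    state: jaccard(state_counts[0].get(state, set()),
                   state_counts[1].get(state, set()))
    for state in set(state_counts[0]) | set(state_counts[1])
}
print(relation)
```

Using sets per state means a user who mentions a keyword in several tweets is still counted once, which matches treating each user's tweets as one document.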
Key1 | Key2 | Description | Analysis |
---|---|---|---|
New | York | Only New York state gives a very high correlation number, 0.5688. The correlation number in the other states is very small, around 0.025 in every state. | That is because people in New York state like to talk about "New York" in their tweets. People like to talk about the place where they live. |
Los | Angeles | California gives a correlation number of 0.7995 and has many users talking about "Los" or "Angeles". The correlation number in other states is low. Some of them reach about 0.25, and the number of users talking about "Los" or "Angeles" there is much lower than in California. | In Nevada, 53 users talk about "Los" and 12 users talk about "Angeles", and the correlation number is 0.2264. Following this trend, even if the number of users talking about "Los" or "Angeles" grew a lot, the correlation number would stay around 0.25, since the two keywords are not strongly related there. But the correlation number in California is very high. Again, that is because people like to talk about the place where they live. |
New | Mexico | The correlation number in New Mexico state is 0.5282. The correlation number in the other states is very low, almost always below 0.01. | Although the count of users in California talking about "New" or "Mexico" is very large, the correlation number of the two keywords is very low (0.0082). To be specific, in California 9488 users talk about "New" ("new" is a very common word in tweets) and 580 users talk about "Mexico" (California has a Mexican culture). The count of users talking about "New" is much higher in California than in New Mexico (9488 vs 436), yet the correlation number in California is much lower than in New Mexico (0.0082 vs 0.5282). People in New Mexico mention the two keywords together much more often than people in California do. That fits, since people like to talk about the place where they live. More importantly, this supports the Jaccard function as a sensible measure. |
happy | birthday | “Happy” and “Birthday” work well. | The two keywords are strongly related in every state, but there is not much difference among states. “Merry” and “Christmas” behave the same way as “Happy” and “Birthday”, since both are greetings commonly used in daily life. |
super | bowl | This is a good one. | The Super Bowl is a big American football game. The correlation number varies across states according to how popular the Super Bowl is in each state. The correlation number is highest in Minnesota, because it was a hot topic there that the Minnesota Vikings won the NFC North in 2017 for the second time in three years (big news). |
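
The Nevada figures in the table can be checked by hand, because the Jaccard index can be computed from the two counts and the size of their overlap. The number of users who mention both keywords is not reported, so the value `overlap=12` below is an assumption: it is the case where every user who mentions "Angeles" also mentions "Los", and it reproduces the reported 0.2264. The function name is hypothetical.

```python
def jaccard_from_counts(count_a, count_b, overlap):
    """Jaccard index from two set sizes and the size of their intersection."""
    union = count_a + count_b - overlap  # inclusion-exclusion
    return overlap / union if union else 0.0

# Counts from the Nevada row; overlap=12 is an assumption, not a reported figure.
nevada = jaccard_from_counts(count_a=53, count_b=12, overlap=12)
print(round(nevada, 4))
```

This also shows why the number stays low in such states: when one keyword is mentioned far more often than the other, the union is dominated by the larger count, so the index cannot exceed the ratio of the smaller count to the larger one.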
However, given "Trump" and "Clinton", the correlation map did not show any meaningful message. Nor did we get useful information from the correlation map for the keywords "Apple" and "Android". So I decided to connect to our back-server to see whether I could find a deeper meaning in this correlation map.
After connecting to our back-server, I still could not come up with a pair of keywords that gives people insight. The keywords are either related to each other or not, and there is no more meaningful information than that. The value of the Jaccard function comes down to the question: "If two keywords are related, can we get a deeper meaning from this?" Unfortunately, the Jaccard function is not useful unless we answer this question. Also, I doubt we can make a meaningful correlation map unless we give it an exact definition and meaning. We might be able to create a map that shows the support rate of Trump and Hillary. For example, if people in a state talk more about Trump than about Hillary, the state is marked red; otherwise it is marked blue. A shortcoming of this method is that people may talk about Trump in order to excoriate him. Even if we wanted to create such a map of people's support, we would need more effort to work out how to build it. At the least, the Jaccard function is not what we want, since knowing that two objects are related does not give us much information or deeper meaning.
I am going to close this issue. The conclusion is: we haven't found a good meaning or use case for the correlation map.
@yangguah : Thank you.
Spring quarter's work
First try: using the statistical correlation formula
The work I was given is to show the relation of two keywords at the state, county, and city levels in the twittermap. Using the correlation formula, I converted the relation of two keywords into a number in the range [-1,1]. To be specific, the correlation formula is

$$r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2}\,\sqrt{\sum_{i}(y_i - \bar{y})^2}}$$

where $x_i$ and $y_i$ are the counts of tweets in grid cell $i$ that contain the first and the second keyword, respectively. For example, I divide every state into an 8x8 grid. For each grid cell I get the count of tweets where the first keyword appears and the count of tweets where the second keyword appears, and then I apply the correlation formula.
However, this method did not make sense, because it only showed the correlation of the population distribution across each region, not the correlation between the two keywords. The correlation we actually want here is the likelihood that the two keywords are mentioned in the same tweet.
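
A small sketch of why the grid-based correlation mostly reflects population: if two unrelated keywords each appear in a fixed share of tweets, their per-cell counts both follow the tweet volume of the cell, and the correlation comes out close to 1. The tweet volumes and keyword shares below are made up for illustration, and the sketch assumes the correlation in question is the Pearson coefficient.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for constant input; 0.0 is a convention chosen here
    return cov / math.sqrt(var_x * var_y)

# Hypothetical tweet volume per cell of an 8x8 grid: one dense cell, 63 sparse ones.
volume = [100] * 63 + [10_000]

# Two keywords assumed unrelated: each appears in a fixed share of tweets per cell.
count_a = [v * 5 // 100 for v in volume]  # 5% of tweets mention keyword A
count_b = [v * 2 // 100 for v in volume]  # 2% of tweets mention keyword B

r = pearson(count_a, count_b)
print(round(r, 4))
```

The per-cell counts never record whether the two keywords occur in the same tweet, so the result says nothing about co-occurrence.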