Comparing Negative result of correlation to ppscore.

mohsin127 commented 4 years ago

I have a simple question regarding ppscore. When I calculated the correlation between two datasets (columns), the result was -0.248, which means that when one increases the other decreases. But when I calculated the ppscore of the same columns, the result was 0.37 from x to y and 0 from y to x. This clearly indicates that x can predict y with a ppscore of 0.37 and that y cannot predict x.

But what I actually want to know is the relation between the 2 datasets: whether they are directly proportional (positive) or inversely proportional (negative) to each other.

Thank you,

FlorianWetschoreck commented 4 years ago

Thank you for opening this issue.

Without any further knowledge, I would be very careful about using the correlation score when the PPS is 0.37 in one direction but 0 in the other. Assuming the scores are valid and there was no error, this means that there is real predictive power in only one direction and that there is probably no symmetric (two-way) relation between the columns. What you should do is plot the data and look at the actual relation. Can you share your data and the analysis? Then I might have a look at it.
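To illustrate how a symmetric correlation and a directional predictive score can disagree, here is a minimal sketch in plain Python. The function `toy_predictive_score` is a hypothetical stand-in, not the ppscore implementation: it uses an in-sample lookup table instead of a cross-validated decision tree, and only borrows the idea of comparing a model's mean absolute error against a naive median baseline. The data (y = x²) is invented for the illustration.

```python
from collections import defaultdict
from statistics import mean, median


def pearson(xs, ys):
    """Pearson correlation; symmetric in its two arguments."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def toy_predictive_score(feature, target):
    """Hypothetical, simplified directional score (NOT ppscore itself).

    Predicts the target as the median target among rows that share the
    same feature value, then compares the resulting MAE with the MAE of
    always predicting the overall median. Evaluated in-sample, so it is
    optimistic compared with a cross-validated score.
    """
    groups = defaultdict(list)
    for f, t in zip(feature, target):
        groups[f].append(t)
    lookup = {f: median(ts) for f, ts in groups.items()}

    mae_model = mean(abs(t - lookup[f]) for f, t in zip(feature, target))
    naive = median(target)
    mae_naive = mean(abs(t - naive) for t in target)
    return max(0.0, 1.0 - mae_model / mae_naive)


# Invented example: y is fully determined by x, but not the other way round.
x = list(range(-10, 11))
y = [v * v for v in x]

r = pearson(x, y)
score_x_to_y = toy_predictive_score(x, y)
score_y_to_x = toy_predictive_score(y, x)

print(f"pearson(x, y) = {r:.3f}")
print(f"toy score x -> y = {score_x_to_y:.2f}")
print(f"toy score y -> x = {score_y_to_x:.2f}")
```

In this constructed case the correlation is essentially zero, the toy score from x to y is 1, and the toy score from y to x is 0, because each value of y is consistent with both +x and -x. Real data will be far less clean, which is why plotting it remains the first step.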

If you need a single score that tells you the direct two-way relation between the columns, then PPS might not be the right score for you. It ranges from 0 to 1 and carries no sign, so it does not give you this direct interpretation, and it is not suitable for a two-way, symmetric approach.
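If a single signed, symmetric number is what is needed, a rank correlation such as Spearman's rho is one common option: its sign indicates whether one column tends to rise or fall as the other rises. The sketch below is a plain-Python version written for illustration (ties get average ranks); the column names and values are made up and are not taken from the data discussed in this thread.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for a constant column."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up illustration: views fall steadily as the word count grows.
word_counts = [60, 90, 115, 140, 170, 200]
views = [2500, 2300, 2000, 1800, 1500, 900]

rho = spearman(word_counts, views)
direction = "inverse" if rho < 0 else "direct" if rho > 0 else "none"
print(f"spearman rho = {rho:.2f} ({direction} monotonic relation)")
```

A value near zero means there is no monotonic trend in either direction, which is different from saying that neither column helps to predict the other.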

Can you maybe state why you need this directly proportional relation? What is the business use case or interpretation?

mohsin127 commented 4 years ago

Hello Sir, thank you for the detailed answer. What I am trying to find is the relation between the number of words in a post and its comments, likes, shares and views. For that I need to know whether they are directly proportional or inversely proportional.

What I am getting from your reply is that if the ppscore from x to y is 0.37 and from y to x is 0, does this mean they are directly proportional to each other? If I am wrong, please correct me.

FlorianWetschoreck commented 4 years ago

Is it possible that you share some data?

Why do you want to find the relation? What do you want to achieve? What is your overarching target? Do you want to predict the likes, shares, comments, views of posts?

mohsin127 commented 4 years ago

What I want to know is whether, as the number of words in a post increases, the number of likes, comments, views etc. also increases or decreases. Is there any relation between them? Do they vary directly or inversely with each other?

The data is too sensitive to be shared, but it is just numbers, something like views = 2000, number of words in a post = 115. There are around 20,000 rows of this type.

FlorianWetschoreck commented 4 years ago

If it is possible, please share the (anonymized) data as a CSV, e.g. by uploading it to Google Drive, so that I can have a look at it. If this is not possible due to confidentiality, you can reach out to 8080labs.com for consulting. Otherwise, I am afraid that I cannot help you.

mohsin127 commented 4 years ago

Hello Florian, You can check the data here. https://docs.google.com/spreadsheets/d/1bvVXJP__eHmiX7KtiPp211Slor6JGQOBt8bSF3UrPRk/edit?usp=sharing

FlorianWetschoreck commented 4 years ago

Hi, thank you for sharing your data. I had a quick look at it. When I tried it, the PPS was always 0, which might be caused by the small amount of data. I also plotted it in various ways and tried some binning, and it made sense to me that the score was 0 because there is hardly any pattern.

Some observations:

- There are only a few posts which have many views. Most have an average number of views, some more, some less, but usually in the same range.
- Most posts have between 75 and 200 words. Only some have below 75 words.
- There seems to be no clear relation between the number of words and the number of views.

Asymmetry:

When a post has a lot of views, it is a little bit more likely to have 140-200 words. However, when a post has between 140-200 words, then there is hardly any difference to posts which have 75-140 words. And actually, posts with 140-200 words have a slightly higher ratio of posts with many views but also a slightly higher ratio of posts with almost no views. So, this case "polarizes" the data into the extremes. However, only very slightly so.
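A binned comparison of this kind can be sketched as follows. Everything specific here is an assumption made for illustration: the half-open bin edges, the thresholds for "almost no views" and "many views", and the toy posts, none of which come from the shared spreadsheet.

```python
from collections import defaultdict

# Assumed word-count bins as half-open intervals [low, high).
WORD_BINS = [(0, 75), (75, 140), (140, 201)]


def bin_label(words):
    """Map a word count to a label such as '75-139'."""
    for low, high in WORD_BINS:
        if low <= words < high:
            return f"{low}-{high - 1}"
    return "other"


def view_profile(posts, low_views, high_views):
    """Per word-count bin, return the share of low-view and high-view posts.

    posts is an iterable of (word_count, views) pairs; low_views and
    high_views are hypothetical thresholds chosen by the analyst.
    """
    counts = defaultdict(lambda: {"n": 0, "low": 0, "high": 0})
    for words, views in posts:
        c = counts[bin_label(words)]
        c["n"] += 1
        if views <= low_views:
            c["low"] += 1
        elif views >= high_views:
            c["high"] += 1
    return {
        label: {
            "n": c["n"],
            "low_share": c["low"] / c["n"],
            "high_share": c["high"] / c["n"],
        }
        for label, c in counts.items()
    }


# Invented toy posts: (word_count, views).
posts = [
    (60, 1900), (80, 2000), (115, 2000), (130, 2100),
    (150, 50), (160, 2000), (180, 9000), (195, 2100),
]

profile = view_profile(posts, low_views=100, high_views=5000)
for label in sorted(profile, key=lambda s: int(s.split("-")[0])):
    p = profile[label]
    print(f"{label:>8}: n={p['n']}, low={p['low_share']:.2f}, high={p['high_share']:.2f}")
```

Comparing the low and high shares across bins makes a "polarizing" bin visible, though with few posts per bin the shares are noisy and should not be over-interpreted.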

Disclaimer: I would be careful with the findings because those might not be statistically relevant. If those relationships were relevant, the PPS could have found them. Since the PPS did not find them, this indicates that the relationships are not strong enough to hold up under cross-validation. However, on your bigger datasets, the PPS seemed to be bigger than 0, so the patterns might be valid.

Summing up, there is hardly any relation to find here and it is definitely not what you hoped for (with more words there are more views, or the inverse). The scenario and data are so complex that I cannot answer more questions about it for free. If you want, you can try to hire 8080 Labs for consulting, but I am afraid that this is out of scope for discussions in the context of ppscore.

8080labs / ppscore

Comparing Negative result of correlation to ppscore. #36