JelenaBanjac / network-science-and-graphs

Assignments for the Network Tour of Data Science at EPFL (EE-558).
0 stars 0 forks source link

Questions to check #1

Open psiul opened 4 years ago

psiul commented 4 years ago

1) For the comparison of moments, do we need to compare deviations compared to the mean value as well? 2) In the course, they define moments as raw moments rather than central moments (slide set 1). I updated the answer in my branch based on that 3) For the top20 hubs, do they want us to delete the nodes or just remove the edges? If we only remove the edges, there are holes in the plot so the visualization makes more sense. I didn't act on it in the branch but have code in a local notebook 4)do we need to sum the power from 1 to 10 to get the number of path of length 10. i think you need to sum for path of length of at most 10 5) feature path matrix actually overflows. Any ideas about how we should handle it?

canaltinigne commented 4 years ago

For this one "do we need to sum the power from 1 to 10 to get the number of path of length 10. i think you need to sum for path of length of at most 10", isn't it asking getting the number of paths from 1 to 10, or I understood it wrong ?

canaltinigne commented 4 years ago

3, I deleted the nodes, but removing edges makes more sense for visualization, so you can change that part if you already wrote the version.

JelenaBanjac commented 4 years ago
  1. For the comparison of moments, do we need to compare deviations compared to the mean value as well?

I don't know if we need to compute deviations to the mean value for moments. I guess since they didn't mention anything, we can explain the difference only by the values of moments. We should probably ask that.

  1. In the course, they define moments as raw moments rather than central moments (slide set 1). I updated the answer in my branch based on that

Yes, thanks, I also see they say that the 2nd moment is <k2> in the slides.

But there is still a question of how we explain the difference in moments between citation and feature graphs when looking at their distributions? Correct me if I am wrong:

So, for example if we take citation graph, from the degree distribution we see it has power law, therefore it is probably scale free graph, so we expect only the first moment to be finite, but all higher ones to diverge? But the result we get is:

1st moment of citation graph: 2.8728606356968216
2nd moment of citation graph: 23.765281173594133

So it seems the 2nd moment is fine.

On the other hand, if we take a feature graph, it has a degree distribution which looks like Poisson distribution (with small λ), therefore it is probably a random graph, so we expect the 1st and 2nd moments to be finite? But the result we get is:

1st moment of feature graph: 334.4034229828851
2nd moment of feature graph: 167201.19804400977
5th moment of feature graph: 61316925573216.65

So it seems that the higher moments grow very rapidly.

Can you see what could be wrong? We can discuss tomorrow.

  1. For the top20 hubs, do they want us to delete the nodes or just remove the edges? If we only remove the edges, there are holes in the plot so the visualization makes more sense. I didn't act on it in the branch but have code in a local notebook

I agree. I guess it is fine to remove nodes since the hubs are defined as nodes, but for visualization, we just remove the edges and leave the nodes (hubs). I tried it now in my brach, but there is no much visual change though (unless we put 120 instead of 20 :) ).

  1. do we need to sum the power from 1 to 10 to get the number of path of length 10. i think you need to sum for path of length of at most 10

I think so as well, so sth like this?

# gives the matrix of walks of len 10
path_matrix_citation = np.linalg.matrix_power(A_citation, 10)

# [this] gives the sum from 1 to 10
path_matrix_citation =np.sum([np.linalg.matrix_power(A_citation, i) for i in range(11)], axis=0)
  1. feature path matrix actually overflows. Any ideas about how we should handle it?

That is a really good question, the max N it can handle is 1 and the pruned one seems to have similarities for it. I guess it is because our feature matrix is not sparse, the min element is not 0. So even if we could plot it until 10th degree, it will all be black (?). I mean if we see the adjacencu matrices of feature and citation graph in hubs removal part, the feature graph has significantly more non-zero values than the citation graph. I don't know. Maybe we should ask TAs...

*Some of these additional things are on my branch.

canaltinigne commented 4 years ago

Isn't the question asking number of paths with length of 1 to 10 ?

https://i.stack.imgur.com/V1b1i.jpg

JelenaBanjac commented 4 years ago

Isn't the question asking number of paths with length of 1 to 10 ?

https://i.stack.imgur.com/V1b1i.jpg

Yes, I think that is the question. We calculate paths of length 1, the paths of length 2,..., paths of length 10. Then we sum them (element-wise, what else) into P. So the Pij will give the total number of paths from i to j with the length of 0 to 10 (inclusive).


Okey, so this is how I understand the question-answer (for example N=2):

From the wiki they gave us: A0, the element (i, j) just gives the number of walks/paths of length 0 (so the diagonal of the matrix are all 1s). A1, the element (i, j) just gives the number of walks/paths of length 1 (not including walks of length 0). A2, the element (i, j) just gives the number of walks/paths of length 2 (not including walks of length 1 and 0). A3, the element (i, j) just gives the number of walks/paths of length 3 (not including walks of length 2, 1 and 0).

In the notebook, they define Ck(i, j) as the number of paths of length k from node i to j. So it seems that Ak[i, j] = Ck(i, j)

In the end, we need path matrix P, which is defined by the sum there. For our example N=2, the sum will be: Pij = C0(i, j) + C1(i, j) + C2(i, j) = A0[i, j] + A1[i, j] + A2[i, j].

So, in code:

# this is only matrix A^2
np.linalg.matrix_power(A, 2)

# this is the path P (the sum, until including N=2)
np.sum([np.linalg.matrix_power(A, i) for i in range(3)], axis=0)
psiul commented 4 years ago

Hey folks,

-For the path matrix, I thought about it again. You are right, I am sorry. It has to be a sum.

-For the overflow issue, since we know that the diameter is 2, the plot should be all black. Maybe we can just leave a note on that.

-For the moments, I agree with Jelena's analysis. "For finite-size samples drawn from such distribution, this behavior implies that the central moment estimators (like the mean and the variance) for diverging moments will never converge – as more data is accumulated, they continue to grow." We can only test whether that holds by increasing the size of the citation graph. My guess is that the feature graph moments explode with the exponent (as expected) but do not change for larger graphs.

See you in class!