000Justin000 / gnn-residual-correlation


About preprocessing the data #1

Closed padmaksha18 closed 3 years ago

padmaksha18 commented 3 years ago

Hello authors, can you kindly advise on how the adjacency.txt file has been converted to the adjacency matrix? I do not know Julia and will be doing it in Python/PyTorch. I also do not understand how the raw data has been converted into all the files in the /data directory that the training code reads. Some advice on the data conversion would be really helpful. Thanks!

000Justin000 commented 3 years ago

Thanks for your interest!

Basically, each county (represented by its FIPS code) is a vertex. We add undirected edges connecting a vertex to the other counties that share its physical borders (given by adjacency.txt). For example, lines 1-6 in adjacency.txt tell us that "Autauga County, AL" is connected to 5 other counties ("Chilton County, AL", "Dallas County, AL", "Elmore County, AL", "Lowndes County, AL", "Montgomery County, AL"), so we add 5 undirected edges to the graph. Does this answer your question?
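
For anyone working in Python, here is a minimal sketch of that construction. It assumes you have already parsed adjacency.txt into (county FIPS, neighbor FIPS) pairs, since the file's exact column layout may differ; the function name is hypothetical.

```python
# A sketch: build a symmetric 0/1 adjacency matrix from (county FIPS,
# neighbor FIPS) pairs parsed out of adjacency.txt. The parsing step is
# assumed; check the file's actual column layout before using this.
import numpy as np

def build_adjacency(pairs):
    """pairs: iterable of (fips, neighbor_fips) tuples."""
    fips_ids = sorted({f for pair in pairs for f in pair})
    index = {f: i for i, f in enumerate(fips_ids)}   # FIPS -> matrix row/col
    A = np.zeros((len(fips_ids), len(fips_ids)), dtype=np.int8)
    for u, v in pairs:
        if u == v:                                   # skip self-loops
            continue
        A[index[u], index[v]] = 1                    # undirected edge:
        A[index[v], index[u]] = 1                    # set both directions
    return A, index

# The Autauga County, AL example above, using census FIPS codes:
pairs = [(1001, 1021), (1001, 1047), (1001, 1051), (1001, 1085), (1001, 1101)]
A, index = build_adjacency(pairs)                    # 6x6 matrix, 5 edges
```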

padmaksha18 commented 3 years ago

Hi, thank you so much for your quick response. In the case of a dynamic graph whose edges are updated over time, do we need to first create the graph and then update the adjacency matrix at each step using a library like networkx? Thank you.

000Justin000 commented 3 years ago

It really depends on your application. From your description, I would say yes~

padmaksha18 commented 3 years ago

Thanks again! Actually, in my application, I am predicting COVID cases across counties and using the mobility flow between counties as edge features. The graph edges keep changing over time, so I was thinking of dynamically forming the graph at each step and then building the adjacency matrix from it. The code here just requires the adjacency matrix for the correlations. Thanks
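
One possible shape for this (a sketch, not from this repo): load each time step's mobility flows into a fresh networkx graph and export it as a weighted adjacency matrix. The function and variable names below are hypothetical.

```python
# A sketch: rebuild a weighted graph from one time step's mobility flows
# with networkx, then export the adjacency matrix the training code expects.
import networkx as nx
import numpy as np

def adjacency_from_flows(flows, counties):
    """flows: iterable of (src_fips, dst_fips, flow) for one time step."""
    G = nx.Graph()
    G.add_nodes_from(counties)          # fix the node set across time steps
    for u, v, w in flows:
        if u != v:                      # ignore within-county flow
            G.add_edge(u, v, weight=w)
    # nodelist pins rows/columns to a consistent county ordering
    return nx.to_numpy_array(G, nodelist=counties, weight="weight")

counties = [1001, 1021, 1047, 1051, 1085, 1101]       # hypothetical subset
flows_t = [(1001, 1101, 350.0), (1001, 1051, 120.0)]  # made-up flows
A_t = adjacency_from_flows(flows_t, counties)
```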

padmaksha18 commented 3 years ago

Hello authors, I have a small doubt about the adjacency matrix. When I print A.npy, I see values like "2". I thought the entries would be either "0" or "1" depending on whether an edge exists between the counties or not. Can you kindly clarify? Thanks

000Justin000 commented 3 years ago

Hey,

I guess you are talking about the A.npy in Junwen's implementation. I think the reason there are values like "2" could be that 1) a self-loop is added from a county to itself, or 2) an edge is first added from A to B and then another edge is added from B to A.

In theory, you want to 1) remove self-loops and 2) remove duplicated edges, which is what we did in the Julia implementation. However, the Python implementation closely replicates the original results in the KDD paper, which indicates that these minor differences in preprocessing do not matter much in our application.

For your application, that could be different. I would suggest you 1) remove self-loops and 2) remove duplicated edges, just to be safe.
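
In Python, those two clean-up steps might look like the following sketch, assuming a dense matrix loaded from A.npy:

```python
# A sketch of the two clean-up steps, applied to a dense matrix from A.npy.
import numpy as np

A = np.load("A.npy")                 # path assumed; adjust to your layout
np.fill_diagonal(A, 0)               # 1) remove self-loops
A = ((A + A.T) > 0).astype(np.int8)  # 2) symmetrize and collapse duplicate
                                     #    entries (like "2") back to 0/1
```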

Best, Junteng

padmaksha18 commented 3 years ago

Hi Junteng, thank you so much for your suggestions. I understand your point that for infection mobility flow among counties a self-loop does not make sense, as we are considering inter-county flow only. But if there is mobility flow between two counties in both directions, doesn't it make sense to give that edge more weight, say "2", in the adjacency matrix, and "1" when there is mobility flow in one direction only? Thank you!

000Justin000 commented 3 years ago

If there is mobility flow in one direction only, you might want to use a directed graph.
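
For illustration, a directed graph just means the adjacency matrix is allowed to be asymmetric:

```python
# Illustration only: in a directed adjacency matrix, one-way flow is
# A[i, j] = 1 with A[j, i] = 0.
import numpy as np

A = np.zeros((3, 3), dtype=np.int8)
A[0, 1] = 1    # flow from county 0 to county 1 only
A[1, 2] = 1    # flow from county 1 to county 2 ...
A[2, 1] = 1    # ... and back, so counties 1 and 2 exchange flow both ways
```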

padmaksha18 commented 3 years ago

Hi Justin, thank you for your kind response. I have a fundamental doubt about the semi-supervised approach used in this paper: you have done the training by minimizing the error between the ground truth and the base predictions of the GNN model built on 30% of the original data. But I did not understand how the prediction on one vertex will have a correlation with the prediction on a new vertex, as in your case of US election vote-share prediction at the county level. I am planning to build a similar network for forecasting COVID deaths/new cases at the county level; can I draw such a correlation in my case? Some insights into this would be really helpful. Thanking you in advance.

000Justin000 commented 3 years ago

"But I did not understand how the result of the prediction on one vertex will have a correlation with the prediction on a new vertex, like in your case considering US elections vote share prediction at the county level."

Our assumption is that the prediction residuals on adjacent vertices are correlated. Intuitively this is true because when you make predictions on individual vertices you likely overlook some factors, and those factors tend to be correlated among neighboring vertices. Therefore they cause the prediction residuals (the things you are not capturing) to be correlated.

I would be surprised if such a correlation does not exist in your case, considering that COVID spreads geographically.
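
One way to sanity-check this assumption on your own data (a sketch, not from the paper's code) is to compute the Pearson correlation of residual pairs across the edges of the graph:

```python
# A sanity-check sketch: correlation of residual pairs (r[u], r[v]) taken
# across the edges of the graph; a clearly positive value supports the
# residual-correlation assumption.
import numpy as np

def edge_residual_correlation(residuals, edges):
    """residuals: array with r[v] = y[v] - y_hat[v]; edges: list of (u, v)."""
    u, v = np.asarray(edges).T
    # count each undirected edge in both directions so the statistic is
    # symmetric in the two endpoints
    a = np.concatenate([residuals[u], residuals[v]])
    b = np.concatenate([residuals[v], residuals[u]])
    return np.corrcoef(a, b)[0, 1]
```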

padmaksha18 commented 3 years ago

Thanks again, Justin, for your explanation. So you mean that in addition to the 6 demographic features you considered as node features, there might be many other features that are not considered, and optimizing the error between the ground truth and the base prediction makes the model more robust by capturing those residual/error correlations. In your case, you considered about 30% of the vertices for training and made predictions on the remaining 70%; the CGNN model then learns the three parameters by optimizing the residual errors. But I did not understand the last line in the algorithm, i.e. the CGNN predictions: "Adding predicted residuals on the test counties to the GNN base prediction substantially increases accuracy." Can you please explain this a little? The CGNN model will be trained on the remaining 70% of the vertices/counties of the graph, that is what I understood. Thank you again.

000Justin000 commented 3 years ago

"The CGNN model will be trained on the remaining 70% of the vertices/counties of the graph, that is what I understood."

The CGNN model is trained on 30% of the vertices by maximizing the marginal likelihood.

padmaksha18 commented 3 years ago

Okay, then the GNN base predictions are made on the same 30% training data? I was thinking the test data sample is different from training. Please correct me.

000Justin000 commented 3 years ago

"the GNN base predictions are made on the same 30% training data" The GNN base predictions are trained on the same 30% training data. Later it is also used to generate base predictions on the rest 70% of vertices.

"test data sample is different from training" The test data samples are the rest 70% vertices, they are indeed different from training.

Think about the simplified way of training first: the GNN base predictor is trained on the 30% of data. However, there will still be prediction residuals, which simply means the training loss is non-zero. Now, fit the \alpha, \beta parameters of CGNN on the residuals on the 30% training data. Does this make sense?

Now we take it a step further. Instead of first learning the GNN weights and then learning \alpha, \beta, we train those parameters together.
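
For concreteness, here is a minimal numpy sketch of the simplified two-stage version described above (an interpretation, not the repo's code). It assumes the residuals are modeled as a zero-mean Gaussian with precision Gamma = beta * (I - alpha * S), where S is a normalized adjacency matrix; all names are hypothetical.

```python
# A numpy sketch of the simplified two-stage recipe, assuming residuals
# follow a zero-mean Gaussian with precision Gamma = beta * (I - alpha * S)
# for a normalized adjacency S (positive definite when beta > 0, |alpha| < 1).
import numpy as np

def neg_log_marginal_likelihood(theta, S, r_train, train_idx):
    """Objective for fitting (alpha, beta) on the 30% training residuals."""
    alpha, beta = theta
    n = S.shape[0]
    Gamma = beta * (np.eye(n) - alpha * S)
    test_idx = np.setdiff1d(np.arange(n), train_idx)
    G_ll = Gamma[np.ix_(train_idx, train_idx)]
    G_lu = Gamma[np.ix_(train_idx, test_idx)]
    G_uu = Gamma[np.ix_(test_idx, test_idx)]
    # marginal precision of the labeled block = Schur complement onto it
    P = G_ll - G_lu @ np.linalg.solve(G_uu, G_lu.T)
    _, logdet = np.linalg.slogdet(P)
    return 0.5 * (r_train @ P @ r_train - logdet)

def predict_test_residuals(alpha, beta, S, r_train, train_idx):
    """Conditional mean of a zero-mean Gaussian: E[r_U|r_L] = -G_UU^-1 G_UL r_L."""
    n = S.shape[0]
    Gamma = beta * (np.eye(n) - alpha * S)
    test_idx = np.setdiff1d(np.arange(n), train_idx)
    G_uu = Gamma[np.ix_(test_idx, test_idx)]
    G_ul = Gamma[np.ix_(test_idx, train_idx)]
    return -np.linalg.solve(G_uu, G_ul @ r_train)
```

Under these assumptions, alpha and beta could be fit by minimizing the first function with, e.g., scipy.optimize.minimize, and the final prediction on the test vertices is the GNN base prediction plus the conditioned residuals, matching the "adding predicted residuals" line in the algorithm.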

padmaksha18 commented 3 years ago

Hi Justin, thank you, it is clear to me now. Just one question about the 30% training data: is this a randomly chosen 30% of the ~3K US counties? I am just curious how you decided that 30% of the data is enough for this kind of training. Is it just an assumption? Thank you

000Justin000 commented 3 years ago

From my experience, 900 data points are not too few. Of course, to avoid overfitting you need to add regularization, avoid overly high hidden dimensions, etc.

padmaksha18 commented 3 years ago

Thank you!