alphabetsoup / smla1

Statistical Machine Learning Assignment 1

Split data into training and test sets so we can tweak locally and prevent overfitting #8

Open stevenxxiu opened 9 years ago

alphabetsoup commented 9 years ago

Can you explain this further Steve?

e.g. do you mean extract a test data set from the training data without replacement and then use that to compute a test AUC?

stevenxxiu commented 9 years ago

Yes, that's it. I've set this up in master, but it's giving AUCs a lot higher than what Kaggle is telling us. Can you take a look at the sampling process and see if there's anything wrong with it?

alphabetsoup commented 9 years ago

Sure, I'll have a look

alphabetsoup commented 9 years ago

Does main.dev() do the training?

stevenxxiu commented 9 years ago

Yes, you can just run main.py dev; gen_classif_data does the sampling. It removes edges from the graph inside the with block and restores them once the block finishes.
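
The pattern is roughly this (just a sketch with made-up names, not the actual code — that's further down the thread):

from contextlib import contextmanager

@contextmanager
def held_out_edges(g, edges):
    # hypothetical helper: temporarily drop the sampled edges from the graph
    for u, v in edges:
        g.remove_edge(u, v)
    try:
        yield edges
    finally:
        # put them back once the with block finishes
        for u, v in edges:
            g.add_edge(u, v)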

alphabetsoup commented 9 years ago

OK, so in your dev method:

1: def dev(g, pipeline):
2:     with gen_classif_data(g, 1000) as (dev_edges, dev_y):
3:         with gen_classif_data(g, 1000) as (train_edges, train_y):
4:             pipeline.fit(train_edges, train_y)
5:         print('training auc: {}'.format(roc_auc_score(
6:             train_y, pipeline.predict_proba(train_edges)[:, list(pipeline.classes_).index(1)]
7:         )))
8:         print('dev auc: {}'.format(roc_auc_score(
9:             dev_y, pipeline.predict_proba(dev_edges)[:, list(pipeline.classes_).index(1)]
10:        )))

Does line 2 extract 1000 edges first, and then does line 3 train on another 1000 edges not including those from line 2?

Next question, assuming the above is true, are the extracted 1000 edges in each case always y=1? If this is true, don't we need some 1000 y=0 cases to help place the decision boundary?

alphabetsoup commented 9 years ago

I'm reading gen_classif_data now; it seems you extract non_edges too. I'll just run some tests.

stevenxxiu commented 9 years ago

Does line 2 extract 1000 edges first, and then does line 3 train on another 1000 edges not including those from line 2?

Yes, that's it. It extracts 2000 pairs: it removes 1000 edges from the graph and generates 1000 pairs that are not in the graph.

alphabetsoup commented 9 years ago

On the AUC difference, two things to note:

1) the original graph is already missing 100 edges, so this has skewed our training data
2) by removing 1000 more, the remaining training set is even more skewed

I'll try both increasing the training sample size, and also bagging, to try to remove the skew.

alphabetsoup commented 9 years ago

On an initial test running dev several times, the variance between two predictions is quite large. The max individual difference is >0.7 on most occasions, which would definitely affect AUC!

stevenxxiu commented 9 years ago

What do you mean by variance between two predictions? The test set changes on each run; perhaps we should make it fixed by adding a random seed or saving it to disk?

alphabetsoup commented 9 years ago

By variance I mean basically the difference between two runs.

Rather than fixing the random seed, we should take advantage of the opposite: run it a lot and watch the variance of each prediction on the test data. If there's a consistent distribution, to my understanding we can use bagging to obtain a better result.
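
Something like this is what I have in mind (just a sketch; it assumes the gen_classif_data and pipeline names from main.py, and uses sklearn's clone so each run trains a fresh model):

import numpy as np
from sklearn.base import clone
from sklearn.metrics import roc_auc_score

def bagged_dev_auc(g, pipeline, dev_edges, dev_y, n_runs=25):
    # train on several independent samples and average the predicted probabilities
    probas = []
    for _ in range(n_runs):
        with gen_classif_data(g, 1000) as (train_edges, train_y):
            clf = clone(pipeline)
            clf.fit(train_edges, train_y)
        pos = list(clf.classes_).index(1)
        probas.append(clf.predict_proba(dev_edges)[:, pos])
    probas = np.array(probas)
    # the per-prediction spread across runs is the variance I'm talking about
    print('max std over runs: {:.3f}'.format(probas.std(axis=0).max()))
    return roc_auc_score(dev_y, probas.mean(axis=0))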

alphabetsoup commented 9 years ago

Can I implement bagging (and at least contribute to your code)?

stevenxxiu commented 9 years ago

Sure, give it a try. I think that belongs in a separate issue though? Here my goal was to get the dev set's AUC to match the one on Kaggle, so we don't have to submit to Kaggle for parameter optimisation.

alphabetsoup commented 9 years ago

I think it's related to that issue. Do you see what I mean regarding bias of the training data?

stevenxxiu commented 9 years ago

Why is the training data biased? With some edges removed we can expect higher variance, but only slightly so since we are removing 1000 out of millions.

alphabetsoup commented 9 years ago

True, it's small, but I wanted to see if it was there.

Using an average of 25 trained bootstrap samples made no difference to the test AUC on Kaggle :(

stevenxxiu commented 9 years ago

I've tried the following things, but none improved AUC on Kaggle. Denote u as the 'from' node, v as the 'to' node.

Things actually got worse: the AUC went to 71%, but locally it's 95%.

I also plotted a histogram, but the local dev set looks very similar to the official test set.

stevenxxiu commented 9 years ago

Here's the sampling code:

@contextmanager
def gen_classif_data(g, n):
    # sample some edges and non-edges
    us = set()
    edges = []
    while len(edges) < n:
        u = random.randrange(g.num_vertices)
        if u in us:
            continue
        if not (len(g.out_dict[u]) >= 2 and len(g.in_dict[u]) >= 1):
            continue
        v = random.choice(g.out_dict[u])
        if not len(g.in_dict[v]) >= 2:
            continue
        # remove edges since the edges to predict are not supposed to be in the training graph
        g.remove_edge(u, v)
        edges.append((u, v))
        us.add(u)
    non_edges = []
    while len(non_edges) < n:
        u = random.randrange(g.num_vertices)
        if u in us:
            continue
        v = random.randrange(g.num_vertices)
        if not (u != v and v not in g.out_dict[u]):
            continue
        if not (len(g.out_dict[u]) >= 1 and len(g.in_dict[u]) >= 1 and len(g.in_dict[v]) >= 1):
            continue
        non_edges.append((u, v))
        us.add(u)
    yield np.array(edges + non_edges), np.hstack([np.ones(n), np.zeros(n)])
    for u, v in edges:
        g.add_edge(u, v)
alphabetsoup commented 9 years ago

I did a fairly naive analysis and found that almost all the v nodes in the test set do not follow other nodes.

alphabetsoup commented 9 years ago

Steve, I'm having trouble reading your code again, sorry. Since you're working in a group, please use meaningful variable names. What is "us"?

Edit: us appears to be the set of "from" nodes for edges that have already been included in the training data.

stevenxxiu commented 9 years ago

us is just many 'u's, I thought this was standard? You name a list of 'x's as xs and so forth.

Its purpose is to make sure that no 'from' node repeats itself.

But is there a name that would be clearer to you?

alphabetsoup commented 9 years ago

Your code

if not len(g.in_dict[v]) >= 2:
    continue

ensures that the "To" nodes for all training edges have degree >=2, right? What is the specific reason for this that we can write in the report? Does it help?

stevenxxiu commented 9 years ago

In-degree >= 2. I'm just trying to approximate the test set here. Apparently every pair in the test set has v's in-degree >= 1 (and u's in- and out-degree as well, but not v's out-degree), so requiring in-degree >= 2 means the condition still holds after we remove the edge. I think they restrict it to make it harder for us to classify based on degree alone, say, classifying an edge as positive if the degree is 0.
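
A quick way to check these constraints (assuming test_edges is loaded as (u, v) pairs alongside the graph g from main.py):

from collections import Counter

# these should all print values >= 1 if the claim holds
print(min(len(g.in_dict[v]) for u, v in test_edges))   # v's in-degree
print(min(len(g.in_dict[u]) for u, v in test_edges))   # u's in-degree
print(min(len(g.out_dict[u]) for u, v in test_edges))  # u's out-degree
# v's out-degree is the only one that can be 0
print(sorted(Counter(len(g.out_dict[v]) for u, v in test_edges).items())[:5])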

stevenxxiu commented 9 years ago

Also posted this on the LMS; hopefully Ben will reply, and ideally give us the sampling code!

alphabetsoup commented 9 years ago

Good point on the positive edge if degree 0. The test set has been selected to disallow this.

I suspect that determining the nature of the sampling code would give us an unfair advantage... that was the impression I got from the tutor anyway.

Regarding the v nodes in the test set: since they mostly do not follow any nodes in the training data (i.e. they have no outgoing edges), I want to make this a condition of our sampling.

stevenxxiu commented 9 years ago

Are you sure about this?

>>> sorted(Counter(list(len(g.out_dict[v]) for u, v in test_edges)).items())[:10]
[(0, 1631),
 (1, 2),
 (3, 2),
 (4, 1),
 (5, 1),
 (6, 3),
 (8, 1),
 (9, 3),
 (10, 1),
 (11, 1)]

This (v's out-degree) is the only one of the degrees that includes 0.

I suspect that determining the nature of the sampling code would give us an unfair advantage... that was the impression I got from the tutor anyway.

You'd think that the sampling would be known in real life though. Say, to predict who follows whom next, we'd restrict to the latest time interval as the test set, repeat for the dev set, and use the rest as training.

alphabetsoup commented 9 years ago

From what I understand of your code above, 1631 edges point to v nodes which do not follow other nodes. Correct?

alphabetsoup commented 9 years ago
@contextmanager
def gen_classif_data(g, n):
    # sample some edges and non-edges
    us = set()
    edges = []
    while len(edges) < n:
        u = random.randrange(g.num_vertices)
        if u in us:
            continue
        if not (len(g.out_dict[u]) >= 2 and len(g.in_dict[u]) >= 1):
            continue
        v = random.choice(g.out_dict[u])
        if not len(g.in_dict[v]) >= 2:
            continue
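        # new: skip v's that follow anyone, since almost all v nodes in the
        # official test set have no outgoing edges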
        if len(g.out_dict[v]) > 0:
            continue
        # remove edges since the edges to predict are not supposed to be in the training graph
        g.remove_edge(u, v)
        edges.append((u, v))
        us.add(u)
    non_edges = []
    while len(non_edges) < n:
        u = random.randrange(g.num_vertices)
        if u in us:
            continue
        v = random.randrange(g.num_vertices)
        if not (u != v and v not in g.out_dict[u]):
            continue
        if not (len(g.out_dict[u]) >= 1 and len(g.in_dict[u]) >= 1 and len(g.in_dict[v]) >= 1):
            continue
        non_edges.append((u, v))
        us.add(u)
    yield np.array(edges + non_edges), np.hstack([np.ones(n), np.zeros(n)])
    for u, v in edges:
        g.add_edge(u, v)
stevenxxiu commented 9 years ago

From what I understand of your code above, 1631 edges point to v nodes which do not follow other nodes. Correct?

They point from the v nodes. out_dict[v] is the list of nodes with an edge going out of v, so len(g.out_dict[v]) is the "out-degree" of v.
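
To spell out the data structure (a toy example; this assumes the adjacency dicts are plain dicts of lists, which is how they're used in gen_classif_data):

# toy graph with 3 nodes: 0 -> 1, 0 -> 2, 1 -> 2
out_dict = {0: [1, 2], 1: [2], 2: []}     # out_dict[v] = nodes that v follows
in_dict = {0: [], 1: [0], 2: [0, 1]}      # in_dict[v] = nodes that follow v
# len(out_dict[2]) == 0, so node 2 is one of the v's that doesn't follow anyone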

alphabetsoup commented 9 years ago

Point from v nodes.

V nodes are the "To" nodes, right?

stevenxxiu commented 9 years ago

V nodes are the "To" nodes, right?

An edge is represented as (from, to), here named (u, v), so yes, the v's are the "to" nodes.

alphabetsoup commented 9 years ago

Good, so in your code above and in my commit in laurence_test we both determined that out of all the edges in the test set, 1631 of these edges have v nodes that are black holes. They don't follow other nodes!

stevenxxiu commented 9 years ago

Yes that's right. But our sampling process is similar:

>>> sorted(Counter(list(len(g.out_dict[v]) for u, v in dev_edges)).items())[:10]
[(0, 1689),
 (1, 3),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 1),
 (6, 2),
 (7, 1),
 (8, 2),
 (11, 1)]
alphabetsoup commented 9 years ago

OK, but try it with the code I posted. I will do the same once my comp has finished rebuilding in 20 minutes.

alphabetsoup commented 9 years ago

How much worse does it make the Kaggle score?

stevenxxiu commented 9 years ago

0.71088, similar to my last one.

stevenxxiu commented 9 years ago

We should try to get the local AUC down to ~78% first, I think, to avoid wasting Kaggle entries.

alphabetsoup commented 9 years ago

When I include the change I stated above I get a huge jump in the dev/train AUC:

train auc:      0.9996
train accuracy: 0.9945
dev auc:        0.9991
dev accuracy:   0.9915

Did you get this?? If so, I won't submit to Kaggle.

stevenxxiu commented 9 years ago

No I'm still getting 95%. Did you include the training edges by mistake?

alphabetsoup commented 9 years ago

Check latest commit.

In that case I'll try one Kaggle submission, after I've used bagging.

How do I use the GraphBaggingClassifier? You have

 #pipeline = GraphBaggingClassifier(pipeline, 10) 

does this work?

stevenxxiu commented 9 years ago

I'm getting this:

train auc:      0.9247
train accuracy: 0.8655
dev auc:        0.9206
dev accuracy:   0.8535
stevenxxiu commented 9 years ago

OK, Ben's responded on the LMS: try comparing our features on the dev set vs the test set via histograms. We should try that now.
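
A sketch of what I mean, assuming we can pull the dev and test feature matrices out of the pipeline's feature step:

import numpy as np

def compare_histograms(dev_X, test_X, bins=20):
    # dev_X, test_X: feature matrices for the local dev edges and the official test edges
    for i in range(dev_X.shape[1]):
        lo = min(dev_X[:, i].min(), test_X[:, i].min())
        hi = max(dev_X[:, i].max(), test_X[:, i].max())
        edges = np.linspace(lo, hi, bins + 1)
        dev_counts, _ = np.histogram(dev_X[:, i], bins=edges)
        test_counts, _ = np.histogram(test_X[:, i], bins=edges)
        # normalise counts so the two sets are comparable despite different sizes
        print('feature {}:'.format(i))
        print('  dev :', np.round(dev_counts / dev_counts.sum(), 3))
        print('  test:', np.round(test_counts / test_counts.sum(), 3))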

alphabetsoup commented 9 years ago

I submitted twice to Kaggle.

The first had the new sampling technique (excluding v's that follow others) and excluded the degree feature: 0.79! The second was the same, but included the degree feature: 0.61405!!!

I know one run isn't a trend, but the degree feature should probably be treated with scrutiny.

stevenxxiu commented 9 years ago

Good find! We're definitely dumping that.

alphabetsoup commented 9 years ago

OK, Ben's responded on the LMS: try comparing our features on the dev set vs the test set via histograms.

Worth a try, good advice. I'll output some bins. Python histograms have bugs in my experience, so I'll do them in Matlab.

stevenxxiu commented 9 years ago

Interesting, degree features 3 and 4 look very different. They are both for v.

stevenxxiu commented 9 years ago

Maybe the test set samples edges uniformly at random, instead of sampling the start node first and then the end node?
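
i.e. roughly the difference between these two (sketch):

import random

def sample_edge_uniform(all_edges):
    # every edge is equally likely, so a node v shows up as the end node
    # in proportion to its in-degree
    return random.choice(all_edges)

def sample_edge_node_first(g):
    # pick u uniformly first, then one of u's out-edges (roughly what
    # gen_classif_data does, ignoring the degree filters): every u contributes
    # equally, no matter how many edges it has
    while True:
        u = random.randrange(g.num_vertices)
        if g.out_dict[u]:
            return u, random.choice(g.out_dict[u])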

alphabetsoup commented 9 years ago

You're right, the test set is more evenly spread towards nodes of higher degree. Degree[4] is the same for dev and test in my sample. Degree[3] is quite different: in dev, 700 have degree 0-1, but in test, that drops to 160.

hjohannes commented 9 years ago

Sorry guys, I'm trying to keep up with you but I'm pretty slow with Python.