NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.
Apache License 2.0

make_pair_iter pair_generator #273

Closed Joseph94m closed 5 years ago

Joseph94m commented 6 years ago

Hey,

I think pair_list = [] should be moved inside the while True loop of the make_pair_iter function in pair_generator.

Why? Because pair_list should be reset each time we yield a batch of training pairs to the model. As written, make_pair_iter never actually returns, since it yields from inside an infinite loop, so pair_list keeps growing across iterations and will eventually cause a memory problem.

def make_pair_iter(self, rel):
    rel_set = {}
    # pair_list = []  <- moved inside the while loop below
    for label, d1, d2 in rel:
        if d1 not in rel_set:
            rel_set[d1] = {}
        if label not in rel_set[d1]:
            rel_set[d1][label] = []
        rel_set[d1][label].append(d2)

    while True:
        pair_list = []  # reset on every iteration so the list cannot grow unboundedly
        rel_set_sample = random.sample(rel_set.keys(), self.config['query_per_iter'])
        for d1 in rel_set_sample:
            label_list = sorted(rel_set[d1].keys(), reverse=True)
            for hidx, high_label in enumerate(label_list[:-1]):
                for low_label in label_list[hidx+1:]:
                    for high_d2 in rel_set[d1][high_label]:
                        for low_d2 in rel_set[d1][low_label]:
                            pair_list.append((d1, high_d2, low_d2))
        yield pair_list
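
To illustrate the difference outside of MatchZoo, here is a minimal standalone sketch (not project code) of the two placements. Because a generator's local variables survive across yields, an accumulator created before the while True loop grows forever:

def leaky_iter():
    batch = []                      # initialized once; lives for the generator's lifetime
    while True:
        batch.extend(range(1000))   # every iteration appends to the same list
        yield batch                 # len(batch) grows by 1000 per call

def fixed_iter():
    while True:
        batch = list(range(1000))   # fresh list on every iteration
        yield batch                 # len(batch) stays constant at 1000

gen = leaky_iter()
print([len(next(gen)) for _ in range(3)])  # [1000, 2000, 3000] -- unbounded growth
gen = fixed_iter()
print([len(next(gen)) for _ in range(3)])  # [1000, 1000, 1000]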
bwanglzu commented 6 years ago

@Joseph94m thanks for your issue, we'll discuss it :) Do you think it's related to #264, #40, #233?

@faneshion this might also affect our 2.0 generator: https://github.com/faneshion/MatchZoo/blob/03e9bc0ac77edd5f299801511f550e25de965f7a/matchzoo/generators/point_generator.py#L64-L97

Joseph94m commented 6 years ago

Yes, I think the issue is related to the others because they also seem to be about memory leaks.

It also matches my case: when I increase the size of the training corpus and of the relation files, the program gets really slow. For example, with a 500 MB corpus and a 1 GB relation file (100k queries), the training time is abhorrently long. Maybe I misunderstood the configuration parameters? I tried reducing display_interval (the steps per epoch) from 10 to 1, and I also reduced query_per_iter from 50 to 10, but it is still taking a long time.
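
For scale, here is a rough back-of-the-envelope estimate of what make_pair_iter materializes per yield. The per-query document counts below are hypothetical, just to illustrate the multiplicative blow-up:

# Rough estimate of pairs produced per iteration by make_pair_iter.
# The numbers are hypothetical; substitute your own corpus statistics.
query_per_iter = 10   # queries sampled per yield
pos_per_query = 20    # docs with the higher label per query (assumed)
neg_per_query = 80    # docs with the lower label per query (assumed)

# With two labels, each query contributes pos_per_query * neg_per_query pairs.
pairs_per_iter = query_per_iter * pos_per_query * neg_per_query
print(pairs_per_iter)  # 16000 pairs materialized in memory per yield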

In addition to that, before I made the change to make_pair_iter, my program was running out of memory at around the 70th iteration. Since iterations are independent, that led me to suspect a memory leak somewhere, which is why I moved the initialization of pair_list inside the while loop to make it local.

bwanglzu commented 6 years ago

@Joseph94m Yes, this is a critical issue in MatchZoo. I'll discuss with other people about it.

Probably we'll first fix it in branch 2.0, then master.

daltonj commented 6 years ago

Any update on this issue?

bwanglzu commented 6 years ago

@daltonj We've implemented PairGenerator, PointGenerator and ListGenerator under branch 2.0. We're still working on the integration tests :)
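
For readers following along, a pair generator in that spirit might look like the sketch below. This is a hypothetical illustration of the fix discussed above (batch state created inside the loop, never accumulated across yields), not the actual branch-2.0 API:

import random

class PairGenerator:
    """Hypothetical sketch: yields a fresh pair batch on every iteration."""

    def __init__(self, rel_set, query_per_iter):
        self.rel_set = rel_set              # {query: {label: [doc, ...]}}
        self.query_per_iter = query_per_iter

    def __iter__(self):
        while True:
            pairs = []                      # local to each iteration: no accumulation
            for d1 in random.sample(list(self.rel_set), self.query_per_iter):
                labels = sorted(self.rel_set[d1], reverse=True)
                for i, high in enumerate(labels[:-1]):
                    for low in labels[i + 1:]:
                        for high_d2 in self.rel_set[d1][high]:
                            for low_d2 in self.rel_set[d1][low]:
                                pairs.append((d1, high_d2, low_d2))
            yield pairs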

uduse commented 5 years ago

Closed due to inactivity.