amazon-science / tgl

Apache License 2.0
192 stars 31 forks source link

Segmentation fault (core dumped) when extending TGN to multiple layers #20

Open JakubReha opened 1 year ago

JakubReha commented 1 year ago

I get the Segmentation fault (core dumped), when I try to extend TGN to multiple layers TGN.yml the same way TGAT is defined. It seems that having multiple layers together with the "recent" sampling strategy is not compatible. Are you not supporting this setting?

JakubReha commented 1 year ago

2023-04-12 14_11_13-TGN yml (2) txt - Visual Studio Code  Administrator

This is my TGN.yml. The segmentation fault happens at line 199 in train.py when calling "sampler.sample(root_nodes, ts)".

shadow150519 commented 5 months ago

hello, do you work out this problem,I also get segment fault when using multi-hop recent sample, but multi-hop uniform sample works well

shadow150519 commented 5 months ago

well,I found why this segment fault will happen If you use 2 layer rencent sample and when you sample second layer, in the sample function

        void sample(std::vector<NodeIDType> &root_nodes, std::vector<TimeStampType> &root_ts)
        {
            // a weird bug, dgl library seems to modify the total number of threads
            omp_set_num_threads(num_threads);
            ret.resize(0);
            bool first_layer = true;
            bool use_ptr = false;
            for (int i = 0; i < num_layers; i++)
            { 
                ret.resize(ret.size() + num_history); 
                if ((first_layer) || ((prop_time) && num_history == 1) || (recent)) // here is why the bug occur, since we use recent sample, use_ptr will be true
                {
                    first_layer = false;
                    use_ptr = true;
                }
                else
                    use_ptr = false;
                if (i==0){
                    sample_layer(root_nodes, root_ts, num_neighbors[i], use_ptr, true);
                }
                else
                {                
                    sample_layer(root_nodes, root_ts, num_neighbors[i], use_ptr, false);
                }
            }
        }

in sampler_layer function, if use_ptr is true it will call update_ts_ptr, but since from_root is false, root_nodes is actually a nullptr, so it will crash in update_ts_ptr, I think the solution is initializing root_nodes and root_ts if from_root is false

        void sample_layer(std::vector<NodeIDType> &_root_nodes, std::vector<TimeStampType> &_root_ts,
                          int neighs, bool use_ptr, bool from_root)
        {
            double t_s = omp_get_wtime();
            std::vector<NodeIDType> *root_nodes;
            std::vector<TimeStampType> *root_ts;
            if (from_root)
            {
                root_nodes = &_root_nodes;
                root_ts = &_root_ts;
            }else { // add a else to initialize them
                root_nodes = &(ret[ret.size() - 1 - num_history].nodes);
                root_ts = &(ret[ret.size() - 1 - num_history].ts);
            }
            double t_ptr_s = omp_get_wtime();
            if (use_ptr) 
                update_ts_ptr(num_history, *root_nodes, *root_ts, 0);
            ret[0].ptr_time += omp_get_wtime() - t_ptr_s;

this solution works for me :)