Closed. lwwlwwl closed this issue 1 year ago.
The issue still exists. The file-reading error comes from a mismatch between the expected size of indices.bin and its actual size: the actual file is twice as many bytes as expected. The doubling may come from the lines
src = np.concatenate((edges[0], edges[1]))
dst = np.concatenate((edges[1], edges[0]))
in the preprocessing script, before calling coo_matrix(). It seems the file-reading error can be fixed by changing the lines to
src = edges[0]
dst = edges[1]
However, doing so gives a segmentation fault at the end of the call, specifically after line 237 in graph_loader.cc:
std::cout << "Loading graph with " << dataset->num_nodes << " nodes and "
<< dataset->num_edges << " edges" << std::endl;
Do we need to double the edges to make the graph undirected, as the datagen script for products does?
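For reference, here is a minimal sketch of how the doubling shows up in the CSC arrays, using a toy edge list in place of the real edges array (numpy and scipy assumed):

import numpy as np
import scipy.sparse as sp

# Toy directed edge list standing in for the real 2 x E `edges` array.
edges = np.array([[0, 0, 1, 2],
                  [1, 2, 3, 3]], dtype=np.uint32)
num_nodes = int(edges.max()) + 1

def to_csc(src, dst):
    data = np.ones(src.size, dtype=np.uint8)
    return sp.coo_matrix((data, (dst, src)), shape=(num_nodes, num_nodes)).tocsc()

directed = to_csc(edges[0], edges[1])        # E entries in .indices
src = np.concatenate((edges[0], edges[1]))   # the two lines quoted above
dst = np.concatenate((edges[1], edges[0]))
undirected = to_csc(src, dst)                # 2E entries in .indices (no reciprocal edges here)

print(directed.indices.size, undirected.indices.size)  # prints 4 and 8 for this toy graph

Since indices.bin stores exactly the CSC indices as uint32, doubling the edge list doubles the file on disk. Note that NUM_EDGE in the example meta.txt below (3612134270) is exactly twice the edge count in the SNAP header (1806067135), which suggests the doubled, undirected form is what is expected.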
GNNLab can support other datasets. The user should convert the third-party graph data to the CSC format and then provide the metadata file.
As for com-friendster, the user should make sure these files exist:
$ tree /graph-learning/samgraph/com-friendster
/graph-learning/samgraph/com-friendster
├── indices.bin # csc indices stored as uint32
├── indptr.bin # csc indptr stored as uint32
├── meta.txt # dataset meta data
└── train_set.bin # trainset node id list as uint32
The graph is stored in CSC format in the indptr.bin and indices.bin files.
The metadata file is a plain-text file; the following is an example for the com-friendster dataset:
NUM_NODE 65608366
NUM_EDGE 3612134270
FEAT_DIM 140
NUM_CLASS 100
NUM_TRAIN_SET 1000000
NUM_VALID_SET 200000
NUM_TEST_SET 100000
The "train_set.bin" file is a binary file with trainset node IDs. The user needs to give the "test_set.bin" and "valid_set.bin" files when validating the training results.
There are no "feature" and "label" data in com-friendster dataset, so the user can not assign them. The gnnlab system will fill them with "random" values automatically when loading data from files.
Finally, the user can use the tools in GNNLab to generate the uint64 vertex ID format, the cache rank table, etc. (link)
The com-friendster dataset on the SNAP website is in text format, which is different from the products dataset, so the products processing code cannot be reused directly for com-friendster. You may need to create a new script for com-friendster that generates the indptr.bin, indices.bin, meta.txt, train_set.bin, test_set.bin and valid_set.bin files (a sketch is given after the sample below).
# Undirected graph: ../../data/output/friendster.txt
# Friendster
# Nodes: 65608366 Edges: 1806067135
# FromNodeId ToNodeId
101 102
101 104
101 107
...
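A hedged sketch of such a conversion script, using numpy and scipy (it assumes the node IDs have already been remapped to a dense 0..NUM_NODE-1 range, that the whole edge list fits in memory, and that features/labels are left to GNNLab's random fill; valid/test sets are omitted):

import numpy as np
import scipy.sparse as sp

edges = np.loadtxt("friendster.txt", dtype=np.int64, comments="#").T  # shape (2, E)
num_nodes = int(edges.max()) + 1

# Double the edges so the stored graph is undirected, matching the products script.
src = np.concatenate((edges[0], edges[1]))
dst = np.concatenate((edges[1], edges[0]))
csc = sp.coo_matrix((np.ones(src.size, dtype=np.uint8), (dst, src)),
                    shape=(num_nodes, num_nodes)).tocsc()

csc.indptr.astype(np.uint32).tofile("indptr.bin")
csc.indices.astype(np.uint32).tofile("indices.bin")

# Arbitrary training split; adjust the size (and add valid/test sets) as needed.
train = np.random.default_rng(0).choice(num_nodes, 1_000_000, replace=False)
train.astype(np.uint32).tofile("train_set.bin")

with open("meta.txt", "w") as f:
    f.write(f"NUM_NODE {num_nodes}\n")
    f.write(f"NUM_EDGE {csc.nnz}\n")
    f.write("FEAT_DIM 140\n")    # taken from the example meta.txt above
    f.write("NUM_CLASS 100\n")
    f.write(f"NUM_TRAIN_SET {train.size}\n")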
Thanks for your reply! The dataset was processed successfully and the datagen commands ran. I will close the issue.
I saw that GNNLab supports other datasets. Is there any particular procedure that should be followed? Specifically, I was trying to run ogb-mag240m with the citation edges. The dataset was processed as suggested above. When running the ./32to64 -g command, I got
~/gnnlab/utility/data-process/build# ./32to64 -g mag240m
--graph: not in {reddit,products,papers100M,com-friendster,uk-2006-05,twitter,sk-2005}
Run with --help for more information.
It seems that the command only accepts graph names from that list, so I just renamed the directory containing all the raw files needed for data generation to reddit. The provided tools then successfully generated the uint64 vertex ID format, cache rank table, etc. However, following the same steps, when training the dataset with python train_graphsage.py --dataset reddit --num-epoch 5 --batch-size 1000 --num-train-worker 2 --num-sample-worker 1 --fanout 20 15 10, I got
[2023-02-02 05:11:30.344883: F /app/source/fgnn/samgraph/common/cpu/cpu_device.cc:42] Check failed: (ret) == (0)
Aborted (core dumped)
I thought it might be because "there is insufficient memory available with the requested alignment", as suggested in the docs for posix_memalign(), but after decreasing the fanout to a single layer of 5, the error was still there.
To add a new dataset besides {reddit,products,papers100M,com-friendster,uk-2006-05,twitter,sk-2005}, these steps may help you (using ogb-mag240m as an example).
First, create a new directory in the dataset storage directory (e.g. /graph-learning/samgraph/ogb-mag240m), then store the dataset files described above (link) in that directory.
For the dataset processing tools, you can add ogb-mag240m at L42 of the file utility/data-process/common/options.cc (link).
For the GNNLab apps, you can add ogb-mag240m in the corresponding app file.
As for the core dumped problem, could you give detailed debug information, for example the backtrace info from gdb?
If you are sure this problem occurs during dataset loading, I guess it is caused by exceeding the available memory. You can check this by comparing the total size of the ogb-mag240m dataset files with the machine's memory size (especially the feature storage size: if there is no feat.bin file, the feature size equals NUM_NODE * FEAT_DIM * sizeof(float) from the meta file), or use the free command to monitor memory usage while running.
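As a concrete instance of this check, using the numbers from the example com-friendster meta.txt earlier in the thread (for comparison, the backtrace below shows an attempted allocation of 374021120000 bytes, roughly 348 GiB, during dataset loading of the mag240m run):

# Feature memory needed when there is no feat.bin: NUM_NODE * FEAT_DIM * sizeof(float32).
# Values below come from the example com-friendster meta.txt earlier in this thread.
num_node, feat_dim = 65_608_366, 140
feat_bytes = num_node * feat_dim * 4
print(f"{feat_bytes / 2**30:.1f} GiB")  # ~34.2 GiB of host memory for features alone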
Thanks for your reply. The backtrace for python train_graphsage.py --dataset twitter --batch-size 1000 --num-train-worker 3 --num-sample-worker 1 --num-epoch 5 is as follows:
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007f87369ff7f1 in __GI_abort () at abort.c:79
#2 0x00007f863f3cf839 in samgraph::common::LogMessageFatal::~LogMessageFatal (this=0x7ffe9cb37440, __in_chrg=<optimized out>,
__vtt_parm=<optimized out>) at /app/source/fgnn/samgraph/common/logging.cc:72
#3 0x00007f863f31b531 in samgraph::common::cpu::CPUDevice::AllocDataSpace (this=<optimized out>, ctx=..., nbytes=<optimized out>,
alignment=<optimized out>) at /app/source/fgnn/samgraph/common/cpu/cpu_device.cc:44
#4 0x00007f863f3fa4d7 in samgraph::common::WorkspacePool::Pool::Alloc (scale=1, nbytes=374021120000, device=<optimized out>, ctx=...,
this=0x556dfdaffd80) at /app/source/fgnn/samgraph/common/workspace_pool.cc:77
#5 samgraph::common::WorkspacePool::AllocWorkspace (this=0x556dfe622500, ctx=..., size=374021117952, scale=1)
at /app/source/fgnn/samgraph/common/workspace_pool.cc:177
#6 0x00007f863f3199e9 in samgraph::common::Tensor::EmptyNoScale (dtype=dtype@entry=samgraph::common::kF32, shape=..., ctx=..., name=...)
at /app/source/fgnn/samgraph/common/common.cc:163
#7 0x00007f863f3c8ae4 in samgraph::common::Engine::LoadGraphDataset (this=this@entry=0x556dfe52a1f0)
at /app/source/fgnn/samgraph/common/engine.cc:150
#8 0x00007f863f3a9664 in samgraph::common::dist::DistEngine::Init (this=0x556dfe52a1f0)
at /app/source/fgnn/samgraph/common/dist/dist_engine.cc:113
#9 0x00007f863f3d6998 in samgraph::common::samgraph_data_init () at /app/source/fgnn/samgraph/common/operation.cc:338
#10 0x00007f8735dd0630 in ffi_call_unix64 () from /miniconda3/envs/fgnn_env/lib/python3.8/lib-dynload/../../libffi.so.6
#11 0x00007f8735dcffed in ffi_call () from /miniconda3/envs/fgnn_env/lib/python3.8/lib-dynload/../../libffi.so.6
#12 0x00007f873706584a in _call_function_pointer (argcount=0, resmem=0x7ffe9cb38490, restype=<optimized out>, atypes=<optimized out>,
avalues=0x7ffe9cb38480, pProc=0x7f863f3d6910 <samgraph::common::samgraph_data_init()>, flags=4353)
at /usr/local/src/conda/python-3.8.0/Modules/_ctypes/callproc.c:853
#13 _ctypes_callproc (pProc=<optimized out>, argtuple=0x7f8737176040, flags=<optimized out>, argtypes=0x0, restype=<optimized out>,
checker=<optimized out>) at /usr/local/src/conda/python-3.8.0/Modules/_ctypes/callproc.c:1181
#14 0x00007f8737065f63 in PyCFuncPtr_call (self=<optimized out>, inargs=<optimized out>, kwds=0x0)
at /usr/local/src/conda/python-3.8.0/Modules/_ctypes/_ctypes.c:4135
#15 0x0000556dfaf0831f in _PyObject_MakeTpCall () at /tmp/build/80754af9/python_1573076469108/work/Objects/call.c:159
#16 0x0000556dfaf93b49 in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f863fca3d20, callable=0x7f863fc2b880)
at /tmp/build/80754af9/python_1573076469108/work/Include/cpython/abstract.h:125
#17 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x556dfb599720)
at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:4987
#18 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:3469
#19 0x0000556dfaf57ebb in function_code_fastcall (globals=<optimized out>, nargs=1, args=<optimized out>, co=<optimized out>)
at /tmp/build/80754af9/python_1573076469108/work/Objects/call.c:283
#20 _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, stack=0x7f863fca3b78, func=0x7f863fcdb0d0)
at /tmp/build/80754af9/python_1573076469108/work/Objects/call.c:410
#21 _PyObject_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, args=0x7f863fca3b78, callable=0x7f863fcdb0d0)
at /tmp/build/80754af9/python_1573076469108/work/Include/cpython/abstract.h:127
#22 method_vectorcall () at /tmp/build/80754af9/python_1573076469108/work/Objects/classobject.c:60
#23 0x0000556dfaed137e in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f863fca3b80, callable=0x7f863fc2c2c0)
at /tmp/build/80754af9/python_1573076469108/work/Include/cpython/abstract.h:127
#24 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x556dfb599720)
at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:4987
#25 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:3469
#26 0x0000556dfaf57b1b in function_code_fastcall (globals=<optimized out>, nargs=1, args=<optimized out>, co=<optimized out>)
at /tmp/build/80754af9/python_1573076469108/work/Objects/call.c:283
#27 _PyFunction_Vectorcall.localalias.357 () at /tmp/build/80754af9/python_1573076469108/work/Objects/call.c:410
#28 0x0000556dfaed13cc in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f87370b9970, callable=0x7f863fc2f310)
at /tmp/build/80754af9/python_1573076469108/work/Include/cpython/abstract.h:127
#29 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x556dfb599720)
at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:4987
#30 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:3500
#31 0x0000556dfaf56c32 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:4298
#32 0x0000556dfaf57a04 in PyEval_EvalCodeEx () at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:4327
#33 0x0000556dfafeaf5c in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>)
at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:718
#34 0x0000556dfafeb004 in run_eval_code_obj () at /tmp/build/80754af9/python_1573076469108/work/Python/pythonrun.c:1117
#35 0x0000556dfb020304 in run_mod () at /tmp/build/80754af9/python_1573076469108/work/Python/pythonrun.c:1139
#36 0x0000556dfb022c50 in PyRun_FileExFlags () at /tmp/build/80754af9/python_1573076469108/work/Python/pythonrun.c:1055
#37 0x0000556dfb022e41 in PyRun_SimpleFileExFlags () at /tmp/build/80754af9/python_1573076469108/work/Python/pythonrun.c:428
#38 0x0000556dfb023983 in pymain_run_file (cf=0x7ffe9cb38d68, config=0x556dfb598c00)
at /tmp/build/80754af9/python_1573076469108/work/Modules/main.c:381
#39 pymain_run_python (exitcode=0x7ffe9cb38d60) at /tmp/build/80754af9/python_1573076469108/work/Modules/main.c:565
#40 Py_RunMain () at /tmp/build/80754af9/python_1573076469108/work/Modules/main.c:644
#41 0x0000556dfb023bc9 in Py_BytesMain () at /tmp/build/80754af9/python_1573076469108/work/Modules/main.c:698
#42 0x00007f87369e0c87 in __libc_start_main (main=0x556dfaee0720 <main>, argc=12, argv=0x7ffe9cb38f58, init=<optimized out>,
fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe9cb38f48) at ../csu/libc-start.c:310
#43 0x0000556dfaface13 in _start () at ../sysdeps/x86_64/elf/start.S:103
Did you change the device context id? It seems like a wrong CPU context is given in the running code at frame #3 of the backtrace (code file).
I did not change the device context id. I'm not sure why the backtrace shows line 44 being triggered instead of line 42, where the Check failed: (ret) == (0) error was reported. Is there any way I can check whether I gave a wrong CPU context?
You can use git diff ... to compare your local code with our newest version of GNNLab; that way you may find the modification that causes this problem. If you cannot tell which modification causes it from the git diff result, you can attach the git diff output to your next reply.
Neither git diff ... nor git diff in the gnnlab directory returns any output, so I think no modification was made on my side. FYI, I'm using a docker environment where there is a default repo in /app/source/fgnn (based on the backtrace above, it looks like it is using that?), and I cloned the latest code from GitHub to ~/gnnlab and run the train script from there.
Can you print out ctx.device_id at frame 3?
Sorry, I'm not very familiar with gdb. It shows <optimized out>:
(gdb) select-frame 3
(gdb) p ctx.device_id
$1 = <optimized out>
(gdb) info frame
Stack level 3, frame at 0x7ffe69d2cd00:
rip = 0x7f352d160531 in samgraph::common::cpu::CPUDevice::AllocDataSpace
(/app/source/fgnn/samgraph/common/cpu/cpu_device.cc:44); saved rip = 0x7f352d23f4d7
called by frame at 0x7ffe69d2cd90, caller of frame at 0x7ffe69d2cb50
source language c++.
Arglist at 0x7ffe69d2cb48, args: this=<optimized out>, ctx=..., nbytes=<optimized out>,
alignment=<optimized out>
Locals at 0x7ffe69d2cb48, Previous frame's sp is 0x7ffe69d2cd00
Saved registers:
rbx at 0x7ffe69d2cce0, rbp at 0x7ffe69d2cce8, r12 at 0x7ffe69d2ccf0,
rip at 0x7ffe69d2ccf8
Then you can print ctx.device_id to the console before /app/source/fgnn/samgraph/common/cpu/cpu_device.cc:44, recompile the code and see what happens.
I saw you created issue #7, where you modified code to run GNNLab. Are you sure there's no output from git diff? Can you verify which binary you are running, the gnnlab under /app or the one under ~?
It keeps showing <optimized out> as the value for ctx.device_id. I'm not really sure how to follow this.
Yeah, there is no output for git diff. For the current docker, I changed the value of mq_size to 3300 in memory_queue.h (I did not change the mq_nbytes value) and then built the docker image. After going into the docker, I cloned the latest repo to ~, then cd to ~/gnnlab/example/samgraph/multi_gpu/ and run train_graphsage.py.
I mean modify the source code, using cout or LOG(ERROR), to print the value to the console.
Then what's the result of git diff on gnnlab outside the docker, or in /app?
Also, does this abort appear only on your custom dataset, or on the other previously provided datasets as well?
Do I need to modify the source code, rebuild the docker and rerun the command?
git diff in /app/source/fgnn complains not a git repository.
This abort only appears for mag240m. I tried products, papers100M and com-friendster; these all worked fine.
There are too many copies of gnnlab in your docker environment now. There's (a) one outside your docker where you modified mq_size, (b) a copy of the modified version in the /app path of the docker image, created when you build the image, and (c) a freshly cloned one at ~ inside the container. If you have only built the image and did not manually build gnnlab using ./build.sh or python setup.py xxx, then the binary you are running was compiled from (b). This may lead to strange behavior, since the C++ files, the binary, and the installed python module come from (b) while the python scripts come from (c).
I'm not sure whether the docker copy preserves the .git folder, since git diff in /app/source/fgnn complains not a git repository, so it's also not easy to tell whether you made additional changes to the source code.
I suggest you check and report the output of git diff outside your docker, since the binary you are running was likely compiled from that same copy of the source code. Then apply the necessary modifications to the gnnlab source in ~ inside the container and use ./build.sh to recompile and install it. This overrides both the binary and the python module in your container from (a)/(b) to (c). It should make our discussion clearer and make it easier to identify the source of the problem.
P.S. If you want to preserve modifications across the container lifetime, you may need to recreate the container from the existing image and add a volume mapping.
git diff outside the docker shows the only difference is the change to mq_size. I changed that value in (c) and ran ./build.sh, so all three versions should be the same now.
The value of ctx.device_id is 1.
That's strange; 1 should be covered by the previous if statement. Can you reproduce the abort's backtrace? The abort line may mismatch due to the multiple copies.
[2023-02-10 03:11:31.118232: F /root/gnnlab/samgraph/common/cpu/cpu_device.cc:44] Check failed: (ret) == (0)
Thread 1 "python" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007f88a9a807f1 in __GI_abort () at abort.c:79
#2 0x00007f87b2450949 in samgraph::common::LogMessageFatal::~LogMessageFatal (this=0x7ffd8ec05370,
__in_chrg=<optimized out>, __vtt_parm=<optimized out>) at /root/gnnlab/samgraph/common/logging.cc:72
#3 0x00007f87b239c60f in samgraph::common::cpu::CPUDevice::AllocDataSpace (this=<optimized out>, ctx=...,
nbytes=374021120000, alignment=64) at /root/gnnlab/samgraph/common/cpu/cpu_device.cc:46
#4 0x00007f87b247b5e7 in samgraph::common::WorkspacePool::Pool::Alloc (scale=1, nbytes=374021120000,
device=<optimized out>, ctx=..., this=0x5614ded63d70) at /root/gnnlab/samgraph/common/workspace_pool.cc:77
#5 samgraph::common::WorkspacePool::AllocWorkspace (this=0x5614df8864f0, ctx=..., size=374021117952, scale=1)
at /root/gnnlab/samgraph/common/workspace_pool.cc:177
#6 0x00007f87b239aa19 in samgraph::common::Tensor::EmptyNoScale (dtype=dtype@entry=samgraph::common::kF32,
shape=..., ctx=..., name=...) at /root/gnnlab/samgraph/common/common.cc:163
#7 0x00007f87b2449bf4 in samgraph::common::Engine::LoadGraphDataset (this=this@entry=0x5614df78e1e0)
at /root/gnnlab/samgraph/common/engine.cc:150
#8 0x00007f87b242a774 in samgraph::common::dist::DistEngine::Init (this=0x5614df78e1e0)
at /root/gnnlab/samgraph/common/dist/dist_engine.cc:113
#9 0x00007f87b2457aa8 in samgraph::common::samgraph_data_init () at /root/gnnlab/samgraph/common/operation.cc:338
#10 0x00007f88a8e51630 in ffi_call_unix64 ()
from /miniconda3/envs/fgnn_env/lib/python3.8/lib-dynload/../../libffi.so.6
#11 0x00007f88a8e50fed in ffi_call () from /miniconda3/envs/fgnn_env/lib/python3.8/lib-dynload/../../libffi.so.6
#12 0x00007f88aa0e784a in _call_function_pointer (argcount=0, resmem=0x7ffd8ec063e0, restype=<optimized out>,
atypes=<optimized out>, avalues=0x7ffd8ec063d0, pProc=0x7f87b2457a20 <samgraph::common::samgraph_data_init()>,
flags=4353) at /usr/local/src/conda/python-3.8.0/Modules/_ctypes/callproc.c:853
#13 _ctypes_callproc (pProc=<optimized out>, argtuple=0x7f88aa1f8040, flags=<optimized out>, argtypes=0x0,
restype=<optimized out>, checker=<optimized out>)
at /usr/local/src/conda/python-3.8.0/Modules/_ctypes/callproc.c:1181
#14 0x00007f88aa0e7f63 in PyCFuncPtr_call (self=<optimized out>, inargs=<optimized out>, kwds=0x0)
at /usr/local/src/conda/python-3.8.0/Modules/_ctypes/_ctypes.c:4135
#15 0x00005614dc49731f in _PyObject_MakeTpCall ()
at /tmp/build/80754af9/python_1573076469108/work/Objects/call.c:159
#16 0x00005614dc522b49 in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f87b2d23d20,
callable=0x7f87b2cac880) at /tmp/build/80754af9/python_1573076469108/work/Include/cpython/abstract.h:125
#17 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x5614dc7fd720)
at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:4987
#18 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:3469
#19 0x00005614dc4e6ebb in function_code_fastcall (globals=<optimized out>, nargs=1, args=<optimized out>,
co=<optimized out>) at /tmp/build/80754af9/python_1573076469108/work/Objects/call.c:283
#20 _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, stack=0x7f87b2d23b78,
func=0x7f87b2d5c0d0) at /tmp/build/80754af9/python_1573076469108/work/Objects/call.c:410
#21 _PyObject_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, args=0x7f87b2d23b78,
callable=0x7f87b2d5c0d0) at /tmp/build/80754af9/python_1573076469108/work/Include/cpython/abstract.h:127
#22 method_vectorcall () at /tmp/build/80754af9/python_1573076469108/work/Objects/classobject.c:60
#23 0x00005614dc46037e in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f87b2d23b80,
callable=0x7f87b2ca7cc0) at /tmp/build/80754af9/python_1573076469108/work/Include/cpython/abstract.h:127
#24 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x5614dc7fd720)
at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:4987
#25 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:3469
#26 0x00005614dc4e6b1b in function_code_fastcall (globals=<optimized out>, nargs=1, args=<optimized out>,
co=<optimized out>) at /tmp/build/80754af9/python_1573076469108/work/Objects/call.c:283
#27 _PyFunction_Vectorcall.localalias.357 () at /tmp/build/80754af9/python_1573076469108/work/Objects/call.c:410
#28 0x00005614dc4603cc in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f88aa13b970,
callable=0x7f87b2cae310) at /tmp/build/80754af9/python_1573076469108/work/Include/cpython/abstract.h:127
#29 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x5614dc7fd720)
at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:4987
#30 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:3500
#31 0x00005614dc4e5c32 in _PyEval_EvalCodeWithName ()
at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:4298
#32 0x00005614dc4e6a04 in PyEval_EvalCodeEx () at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:4327
#33 0x00005614dc579f5c in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>)
at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:718
#34 0x00005614dc57a004 in run_eval_code_obj ()
at /tmp/build/80754af9/python_1573076469108/work/Python/pythonrun.c:1117
#35 0x00005614dc5af304 in run_mod () at /tmp/build/80754af9/python_1573076469108/work/Python/pythonrun.c:1139
#36 0x00005614dc5b1c50 in PyRun_FileExFlags ()
at /tmp/build/80754af9/python_1573076469108/work/Python/pythonrun.c:1055
#37 0x00005614dc5b1e41 in PyRun_SimpleFileExFlags ()
at /tmp/build/80754af9/python_1573076469108/work/Python/pythonrun.c:428
#38 0x00005614dc5b2983 in pymain_run_file (cf=0x7ffd8ec06cb8, config=0x5614dc7fcc00)
at /tmp/build/80754af9/python_1573076469108/work/Modules/main.c:381
#39 pymain_run_python (exitcode=0x7ffd8ec06cb0) at /tmp/build/80754af9/python_1573076469108/work/Modules/main.c:565
#40 Py_RunMain () at /tmp/build/80754af9/python_1573076469108/work/Modules/main.c:644
#41 0x00005614dc5b2bc9 in Py_BytesMain () at /tmp/build/80754af9/python_1573076469108/work/Modules/main.c:698
#42 0x00007f88a9a61c87 in __libc_start_main (main=0x5614dc46f720 <main>, argc=12, argv=0x7ffd8ec06ea8,
init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffd8ec06e98)
at ../csu/libc-start.c:310
#43 0x00005614dc53be13 in _start () at ../sysdeps/x86_64/elf/start.S:103
Line 44 corresponds to CHECK_EQ(ret, 0); and line 46 corresponds to CHECK(false);.
I don't understand why this happens… Can you show me how you added the cout? Also, to avoid unnoticed glitches, I suggest you build with ./build.sh -c, which cleans the temporary output first.
void *CPUDevice::AllocDataSpace(Context ctx, size_t nbytes, size_t alignment) {
  void *ptr;
  std::cout << "value of ctx.device_id: " << ctx.device_id << std::endl;
  if (ctx.device_id == CPU_CUDA_HOST_MALLOC_DEVICE) {
    CUDA_CALL(cudaHostAlloc(&ptr, nbytes, cudaHostAllocDefault));
  } else if (ctx.device_id == CPU_CLIB_MALLOC_DEVICE) {
    int ret = posix_memalign(&ptr, alignment, nbytes);
    CHECK_EQ(ret, 0);
  } else {
    CHECK(false);
  }
  return ptr;
}
Built with ./build.sh -c, and bt frame 3 gave the same info.
I think I know. The else branch's abort is combined with the second branch's abort due to compiler optimization; that's why the backtrace says line 46. The actual abort location is line 44, as the abort message reports, which I didn't notice. The error should be caused by allocating memory beyond capacity, since mag240m has a significantly large feature dimension.
The original mag240m uses float16 features, while GNNLab only supports float32 at the moment. Extending GNNLab to support fp16 would lower the memory occupation. The simplest method is to halve the dimension in meta.txt, then force-overwrite the dtype to fp16 and double the dimension when passing the extracted features to torch in operation.cc. (A sketch of the data-generation side is shown below.)
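A hypothetical sketch of the data-generation side of this trick (not GNNLab code): write the float16 features to disk and declare half the true dimension in meta.txt, so the byte count matches what the float32 loader expects; the operation.cc change that reinterprets the dtype and restores the dimension is not shown.

import numpy as np

# Illustrative sizes only, not the real mag240m values.
num_node, feat_dim = 1000, 768
feat_fp16 = np.random.rand(num_node, feat_dim).astype(np.float16)

# Raw fp16 bytes occupy num_node * (feat_dim // 2) * 4 bytes, so meta.txt can
# declare FEAT_DIM = feat_dim // 2 and keep the default float32 dtype on load.
feat_fp16.tofile("feat.bin")
assert feat_fp16.nbytes == num_node * (feat_dim // 2) * 4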
Thanks! It is working now
I'm trying to run the Friendster dataset using GNNLab but got stuck when preparing the dataset for training. I assumed com-friendster in the script is the same as Friendster. Specifically, I got an error when running ./32to64 -g com-friendster, ./cache-by-degree -g com-friendster and ./cache-by-random -g com-friendster. Following the datagen scripts for products, I used this script for data generation to get the binary files. I'm running these preprocessing steps on a machine without a GPU but with big enough memory; not sure if this is causing the problem.
cmake version 3.16.3
Just wondering if GNNLab supports custom datasets. If so, where can I find the tutorial for it? Thanks!