SJTU-IPADS / gnnlab

A Factored System for Sample-based GNN Training over GPUs
Apache License 2.0

Reading file error when generating cache rank table #9

Closed lwwlwwl closed 1 year ago

lwwlwwl commented 1 year ago

I'm trying to run the Friendster dataset with gnnlab but got stuck while preparing the dataset for training. I assumed com-friendster in the script is the same as Friendster. Specifically, I got

Root: /graph-learning/samgraph/
Graph: com-friendster
Threads: 48
64 bit: 0
Loading graph data from /graph-learning/samgraph/com-friendster/
Reading file error: /graph-learning/samgraph/com-friendster/indices.bin

when running ./32to64 -g com-friendster, ./cache-by-degree -g com-friendster and ./cache-by-random -g com-friendster.

Following the datagen scripts for products, I used this script for data generation to get the binary files. I'm running these preprocessing steps on a machine without a GPU but with big enough memory. Not sure if this is causing the problem.

cmake version 3.16.3

Just wondering if gnnlab supports custom datasets. If so, where can I find a tutorial for that? Thanks!

lwwlwwl commented 1 year ago

The issue still exists. The reading-file error comes from a mismatch between the expected size of indices.bin and its actual size: the actual size is double the expected one. The doubling may come from the lines

src = np.concatenate((edges[0], edges[1]))
dst = np.concatenate((edges[1], edges[0]))

in the preprocessing script before calling coo_matrix(). It seems that the reading-file error can be solved by changing the lines to

src = edges[0]
dst = edges[1]

However, doing so gives a segfault at the end of the call, specifically after line 237 in graph_loader.cc

std::cout << "Loading graph with " << dataset->num_nodes << " nodes and "
            << dataset->num_edges << " edges" << std::endl;

Do we need to double the edges for undirected graphs, as the datagen script for products does?

lixiaobai09 commented 1 year ago

GNNLab can support other datasets. The user should convert the third-party graph data to the "csc" data format and then provide the metadata file.

As for com-friendster, the user should make sure these files exist.

$ tree /graph-learning/samgraph/com-friendster
/graph-learning/samgraph/com-friendster
├── indices.bin               # csc indices stored as uint32
├── indptr.bin                # csc indptr stored as uint32
├── meta.txt                  # dataset meta data
└── train_set.bin             # trainset node id list as uint32

The graph is stored in "csc" format in the "indptr.bin" and "indices.bin" files.
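
For illustration, a minimal sketch of what these two binaries might contain for a tiny hypothetical graph (assuming, as is common for sampling, that indptr is indexed by destination node and indices lists that node's neighbors; all values are uint32 as described above):

import numpy as np

# Hypothetical 3-node graph with edges 0->1, 2->1 and 0->2, stored as CSC.
indptr  = np.array([0, 0, 2, 3], dtype=np.uint32)  # node 0 has no neighbors, node 1 has two, node 2 has one
indices = np.array([0, 2, 0], dtype=np.uint32)     # neighbors of node 1: {0, 2}; of node 2: {0}
indptr.tofile("indptr.bin")
indices.tofile("indices.bin")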

The metadata file is a text file; the following is an example for the com-friendster dataset.

NUM_NODE      65608366
NUM_EDGE      3612134270
FEAT_DIM      140
NUM_CLASS     100
NUM_TRAIN_SET 1000000
NUM_VALID_SET 200000
NUM_TEST_SET  100000

The "train_set.bin" file is a binary file with trainset node IDs. The user needs to give the "test_set.bin" and "valid_set.bin" files when validating the training results.

There are no "feature" and "label" data in the com-friendster dataset, so the user cannot provide them. The gnnlab system will fill them with random values automatically when loading data from files.

Finally, the user can use the tools in gnnlab to generate the uint64 Vertex ID Format, Cache Rank Table, etc. (link)

The com-friendster dataset on the "SNAP" website is in text format, which is different from the "products" dataset, so we cannot simply reuse the "products" processing code for "com-friendster". Maybe you can create a new script for com-friendster to generate the "indptr.bin", "indices.bin", "meta.txt", "train_set.bin", "test_set.bin" and "valid_set.bin" files (a rough sketch follows the excerpt below).

# Undirected graph: ../../data/output/friendster.txt
# Friendster
# Nodes: 65608366 Edges: 1806067135
# FromNodeId    ToNodeId
101 102
101 104
101 107
...
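
To make the above concrete, here is a rough, hypothetical preprocessing sketch (not part of gnnlab; the file names, the scipy-based CSC construction, and the random train/valid/test split are all assumptions) that turns the SNAP text file into the files listed above:

import numpy as np
import scipy.sparse as sp

RAW_EDGE_FILE = "com-friendster.ungraph.txt"            # SNAP text file (excerpt above)
OUT_DIR = "/graph-learning/samgraph/com-friendster/"
NUM_TRAIN, NUM_VALID, NUM_TEST = 1_000_000, 200_000, 100_000

# Lines starting with '#' are header comments and are skipped.
edges = np.loadtxt(RAW_EDGE_FILE, dtype=np.uint32, comments="#").T

# Undirected graph: store each edge in both directions before building the CSC,
# which is why NUM_EDGE in meta.txt is twice the edge count in the SNAP header.
src = np.concatenate((edges[0], edges[1]))
dst = np.concatenate((edges[1], edges[0]))
num_nodes = int(max(src.max(), dst.max())) + 1

csc = sp.coo_matrix((np.ones(len(src), dtype=np.uint8), (src, dst)),
                    shape=(num_nodes, num_nodes)).tocsc()
csc.indptr.astype(np.uint32).tofile(OUT_DIR + "indptr.bin")
csc.indices.astype(np.uint32).tofile(OUT_DIR + "indices.bin")

# com-friendster has no labels, so the train/valid/test sets are just random node IDs.
perm = np.random.permutation(num_nodes).astype(np.uint32)
perm[:NUM_TRAIN].tofile(OUT_DIR + "train_set.bin")
perm[NUM_TRAIN:NUM_TRAIN + NUM_VALID].tofile(OUT_DIR + "valid_set.bin")
perm[NUM_TRAIN + NUM_VALID:NUM_TRAIN + NUM_VALID + NUM_TEST].tofile(OUT_DIR + "test_set.bin")

with open(OUT_DIR + "meta.txt", "w") as f:
    f.write(f"NUM_NODE      {num_nodes}\n")
    f.write(f"NUM_EDGE      {csc.indices.shape[0]}\n")
    f.write("FEAT_DIM      140\n")
    f.write("NUM_CLASS     100\n")
    f.write(f"NUM_TRAIN_SET {NUM_TRAIN}\n")
    f.write(f"NUM_VALID_SET {NUM_VALID}\n")
    f.write(f"NUM_TEST_SET  {NUM_TEST}\n")
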
lwwlwwl commented 1 year ago

Thanks for your reply! The dataset was processed successfully and the datagen commands now run. I will close the issue.

lwwlwwl commented 1 year ago

I saw that gnnlab supports other datasets. Is there any particular procedure that should be followed? Specifically, I was trying to run ogb-mag240m with the citation edges. The dataset was processed as suggested above. When running the ./32to64 -g command, I got

~/gnnlab/utility/data-process/build# ./32to64 -g mag240m
--graph:  not in {reddit,products,papers100M,com-friendster,uk-2006-05,twitter,sk-2005}
Run with --help for more information.

It seems that the command can only take graph names from that list, so I just renamed the directory containing all the raw files needed for data generation to reddit. The provided tools then successfully generated the uint64 Vertex ID Format, Cache Rank Table, etc. However, following the same steps, when training the dataset with python train_graphsage.py --dataset reddit --num-epoch 5 --batch-size 1000 --num-train-worker 2 --num-sample-worker 1 --fanout 20 15 10, I got

[2023-02-02 05:11:30.344883: F /app/source/fgnn/samgraph/common/cpu/cpu_device.cc:42] Check failed: (ret) == (0) 
Aborted (core dumped)

I thought it might be because "there is insufficient memory available with the requested alignment", as suggested by the documentation for posix_memalign(), but after decreasing the fanout to a single layer of 5, the error was still there.

lixiaobai09 commented 1 year ago

To add a new dataset besides {reddit,products,papers100M,com-friendster,uk-2006-05,twitter,sk-2005}, the following steps may help (using ogb-mag240m as an example).

First, create a new directory in the dataset storage directory (e.g. /graph-learning/samgraph/ogb-mag240m). Then store the dataset files described above (link) in that directory.

For the dataset processing tools, you can add ogb-mag240m at L42 of utility/data-process/common/options.cc (link).

For gnnlab apps, you can add ogb-mag240m in the corresponding file.


As for the core dump problem, could you provide detailed debug information, for example the backtrace in gdb? If you are sure this problem occurs during dataset loading, I suspect it is caused by exceeding the available memory. You can check this by comparing the total size of the ogb-mag240m dataset files (especially the feature storage size; if there is no feat.bin file, the feature size equals NUM_NODE * FEAT_DIM * sizeof(float) from the metafile) with the size of the machine memory, or by using the free command to monitor memory usage while running.
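
For instance, a quick back-of-the-envelope check of the feature footprint using that rule (the node count and dimension below are the published mag240m paper-node figures, used here purely as an example):

num_node = 121_751_666                   # mag240m paper nodes (citation-edge subgraph)
feat_dim = 768                           # mag240m paper feature dimension
bytes_needed = num_node * feat_dim * 4   # float32 = 4 bytes, since there is no feat.bin
print(bytes_needed, "bytes =", round(bytes_needed / 2**30, 1), "GiB")
# -> 374021117952 bytes = 348.3 GiB, which must fit in host memory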

lwwlwwl commented 1 year ago

Thanks for your reply. The backtrace for python train_graphsage.py --dataset twitter --batch-size 1000 --num-train-worker 3 --num-sample-worker 1 --num-epoch 5 is as follows:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f87369ff7f1 in __GI_abort () at abort.c:79
#2  0x00007f863f3cf839 in samgraph::common::LogMessageFatal::~LogMessageFatal (this=0x7ffe9cb37440, __in_chrg=<optimized out>, 
    __vtt_parm=<optimized out>) at /app/source/fgnn/samgraph/common/logging.cc:72
#3  0x00007f863f31b531 in samgraph::common::cpu::CPUDevice::AllocDataSpace (this=<optimized out>, ctx=..., nbytes=<optimized out>, 
    alignment=<optimized out>) at /app/source/fgnn/samgraph/common/cpu/cpu_device.cc:44
#4  0x00007f863f3fa4d7 in samgraph::common::WorkspacePool::Pool::Alloc (scale=1, nbytes=374021120000, device=<optimized out>, ctx=..., 
    this=0x556dfdaffd80) at /app/source/fgnn/samgraph/common/workspace_pool.cc:77
#5  samgraph::common::WorkspacePool::AllocWorkspace (this=0x556dfe622500, ctx=..., size=374021117952, scale=1)
    at /app/source/fgnn/samgraph/common/workspace_pool.cc:177
#6  0x00007f863f3199e9 in samgraph::common::Tensor::EmptyNoScale (dtype=dtype@entry=samgraph::common::kF32, shape=..., ctx=..., name=...)
    at /app/source/fgnn/samgraph/common/common.cc:163
#7  0x00007f863f3c8ae4 in samgraph::common::Engine::LoadGraphDataset (this=this@entry=0x556dfe52a1f0)
    at /app/source/fgnn/samgraph/common/engine.cc:150
#8  0x00007f863f3a9664 in samgraph::common::dist::DistEngine::Init (this=0x556dfe52a1f0)
    at /app/source/fgnn/samgraph/common/dist/dist_engine.cc:113
#9  0x00007f863f3d6998 in samgraph::common::samgraph_data_init () at /app/source/fgnn/samgraph/common/operation.cc:338
#10 0x00007f8735dd0630 in ffi_call_unix64 () from /miniconda3/envs/fgnn_env/lib/python3.8/lib-dynload/../../libffi.so.6
#11 0x00007f8735dcffed in ffi_call () from /miniconda3/envs/fgnn_env/lib/python3.8/lib-dynload/../../libffi.so.6
#12 0x00007f873706584a in _call_function_pointer (argcount=0, resmem=0x7ffe9cb38490, restype=<optimized out>, atypes=<optimized out>, 
    avalues=0x7ffe9cb38480, pProc=0x7f863f3d6910 <samgraph::common::samgraph_data_init()>, flags=4353)
    at /usr/local/src/conda/python-3.8.0/Modules/_ctypes/callproc.c:853
#13 _ctypes_callproc (pProc=<optimized out>, argtuple=0x7f8737176040, flags=<optimized out>, argtypes=0x0, restype=<optimized out>, 
    checker=<optimized out>) at /usr/local/src/conda/python-3.8.0/Modules/_ctypes/callproc.c:1181
#14 0x00007f8737065f63 in PyCFuncPtr_call (self=<optimized out>, inargs=<optimized out>, kwds=0x0)
    at /usr/local/src/conda/python-3.8.0/Modules/_ctypes/_ctypes.c:4135
#15 0x0000556dfaf0831f in _PyObject_MakeTpCall () at /tmp/build/80754af9/python_1573076469108/work/Objects/call.c:159
#16 0x0000556dfaf93b49 in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f863fca3d20, callable=0x7f863fc2b880)
    at /tmp/build/80754af9/python_1573076469108/work/Include/cpython/abstract.h:125
#17 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x556dfb599720)
    at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:4987
#18 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:3469
#19 0x0000556dfaf57ebb in function_code_fastcall (globals=<optimized out>, nargs=1, args=<optimized out>, co=<optimized out>)
    at /tmp/build/80754af9/python_1573076469108/work/Objects/call.c:283
#20 _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, stack=0x7f863fca3b78, func=0x7f863fcdb0d0)
    at /tmp/build/80754af9/python_1573076469108/work/Objects/call.c:410
#21 _PyObject_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, args=0x7f863fca3b78, callable=0x7f863fcdb0d0)
    at /tmp/build/80754af9/python_1573076469108/work/Include/cpython/abstract.h:127
#22 method_vectorcall () at /tmp/build/80754af9/python_1573076469108/work/Objects/classobject.c:60
#23 0x0000556dfaed137e in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f863fca3b80, callable=0x7f863fc2c2c0)
    at /tmp/build/80754af9/python_1573076469108/work/Include/cpython/abstract.h:127
#24 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x556dfb599720)
    at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:4987
#25 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:3469
#26 0x0000556dfaf57b1b in function_code_fastcall (globals=<optimized out>, nargs=1, args=<optimized out>, co=<optimized out>)
    at /tmp/build/80754af9/python_1573076469108/work/Objects/call.c:283
#27 _PyFunction_Vectorcall.localalias.357 () at /tmp/build/80754af9/python_1573076469108/work/Objects/call.c:410
#28 0x0000556dfaed13cc in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f87370b9970, callable=0x7f863fc2f310)
    at /tmp/build/80754af9/python_1573076469108/work/Include/cpython/abstract.h:127
#29 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x556dfb599720)
    at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:4987
#30 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:3500
#31 0x0000556dfaf56c32 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:4298
#32 0x0000556dfaf57a04 in PyEval_EvalCodeEx () at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:4327
#33 0x0000556dfafeaf5c in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>)
    at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:718
#34 0x0000556dfafeb004 in run_eval_code_obj () at /tmp/build/80754af9/python_1573076469108/work/Python/pythonrun.c:1117
#35 0x0000556dfb020304 in run_mod () at /tmp/build/80754af9/python_1573076469108/work/Python/pythonrun.c:1139
#36 0x0000556dfb022c50 in PyRun_FileExFlags () at /tmp/build/80754af9/python_1573076469108/work/Python/pythonrun.c:1055
#37 0x0000556dfb022e41 in PyRun_SimpleFileExFlags () at /tmp/build/80754af9/python_1573076469108/work/Python/pythonrun.c:428
#38 0x0000556dfb023983 in pymain_run_file (cf=0x7ffe9cb38d68, config=0x556dfb598c00)
    at /tmp/build/80754af9/python_1573076469108/work/Modules/main.c:381
#39 pymain_run_python (exitcode=0x7ffe9cb38d60) at /tmp/build/80754af9/python_1573076469108/work/Modules/main.c:565
#40 Py_RunMain () at /tmp/build/80754af9/python_1573076469108/work/Modules/main.c:644
#41 0x0000556dfb023bc9 in Py_BytesMain () at /tmp/build/80754af9/python_1573076469108/work/Modules/main.c:698
#42 0x00007f87369e0c87 in __libc_start_main (main=0x556dfaee0720 <main>, argc=12, argv=0x7ffe9cb38f58, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe9cb38f48) at ../csu/libc-start.c:310
#43 0x0000556dfaface13 in _start () at ../sysdeps/x86_64/elf/start.S:103
lixiaobai09 commented 1 year ago

Did you change the device context ID? It seems you passed a wrong CPU context in the running code; see frame #3 of the backtrace (code file).

lwwlwwl commented 1 year ago

I did not change the device context ID. I'm not sure why it shows line 44 being triggered instead of Check failed: (ret) == (0) at line 42, as the error message says. Is there any way I can check whether I passed a wrong CPU context?

lixiaobai09 commented 1 year ago

You can use git diff ... to compare your local code with the newest version of GNNLab. That way, maybe you can find the modification that causes this problem. If you cannot tell which modification causes it from the git diff result, you can attach the git diff output to your next question.

lwwlwwl commented 1 year ago

Neither git diff ... nor git diff in the gnnlab directory returns any output, so I think there are no modifications on my side.

FYI, I'm using a Docker environment where there is a default repo in /app/source/fgnn (based on the backtrace above, it looks like that is the one being used?), and I cloned the latest code from GitHub to ~/gnnlab and run the training script from there.

molamooo commented 1 year ago

Can you print out ctx.device_id at frame 3?

lwwlwwl commented 1 year ago

Sorry I'm not very familiar with using gdb. It shows <optimized out>

(gdb) select-frame 3
(gdb) p ctx.device_id
$1 = <optimized out>
(gdb) info frame
Stack level 3, frame at 0x7ffe69d2cd00:
 rip = 0x7f352d160531 in samgraph::common::cpu::CPUDevice::AllocDataSpace
    (/app/source/fgnn/samgraph/common/cpu/cpu_device.cc:44); saved rip = 0x7f352d23f4d7
 called by frame at 0x7ffe69d2cd90, caller of frame at 0x7ffe69d2cb50
 source language c++.
 Arglist at 0x7ffe69d2cb48, args: this=<optimized out>, ctx=..., nbytes=<optimized out>, 
    alignment=<optimized out>
 Locals at 0x7ffe69d2cb48, Previous frame's sp is 0x7ffe69d2cd00
 Saved registers:
  rbx at 0x7ffe69d2cce0, rbp at 0x7ffe69d2cce8, r12 at 0x7ffe69d2ccf0,
  rip at 0x7ffe69d2ccf8
molamooo commented 1 year ago

Then you can print ctx.device_id to console before /app/source/fgnn/samgraph/common/cpu/cpu_device.cc:44, recompile the code and see what happens.

molamooo commented 1 year ago

I saw you created issue #7, where you modified code to run gnnlab. Are you sure there's no output from git diff? Can you verify which binary you are running: the gnnlab under /app or the one under ~?

lwwlwwl commented 1 year ago

Then you can print ctx.device_id to console before /app/source/fgnn/samgraph/common/cpu/cpu_device.cc:44, recompile the code and see what happens.

It keeps showing <optimized out> as the value for ctx.device_id. I'm not really sure how to follow this.

I saw you created issue #7, where you modified code to run gnnlab. Are you sure there's no output from git diff? Can you verify which binary you are running: the gnnlab under /app or the one under ~?

Yes, there is no output from git diff. For the current Docker image, I changed the value of mq_size to 3300 in memory_queue.h (did not change the mq_nbytes value) and then built the image. After entering the container, I cloned the latest repo to ~, then cd'd to ~/gnnlab/example/samgraph/multi_gpu/ and ran train_graphsage.py.

molamooo commented 1 year ago

It keeps showing <optimized out> as the value for ctx.device_id. I'm not really sure how to follow this.

I mean modify the source code, using cout or LOG(ERROR) to print the value to console.

Yes, there is no output from git diff. For the current Docker image, I changed the value of mq_size to 3300 in memory_queue.h (did not change the mq_nbytes value) and then built the image. After entering the container, I cloned the latest repo to ~, then cd'd to ~/gnnlab/example/samgraph/multi_gpu/ and ran train_graphsage.py.

Then what's the result of git diff of gnnlab outside Docker or in /app?

Also, does this abort appear only on your custom dataset, or also on the previously provided datasets?

lwwlwwl commented 1 year ago

I mean modify the source code, using cout or LOG(ERROR) to print the value to console.

Do I need to modify the source code, rebuild the docker and rerun the command?

Then what's the result of git diff of gnnlab outside Docker or in /app?

git diff in /app/source/fgnn complains not a git repository

This only appears for mag240m. I tried products, papers100M and com-friendster; these all worked fine.

molamooo commented 1 year ago

There are too many copies of gnnlab in your Docker environment now. There's (a) one outside your Docker where you modified mq_size, (b) a copy of the modified version under the /app path of the Docker image, created when you built the image, and (c) a freshly cloned one at ~ inside the container. If you have only built the image and did not manually build gnnlab using ./build.sh or python setup.py xxx, then the binary you are running is compiled from (b). This may lead to strange behavior, since the C++ files, binary and installed Python module come from (b) while the Python scripts come from (c).

I'm not sure if the Docker copy preserves the .git folder, since:

git diff in /app/source/fgnn complains not a git repository

so it's also not easy to identify if you made additional changes to the source code.

I suggest you check and report the output of git diff outside your Docker, since the binary you are running is likely compiled from that same copy of the source code. Then apply the necessary modifications to your gnnlab source code in ~ of the container, and use ./build.sh to recompile and install the source code. This overrides both the binary and the Python module in your container from (a)/(b) to (c). This should make our discussion clearer and make it easier to identify the source of the problem.

P.S. If you want to preserve modifications across the container's lifetime, you may need to recreate the container from the existing image and add a volume mapping.

lwwlwwl commented 1 year ago

git diff outside the Docker shows the only difference is the change to mq_size. I changed that value in (c) and ran ./build.sh. All three versions should be the same now.

The value of ctx.device_id is 1.

molamooo commented 1 year ago

That's strange. 1 should be covered by the previous if statement. Can you reproduce the abort's backtrace? The abort line may mismatch due to the multiple copies.


lwwlwwl commented 1 year ago
[2023-02-10 03:11:31.118232: F /root/gnnlab/samgraph/common/cpu/cpu_device.cc:44] Check failed: (ret) == (0) 

Thread 1 "python" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f88a9a807f1 in __GI_abort () at abort.c:79
#2  0x00007f87b2450949 in samgraph::common::LogMessageFatal::~LogMessageFatal (this=0x7ffd8ec05370, 
    __in_chrg=<optimized out>, __vtt_parm=<optimized out>) at /root/gnnlab/samgraph/common/logging.cc:72
#3  0x00007f87b239c60f in samgraph::common::cpu::CPUDevice::AllocDataSpace (this=<optimized out>, ctx=..., 
    nbytes=374021120000, alignment=64) at /root/gnnlab/samgraph/common/cpu/cpu_device.cc:46
#4  0x00007f87b247b5e7 in samgraph::common::WorkspacePool::Pool::Alloc (scale=1, nbytes=374021120000, 
    device=<optimized out>, ctx=..., this=0x5614ded63d70) at /root/gnnlab/samgraph/common/workspace_pool.cc:77
#5  samgraph::common::WorkspacePool::AllocWorkspace (this=0x5614df8864f0, ctx=..., size=374021117952, scale=1)
    at /root/gnnlab/samgraph/common/workspace_pool.cc:177
#6  0x00007f87b239aa19 in samgraph::common::Tensor::EmptyNoScale (dtype=dtype@entry=samgraph::common::kF32, 
    shape=..., ctx=..., name=...) at /root/gnnlab/samgraph/common/common.cc:163
#7  0x00007f87b2449bf4 in samgraph::common::Engine::LoadGraphDataset (this=this@entry=0x5614df78e1e0)
    at /root/gnnlab/samgraph/common/engine.cc:150
#8  0x00007f87b242a774 in samgraph::common::dist::DistEngine::Init (this=0x5614df78e1e0)
    at /root/gnnlab/samgraph/common/dist/dist_engine.cc:113
#9  0x00007f87b2457aa8 in samgraph::common::samgraph_data_init () at /root/gnnlab/samgraph/common/operation.cc:338
#10 0x00007f88a8e51630 in ffi_call_unix64 ()
   from /miniconda3/envs/fgnn_env/lib/python3.8/lib-dynload/../../libffi.so.6
#11 0x00007f88a8e50fed in ffi_call () from /miniconda3/envs/fgnn_env/lib/python3.8/lib-dynload/../../libffi.so.6
#12 0x00007f88aa0e784a in _call_function_pointer (argcount=0, resmem=0x7ffd8ec063e0, restype=<optimized out>, 
    atypes=<optimized out>, avalues=0x7ffd8ec063d0, pProc=0x7f87b2457a20 <samgraph::common::samgraph_data_init()>, 
    flags=4353) at /usr/local/src/conda/python-3.8.0/Modules/_ctypes/callproc.c:853
#13 _ctypes_callproc (pProc=<optimized out>, argtuple=0x7f88aa1f8040, flags=<optimized out>, argtypes=0x0, 
    restype=<optimized out>, checker=<optimized out>)
    at /usr/local/src/conda/python-3.8.0/Modules/_ctypes/callproc.c:1181
#14 0x00007f88aa0e7f63 in PyCFuncPtr_call (self=<optimized out>, inargs=<optimized out>, kwds=0x0)
    at /usr/local/src/conda/python-3.8.0/Modules/_ctypes/_ctypes.c:4135
#15 0x00005614dc49731f in _PyObject_MakeTpCall ()
    at /tmp/build/80754af9/python_1573076469108/work/Objects/call.c:159
#16 0x00005614dc522b49 in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f87b2d23d20, 
    callable=0x7f87b2cac880) at /tmp/build/80754af9/python_1573076469108/work/Include/cpython/abstract.h:125
#17 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x5614dc7fd720)
    at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:4987
#18 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:3469
#19 0x00005614dc4e6ebb in function_code_fastcall (globals=<optimized out>, nargs=1, args=<optimized out>, 
    co=<optimized out>) at /tmp/build/80754af9/python_1573076469108/work/Objects/call.c:283
#20 _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, stack=0x7f87b2d23b78, 
    func=0x7f87b2d5c0d0) at /tmp/build/80754af9/python_1573076469108/work/Objects/call.c:410
#21 _PyObject_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, args=0x7f87b2d23b78, 
    callable=0x7f87b2d5c0d0) at /tmp/build/80754af9/python_1573076469108/work/Include/cpython/abstract.h:127
#22 method_vectorcall () at /tmp/build/80754af9/python_1573076469108/work/Objects/classobject.c:60
#23 0x00005614dc46037e in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f87b2d23b80, 
    callable=0x7f87b2ca7cc0) at /tmp/build/80754af9/python_1573076469108/work/Include/cpython/abstract.h:127
#24 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x5614dc7fd720)
    at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:4987
#25 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
    at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:3469
#26 0x00005614dc4e6b1b in function_code_fastcall (globals=<optimized out>, nargs=1, args=<optimized out>, 
    co=<optimized out>) at /tmp/build/80754af9/python_1573076469108/work/Objects/call.c:283
#27 _PyFunction_Vectorcall.localalias.357 () at /tmp/build/80754af9/python_1573076469108/work/Objects/call.c:410
#28 0x00005614dc4603cc in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f88aa13b970, 
    callable=0x7f87b2cae310) at /tmp/build/80754af9/python_1573076469108/work/Include/cpython/abstract.h:127
#29 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x5614dc7fd720)
    at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:4987
#30 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
    at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:3500
#31 0x00005614dc4e5c32 in _PyEval_EvalCodeWithName ()
    at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:4298
#32 0x00005614dc4e6a04 in PyEval_EvalCodeEx () at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:4327
#33 0x00005614dc579f5c in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>)
    at /tmp/build/80754af9/python_1573076469108/work/Python/ceval.c:718
#34 0x00005614dc57a004 in run_eval_code_obj ()
    at /tmp/build/80754af9/python_1573076469108/work/Python/pythonrun.c:1117
#35 0x00005614dc5af304 in run_mod () at /tmp/build/80754af9/python_1573076469108/work/Python/pythonrun.c:1139
#36 0x00005614dc5b1c50 in PyRun_FileExFlags ()
    at /tmp/build/80754af9/python_1573076469108/work/Python/pythonrun.c:1055
#37 0x00005614dc5b1e41 in PyRun_SimpleFileExFlags ()
    at /tmp/build/80754af9/python_1573076469108/work/Python/pythonrun.c:428
#38 0x00005614dc5b2983 in pymain_run_file (cf=0x7ffd8ec06cb8, config=0x5614dc7fcc00)
    at /tmp/build/80754af9/python_1573076469108/work/Modules/main.c:381
#39 pymain_run_python (exitcode=0x7ffd8ec06cb0) at /tmp/build/80754af9/python_1573076469108/work/Modules/main.c:565
#40 Py_RunMain () at /tmp/build/80754af9/python_1573076469108/work/Modules/main.c:644
#41 0x00005614dc5b2bc9 in Py_BytesMain () at /tmp/build/80754af9/python_1573076469108/work/Modules/main.c:698
#42 0x00007f88a9a61c87 in __libc_start_main (main=0x5614dc46f720 <main>, argc=12, argv=0x7ffd8ec06ea8, 
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffd8ec06e98)
    at ../csu/libc-start.c:310
#43 0x00005614dc53be13 in _start () at ../sysdeps/x86_64/elf/start.S:103

Line 44 corresponds to CHECK_EQ(ret, 0); line 46 corresponds to CHECK(false);

molamooo commented 1 year ago

I don't understand why this happens… Can you show me how you added the cout? Also, to avoid unnoticed glitches, I suggest you build with "./build.sh -c", which cleans the temporary output first.

lwwlwwl commented 1 year ago
void *CPUDevice::AllocDataSpace(Context ctx, size_t nbytes, size_t alignment) {
  void *ptr;
  std::cout << "value of ctx.device_id: " << ctx.device_id << std::endl;
  if (ctx.device_id == CPU_CUDA_HOST_MALLOC_DEVICE) {
    CUDA_CALL(cudaHostAlloc(&ptr, nbytes, cudaHostAllocDefault));
  } else if (ctx.device_id == CPU_CLIB_MALLOC_DEVICE) {
    int ret = posix_memalign(&ptr, alignment, nbytes);
    CHECK_EQ(ret, 0);
  } else {
    CHECK(false);
  }

  return ptr;
}

Built with ./build.sh -c; bt and frame 3 gave the same info.

molamooo commented 1 year ago

I think I know. The else branch's abort is combined with the second branch's abort due to compiler optimization; that's why the backtrace says line 46. The actual abort location is line 44, as the abort message reports, which I didn't notice. The error should be caused by allocating memory beyond capacity, since mag240m has a significantly large feature dimension.

molamooo commented 1 year ago

The original MAG240M uses float16 features, while gnnlab only supports float32 at the moment. Extending gnnlab to support fp16 would lower the memory occupation. The simplest method is to halve the dimension in meta.txt, then force-overwrite the dtype to fp16 and double the dimension when passing the extracted features to torch in "operation.cc" (see the sketch below).
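
A tiny numpy sketch of why that byte-level trick works (purely illustrative; in gnnlab the reinterpretation would happen on the C++ side in "operation.cc"):

import numpy as np

true_dim = 8                                               # real fp16 feature dimension
feat_fp16 = np.random.rand(4, true_dim).astype(np.float16)

# Viewing the same bytes as float32 halves the apparent dimension, which is what
# halving FEAT_DIM in meta.txt exploits: gnnlab moves the data around as
# (num_nodes, true_dim // 2) float32 tensors without knowing it is really fp16.
as_fp32 = feat_fp16.view(np.float32)
assert as_fp32.shape == (4, true_dim // 2)

# Before handing the extracted buffer to torch, view it back as fp16 and the
# original dimension and values reappear.
assert np.array_equal(as_fp32.view(np.float16), feat_fp16)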

lwwlwwl commented 1 year ago

Thanks! It is working now