accel-sim / accel-sim-framework

This is the top-level repository for the Accel-Sim framework.
https://accel-sim.github.io

SIGSEGV when running nvbit on simpletransformer python program #258

Open Wen-Tian-Pineapple opened 9 months ago

Wen-Tian-Pineapple commented 9 months ago

Hello, I was trying to use Accel-Sim to evaluate the training/evaluation process of a Python transformer model. (The Python file is shown below.)

[screenshot: the Python training/evaluation script]
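
The script itself is only available as a screenshot. For reference, a minimal sketch of what a simpletransformers training/evaluation script of this kind might look like is shown below; the model, data, and arguments here are assumptions, not the original code.

```python
# toy_transformer.py -- hypothetical sketch only; the original script from the
# screenshot is not preserved, so model choice, data, and arguments are guesses.
import pandas as pd
from simpletransformers.classification import ClassificationModel, ClassificationArgs

# Tiny in-memory dataset (~1 KB), standing in for whatever the original used.
train_df = pd.DataFrame(
    [["this movie was great", 1], ["this movie was terrible", 0]] * 8,
    columns=["text", "labels"],
)

args = ClassificationArgs(num_train_epochs=1, train_batch_size=8, overwrite_output_dir=True)
model = ClassificationModel("bert", "bert-base-uncased", args=args, use_cuda=True)

model.train_model(train_df)                            # training pass (forward + backward kernels)
result, outputs, wrong = model.eval_model(train_df)    # evaluation pass (forward only)
print(result)
```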

When I ran nvbit on this Python program to collect the trace, I hit a segmentation fault. I tracked it with gdb; the backtrace is shown below.

[screenshot: gdb backtrace at the segmentation fault]

Does anyone have any idea what the problem might be? (It's not because the disk is full.) Thanks!

JRPan commented 9 months ago

Maybe the memory is full?

Well, it looks like you got most of the traces and the inference is done. The kernels before 2209 should be complete, and kernel-2209 itself may or may not be complete. You probably won't be simulating the entire network, so I guess you can just ignore that?

You do need to run post-processing on the traces manually, though. Check run.sh in the trace folder. You may also want to delete kernel 2209 from the kernelslist, as I'm not sure whether that kernel is complete.

But we'll look into it. Thanks for reporting that!
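
For reference, dropping the possibly-incomplete kernel entry before post-processing can be done by hand or with a small script along these lines. This is only a sketch; it assumes the kernelslist is a plain-text file with one entry per line and that per-kernel entries embed the kernel id in the file name (e.g. kernel-2209), which may differ between tracer versions.

```python
# drop_last_kernel.py -- hypothetical helper, not part of Accel-Sim itself.
# Assumes 'kernelslist' is a plain-text file with one entry per line and that
# per-kernel entries contain the kernel id in their name (e.g. "kernel-2209.traceg").
from pathlib import Path

KERNEL_TO_DROP = "kernel-2209"             # the possibly-incomplete kernel
kernelslist = Path("traces/kernelslist")   # adjust to your trace directory

lines = kernelslist.read_text().splitlines()
kept = [line for line in lines if KERNEL_TO_DROP not in line]

kernelslist.with_suffix(".bak").write_text("\n".join(lines) + "\n")  # keep a backup
kernelslist.write_text("\n".join(kept) + "\n")
print(f"removed {len(lines) - len(kept)} entries mentioning {KERNEL_TO_DROP}")
```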

Wen-Tian-Pineapple commented 9 months ago

@JRPan Thanks for the reply, and you are right. It looks like there is a trace of about 62 GB that fills up RAM and causes the segmentation fault. I'm not sure why that trace is so large compared to the others, since the training process only runs on 1 KB of data. Maybe that trace contains all the weight values downloaded from the online transformer model.

JRPan commented 9 months ago

No, traces do not include any data. It's just that the kernel is large and we save all traces in plain text.

Wen-Tian-Pineapple commented 9 months ago

Hmm, then why is that particular trace so large while the others are less than 200 MB? It's just a dummy (toy) transformer PyTorch training program. Do you have any idea or gut feeling?

JRPan commented 9 months ago

This is completely normal. Some layers are larger, some are smaller; some layers can be just a ReLU. Different layers invoke different kernels. You can check the kernel name to guess what it does.

But again, this is expected. For example, a CNN kernel is much larger than a ReLU kernel.
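
To connect kernel names with trace sizes, and to see how often each kernel name recurs across training iterations, a small script along the following lines can help. It is only a sketch: it assumes per-kernel trace files are named kernel-<id>.traceg under a traces/ directory and begin with a "-kernel name = ..." header line, which may vary between tracer versions.

```python
# trace_sizes.py -- hypothetical helper for eyeballing which kernels dominate the traces.
# Assumes per-kernel trace files are named "kernel-<id>.traceg" and start with a
# "-kernel name = <mangled_name>" header line; adjust the glob/prefix to your setup.
import glob
import os
from collections import defaultdict

totals = defaultdict(lambda: [0, 0])  # kernel name -> [total bytes, launch count]

for path in glob.glob("traces/kernel-*.traceg"):
    name = "<unknown>"
    with open(path) as f:
        for line in f:
            if line.startswith("-kernel name"):
                name = line.split("=", 1)[1].strip()
                break
    totals[name][0] += os.path.getsize(path)
    totals[name][1] += 1

for name, (size, count) in sorted(totals.items(), key=lambda kv: -kv[1][0]):
    print(f"{size / 2**30:8.2f} GiB  x{count:<4d}  {name}")
```

Sorting by total size makes it easy to see whether a few GEMM/convolution-style kernels dominate the trace, and the launch count shows how often the same kernel reappears across forward/backward iterations.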

Wen-Tian-Pineapple commented 9 months ago

Understood, I will definitely check, thanks for the insight!

rodhuega commented 9 months ago

I used an external HDD as swap space to be able to capture traces bigger than the machine's RAM. It is slower, but it works.

Wen-Tian-Pineapple commented 9 months ago

@JRPan Could you please elaborate a little more on that, or let me know where I can find that information (the relationship between kernel names and layer names)? I was seeing some kernels named vectorized_element which are small, and some kernels named quantization which are large (10 GB). Would a CNN kernel be named just CNN? I really appreciate the help! By the way, for training an ML model, since the program does forward and backward propagation through a kernel multiple times, does that mean the same layer (maybe with a similar kernel name) would appear multiple times in different traces?

Wen-Tian-Pineapple commented 9 months ago

@rodhuega, thanks for the insight! I managed to make it work.