ergo-services / ergo

An actor-based Framework with network transparency for creating event-driven architecture in Golang. Inspired by Erlang. Zero dependencies.
https://docs.ergo.services
MIT License

Compilation to arm fails #102

Closed finalclass closed 2 years ago

finalclass commented 2 years ago

Describe the bug Compiling to ARM does not work

To Reproduce

$ GOARCH=arm go build

Expected behavior Compilation works fine

Actual behaviour This error is displayed:

# github.com/ergo-services/ergo/lib/osdep
../../go/pkg/mod/github.com/ergo-services/ergo@v1.999.210/lib/osdep/linux.go:15:11: invalid operation: usage.Utime.Sec * 1000000000 + usage.Utime.Nano() (mismatched types int32 and int64)
../../go/pkg/mod/github.com/ergo-services/ergo@v1.999.210/lib/osdep/linux.go:16:11: invalid operation: usage.Stime.Sec * 1000000000 + usage.Stime.Nano() (mismatched types int32 and int64)
# github.com/ergo-services/ergo/lib
../../go/pkg/mod/github.com/ergo-services/ergo@v1.999.210/lib/tools.go:166:11: cannot use 4294967000 (untyped int constant) as int value in assignment (overflows)

Environment (please complete the following information):

Additional context Removing GOARCH fixes it; however, I need to run a few services on ARM.

halturin commented 2 years ago

Thanks for the report. I have tested it on arm64 only. Will take a look

finalclass commented 2 years ago

Thanks for your response. Indeed, it compiles fine for arm64. However, when I try to run it on an Asus Tinker Edge T, it hangs and after a while the board shuts off. To reproduce:

cd ergo/examples/simple
GOARCH=arm64 go build simple.go
# then copy the binary to the arm64 device with scp

on the arm64 device:

./simple

The result: it hangs. I have to restart the device. Occasionally I get this error log:

Message from syslogd@vexing-eft at Apr 22 08:13:22 ...
 kernel:[   64.830251] Internal error: undefined instruction: 0 [#1] PREEMPT SMP

Message from syslogd@vexing-eft at Apr 22 08:13:22 ...
 kernel:[   64.964868] Process simple (pid: 4048, stack limit = 0xffff000013498000)

Message from syslogd@vexing-eft at Apr 22 08:13:22 ...
 kernel:[   65.199004] Code: d2800014 54000be1 b9404004 f9401c03 (23232323)

However usually there is no log at all.

I debugged it a little and found that it hangs on this line: https://github.com/ergo-services/ergo/blob/79bebaa/proto/dist/resolver.go#L225 However, the value of the dsn variable seems to be correct: localhost:4369

On my local PC it works fine (but I have Erlang installed); on the ARM machine it does not. Could you give me some clues? I would really appreciate it.

halturin commented 2 years ago

To be specific, I have tested it on aarch64 and all tests passed. I will try to find the same hardware. It's a bit difficult to resolve this without access to a similar environment.


halturin commented 2 years ago

Sorry for the delay. I still don't have anything similar to your device.

May I ask you to try these fixes? In ergo-services/ergo@v1.999.210/lib/tools.go:166, replace:

  limit = 4294967000

with

  limit = math.MaxInt

This also requires adding the "math" package to the import section.
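To illustrate why the original constant fails on 32-bit targets: with GOARCH=arm, int is 32 bits wide, so 4294967000 does not fit. The sketch below is a standalone illustration, not ergo code; fitsInInt is a hypothetical helper, and math.MaxInt requires Go 1.17 or newer.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// fitsInInt (hypothetical helper) reports whether v can be stored in a signed
// integer that is `bits` wide. Only 32 and 64 are meaningful here.
func fitsInInt(v int64, bits int) bool {
	if bits == 32 {
		return v >= math.MinInt32 && v <= math.MaxInt32
	}
	return true
}

func main() {
	const oldLimit int64 = 4294967000 // the constant from lib/tools.go:166

	fmt.Println("int size on this build:", strconv.IntSize)
	fmt.Println("old limit fits in a 32-bit int:", fitsInInt(oldLimit, 32))
	fmt.Println("old limit fits in a 64-bit int:", fitsInInt(oldLimit, 64))

	// math.MaxInt always fits, because it is defined relative to the
	// platform's int size.
	limit := math.MaxInt
	fmt.Println("portable limit:", limit)
}
```

Note that math.MaxInt is a much larger value than 4294967000 on 64-bit builds, so whether that change in magnitude matters depends on how the limit is used.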

Then update the ResourceUsage function in ergo-services/ergo@v1.999.210/lib/osdep/linux.go with this code:

    var usage syscall.Rusage
    var utime, time int64
    if err := syscall.Getrusage(syscall.RUSAGE_SELF, &usage); err == nil {
        utime = int64(usage.Utime.Sec)*1000000000 + usage.Utime.Nano()
        stime = int64(usage.Stime.Sec)*1000000000 + usage.Stime.Nano()
    }
    return utime, stime

Or you can try this branch https://github.com/ergo-services/ergo/tree/fixarm

finalclass commented 2 years ago

Sure, I will try it out. However, I'm out of office currently and will be able to check it early next week.

finalclass commented 2 years ago

Hi, unfortunately this does not seem to help. I've added a go.mod to ergo/examples/simple:

module simpl.com/simple

go 1.18

replace github.com/ergo-services/ergo => ../../
replace github.com/ergo-services/ergo/etf => ../../etf
replace github.com/ergo-services/ergo/gen => ../../gen
replace github.com/ergo-services/ergo/node => ../../node
require github.com/ergo-services/ergo v1.999.211 // indirect

and then I made the fixes you mentioned. There was one mistake in this line:

var utime, time int64

I assume it should be:

var utime, stime int64

I built everything and ran the binary again, but it hung the same way it did last time.

halturin commented 2 years ago

so there were a few issues

It seems I need to get the same HW (or a similar one) or a way to run it somehow in the VM (I have no clue how to do this so far).

finalclass commented 2 years ago

Yes the compilation issue has been fixed (actually I was compiling to the wrong target: it should be arm64, not just arm). The running part appears to still be a problem. Let me know if you wish me to test anything else on the device.

halturin commented 2 years ago

I finally bought this device. Waiting for the shipment.

halturin commented 2 years ago

I've just tested the simple example on my Tinker board. No issues (master branch). (image attached)

Could you please check on your board the same way? You may also want to add -ergo.trace to see extra debug info. Like this...

mendel@tinker:~$ ./simple -ergo.trace
2022/07/27 10:09:15 Start node with name "node@localhost" and cookie "cookies"
2022/07/27 10:09:15 Node listening range: 15000...65000
2022/07/27 10:09:15 Started embedded EMPD service and listen port: 4369
2022/07/27 10:09:15 EPMD accepted new connection from [::1]:46830
2022/07/27 10:09:15 Request from EPMD client: [0 22 120 58 152 77 0 0 6 0 5 0 4 110 111 100 101 0 4 17 59 1 0 0]
2022/07/27 10:09:15 [node@localhost] EPMD client: node registered
2022/07/27 10:09:15 [node@localhost] CORE registering behavior "erlang" in group "ergo:applications"
2022/07/27 10:09:15 [node@localhost] CORE registering process: <2C323E75.0.1001>
2022/07/27 10:09:15 [node@localhost] CORE spawn a new process <2C323E75.0.1001> (registered name: "")
2022/07/27 10:09:15 [node@localhost] CORE registering name (<2C323E75.0.1002>): net_kernel_sup
2022/07/27 10:09:15 [node@localhost] CORE registering process: <2C323E75.0.1002>
2022/07/27 10:09:15 [node@localhost] CORE spawn a new process <2C323E75.0.1002> (registered name: "net_kernel_sup")
2022/07/27 10:09:15 [node@localhost] SUPERVISOR "net_kernel_sup" with restart strategy: one_for_one[permanent]
2022/07/27 10:09:15 [node@localhost] CORE registering name (<2C323E75.0.1003>): net_kernel
2022/07/27 10:09:15 [node@localhost] CORE registering process: <2C323E75.0.1003>
2022/07/27 10:09:15 [node@localhost] CORE spawn a new process <2C323E75.0.1003> (registered name: "net_kernel")
2022/07/27 10:09:15 NET_KERNEL: Init: []etf.Term(nil)
2022/07/27 10:09:15 [node@localhost] LINK process: <2C323E75.0.1002> => <2C323E75.0.1003>
2022/07/27 10:09:15 [node@localhost] CORE registering name (<2C323E75.0.1004>): global_name_server
2022/07/27 10:09:15 [node@localhost] CORE registering process: <2C323E75.0.1004>
2022/07/27 10:09:15 [node@localhost] CORE spawn a new process <2C323E75.0.1004> (registered name: "global_name_server")
2022/07/27 10:09:15 [node@localhost] LINK process: <2C323E75.0.1002> => <2C323E75.0.1004>
2022/07/27 10:09:15 [node@localhost] CORE registering name (<2C323E75.0.1005>): rex
2022/07/27 10:09:15 [node@localhost] CORE registering process: <2C323E75.0.1005>
2022/07/27 10:09:15 [node@localhost] CORE spawn a new process <2C323E75.0.1005> (registered name: "rex")
2022/07/27 10:09:15 REX: Init: []etf.Term(nil)
2022/07/27 10:09:15 [node@localhost] CORE registering behavior "erpc" in group "ergo:remote"
2022/07/27 10:09:15 [node@localhost] LINK process: <2C323E75.0.1002> => <2C323E75.0.1005>
2022/07/27 10:09:15 [node@localhost] CORE registering name (<2C323E75.0.1006>): observer_backend
2022/07/27 10:09:15 [node@localhost] CORE registering process: <2C323E75.0.1006>
2022/07/27 10:09:15 [node@localhost] CORE spawn a new process <2C323E75.0.1006> (registered name: "observer_backend")
2022/07/27 10:09:15 OBSERVER: Init: []etf.Term(nil)
2022/07/27 10:09:15 [node@localhost] RPC provide: proc_lib:translate_initial_call (gen.RPC)(0x1fc500)
2022/07/27 10:09:15 [node@localhost] RPC provide: appmon_info:start_link2 (gen.RPC)(0x1fc740)
2022/07/27 10:09:15 [node@localhost] LINK process: <2C323E75.0.1002> => <2C323E75.0.1006>
2022/07/27 10:09:15 [node@localhost] CORE registering name (<2C323E75.0.1007>): erlang
2022/07/27 10:09:15 [node@localhost] CORE registering process: <2C323E75.0.1007>
2022/07/27 10:09:15 [node@localhost] CORE spawn a new process <2C323E75.0.1007> (registered name: "erlang")
2022/07/27 10:09:15 ERLANG: Init: []etf.Term(nil)
2022/07/27 10:09:15 [node@localhost] LINK process: <2C323E75.0.1002> => <2C323E75.0.1007>
2022/07/27 10:09:15 [node@localhost] LINK process: <2C323E75.0.1001> => <2C323E75.0.1002>
2022/07/27 10:09:15 [node@localhost] CORE registering name (<2C323E75.0.1008>): gs1
2022/07/27 10:09:15 [node@localhost] CORE registering process: <2C323E75.0.1008>
2022/07/27 10:09:15 [node@localhost] CORE spawn a new process <2C323E75.0.1008> (registered name: "gs1")
2022/07/27 10:09:15 [node@localhost] CORE route message by pid (local) <2C323E75.0.1008>
2022/07/27 10:09:15 [node@localhost] GEN_SERVER <2C323E75.0.1008> got message from <2C323E75.0.1008>
2022/07/27 10:09:15 m: 100
HandleInfo: 100
2022/07/27 10:09:16 [node@localhost] CORE route message by pid (local) <2C323E75.0.1008>
2022/07/27 10:09:16 [node@localhost] GEN_SERVER <2C323E75.0.1008> got message from <2C323E75.0.1008>
2022/07/27 10:09:16 m: 101
HandleInfo: 101
2022/07/27 10:09:17 [node@localhost] CORE route message by pid (local) <2C323E75.0.1008>
2022/07/27 10:09:17 [node@localhost] GEN_SERVER <2C323E75.0.1008> got message from <2C323E75.0.1008>
2022/07/27 10:09:17 m: 102
HandleInfo: 102
2022/07/27 10:09:18 [node@localhost] CORE route message by pid (local) <2C323E75.0.1008>
2022/07/27 10:09:18 [node@localhost] GEN_SERVER <2C323E75.0.1008> got message from <2C323E75.0.1008>
2022/07/27 10:09:18 m: 103
HandleInfo: 103
2022/07/27 10:09:19 [node@localhost] CORE route message by pid (local) <2C323E75.0.1008>
2022/07/27 10:09:19 [node@localhost] GEN_SERVER <2C323E75.0.1008> got message from <2C323E75.0.1008>
2022/07/27 10:09:19 m: 104
HandleInfo: 104
2022/07/27 10:09:20 [node@localhost] CORE route message by pid (local) <2C323E75.0.1008>
2022/07/27 10:09:20 [node@localhost] GEN_SERVER <2C323E75.0.1008> got message from <2C323E75.0.1008>
2022/07/27 10:09:20 m: 105
HandleInfo: 105
2022/07/27 10:09:20 [node@localhost] CORE unregistering process: <2C323E75.0.1008>
2022/07/27 10:09:20 [node@localhost] CORE unregistering name (<2C323E75.0.1008>): gs1
2022/07/27 10:09:20 [node@localhost] MONITOR process terminated: <2C323E75.0.1008>
exited
2022/07/27 10:09:20 accept tcp 127.0.0.1:15000: use of closed network connection
mendel@tinker:~$

finalclass commented 2 years ago

Sorry for the late answer. That's quite strange, because for me it still does not work. I don't have any customizations on the device, just a bare Tinker OS installation.

$ uname -a
Linux vexing-eft 4.14.98-imx #1 SMP PREEMPT Wed Jun 9 15:32:53 UTC 2021 aarch64 GNU/Linux

image

It seems that we even have the same kernel version (only the build time is different).

halturin commented 2 years ago

What version of Go are you using? Which OS and distro?

PS: I forgot to mention that I've updated simple.go by adding flag.Parse() in order to use -ergo.trace.
PPS: You may contact me via Telegram (halturin) for instant messaging.

halturin commented 2 years ago

Just to make sure, I've tested it one more time (I recently updated my OS environment). (image attached)

finalclass commented 2 years ago

I'm compiling it on:

$ go version
go version go1.19 linux/amd64
$ uname -a
Linux rog 5.15.59-1-MANJARO #1 SMP PREEMPT Wed Aug 3 11:20:04 UTC 2022 x86_64 GNU/Linux

halturin commented 2 years ago

May I ask you to build it using golang 1.18?

finalclass commented 2 years ago

Unfortunately it's the same.

image

halturin commented 2 years ago

Could you please add flag.Parse() to the main function of simple.go and run it with -ergo.trace on this board?
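The reason flag.Parse() is needed: a flag registered with Go's flag package keeps its default value until the command line is actually parsed. The sketch below shows this with a standalone FlagSet so that it does not depend on ergo; the flag name demo.trace and the helper parseTrace are hypothetical stand-ins, and it is an assumption here that -ergo.trace is registered on the default flag set in a comparable way.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseTrace (hypothetical helper) registers a boolean flag named
// "demo.trace" on its own FlagSet and parses args. Without the Parse call
// the flag would silently stay at its default value, false.
func parseTrace(args []string) (bool, error) {
	fs := flag.NewFlagSet("simple", flag.ContinueOnError)
	trace := fs.Bool("demo.trace", false, "enable extra debug output")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *trace, nil
}

func main() {
	trace, err := parseTrace(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Println("trace enabled:", trace)
}
```

For flags registered on the default flag set, the equivalent step is a single flag.Parse() call at the start of main.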

halturin commented 2 years ago

By the way, is this hostname (or localhost) present in /etc/hosts?

finalclass commented 2 years ago

image

image

halturin commented 2 years ago

Thanks for the quick reply. I'll take a look at why the node couldn't connect to itself on port 4369. PS: do you have any firewall settings on this board?

halturin commented 2 years ago

One last thing I would like to ask: could you please try the v220 branch? It has some changes in the network module.

finalclass commented 2 years ago

This got me thinking. If it hangs on trying to connect to itself then a simple http server should also be broken:

image

So I guess you can close this issue because it's clearly not related to the ergo library.

finalclass commented 2 years ago

There is one difference: with the HTTP server, the server process itself does not hang, and I can stop it with Ctrl+C. The curl request, however, hangs.

halturin commented 2 years ago

I guess there must be a firewall setting that drops any incoming connections. You may check this out with iptables -L and iptables -F to flush them out. Otherwise, I have no clue what it could be. Looks weird.

finalclass commented 2 years ago

It seems that it's not a firewall.

image

When I find some time I will try to reinstall the system.

halturin commented 2 years ago

Since this bug is not related to ergo I'm closing it.