io:format changes the pids values to incorrect formats when used inside a remsh connection #6825

Open gonzalobf opened 1 year ago

gonzalobf commented 1 year ago

Describe the bug

Sorry if this has already been reported, but I couldn't find an existing bug about it.

I found this issue when I was accessing one of our servers with erl -remsh. I was trying to get the state of a process that contains a second process pid inside of the state. When I tried to use the pid printed from the state, I realised that the pid was starting with <8xxx.xxx.xxxx> and the erlang terminal complained about the format of the pid:

(prod@server)1> <8614.726.0>.
* 1:1: syntax error before: '<'

I suppose the node number changed because it was a remsh connection, but I was surprised that the pid was also considered invalid. Probably because the node number was too big?

I'm not sure if there is another reason for this behavior, but I thought it was worth reporting.

To Reproduce

Start two terminals, the first one with -sname:

erl -sname test@localhost 

In the second terminal, connect with remsh to the first one and use io:format to print some pids:

erl -remsh test@localhost
(test@localhost)1> [io:format("~p~n", [P]) || P <- erlang:processes()].
<8790.0.0>
<8790.1.0>
(test@localhost)2> erlang:processes().
[<0.0.0>,<0.1.0>,<0.2.0>,<0.3.0>,<0.4.0>,<0.5.0>,<0.6.0>,
 <0.7.0>,<0.10.0>,<0.42.0>,<0.44.0>,<0.46.0>,<0.47.0>,

Expected behavior

I would expect the pids to be printed with the same values that erlang:processes() returns, or at least in a format the shell accepts.

Affected versions

At least master and OTP-25.2.2

Thank you

gonzalobf commented 1 year ago

(After posting the bug I got a notification about a failing github action https://github.com/erlang/otp/actions/runs/4126251562)

josevalim commented 1 year ago

~FWIW, I cannot reproduce this, neither in master nor in 25.2:~ (the snippet below is wrong)

$ bin/erl --sname bar --remsh foo
Erlang/OTP 26 [DEVELOPMENT] [erts-13.1.4] [source-e35cd70570] [64-bit] [smp:10:10] [ds:10:10:10] [async-threads:1] [jit]

Eshell V13.1.4 (press Ctrl+G to abort, type help(). for help)
1> erlang:processes().
[<0.0.0>,<0.1.0>,<0.2.0>,<0.3.0>,<0.4.0>,<0.5.0>,<0.6.0>,
 <0.7.0>,<0.10.0>,<0.42.0>,<0.44.0>,<0.46.0>,<0.47.0>,
 <0.49.0>,<0.50.0>,<0.51.0>,<0.52.0>,<0.53.0>,<0.54.0>,
 <0.55.0>,<0.56.0>,<0.57.0>,<0.58.0>,<0.59.0>,<0.60.0>,
 <0.61.0>,<0.62.0>,<0.63.0>,<0.64.0>|...]
2> [io:format("~p~n", [P]) || P <- erlang:processes()].
<0.0.0>
<0.1.0>
<0.2.0>
<0.3.0>
<0.4.0>
<0.5.0>
<0.6.0>

gonzalobf commented 1 year ago

@josevalim I can't reproduce it with your commands either. Would you mind trying the ones from the description? I don't know why it would be different, but I get different results.

I just double-checked and I still get the error. I compiled from master and these are the results:

$ bin/erl -sname test
Erlang/OTP 26 [DEVELOPMENT] [erts-13.1.4] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit:ns]

Eshell V13.1.4 (press Ctrl+G to abort, type help(). for help)
(test@localhost)1>
[gonzalo@precision otp]$ bin/erl -remsh test
Erlang/OTP 26 [DEVELOPMENT] [erts-13.1.4] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit:ns]

Eshell V13.1.4 (press Ctrl+G to abort, type help(). for help)
(test@localhost)1> [io:format("~p~n", [P]) || P <- erlang:processes()].
<9646.0.0>
<9646.1.0>
<9646.2.0>
<9646.3.0>
<9646.4.0>

I'm not sure if it is relevant, but I see this on both Ubuntu 20.04 and Arch Linux.

josevalim commented 1 year ago

Oh, this is totally my bad. I passed --sname instead of -sname and that obviously won't work. With the proper flags I can reproduce it.

I think the issue is the following:

  1. io:format/2 sends the formatting request to the group leader. In this case the group leader lives on the connector node, so <8790.0.0> is how the connector node sees the pids of the node you connected to (test@localhost).

  2. When you call erlang:processes(), I assume the connected node (test@localhost) formats the pids locally and sends only the text to be printed to the connector node.

You can fix io:format by forcing the actual formatting to happen on the connected node:

[io:format(io_lib:format("~p~n", [P])) || P <- erlang:processes()].
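A variant of the same idea, as a minimal sketch: io:format/1 treats its argument as a format string, so text that happens to contain `~` would be interpreted again. Passing the already formatted characters to io:put_chars/1 avoids that. `FormatPid` is just an illustrative name, not something from OTP.

```erlang
%% Format the pid in the process that evaluates this expression,
%% i.e. on the connected node when typed into a remsh session.
FormatPid = fun(P) -> lists:flatten(io_lib:format("~p", [P])) end.

%% Only finished characters are sent to the group leader,
%% so it has nothing left to format.
[io:put_chars([FormatPid(P), $\n]) || P <- erlang:processes()].
```

The list comprehension still returns a list of `ok` atoms, one per process, exactly as the io:format version does.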

I think it is worth asking if there is a reason to delegate the formatting to the group leader instead of always formatting in the caller process. Perhaps this way less data has to be copied and you are less likely to overload the group leader with formatting work?

gonzalobf commented 1 year ago

No problem, thanks for looking at this. I'm not very familiar with how the group leader works, but your explanation makes sense.

The only thing I don't fully understand is why Erlang refuses to accept pids like <8790.0.0>. Maybe those numbers are reserved for remsh connections?

josevalim commented 1 year ago

When you do -remsh test@localhost, there are actually two nodes. You don't see one of the nodes but it is there:

1> node(erlang:group_leader()).
'1HL4LUUGJUDHO@localhost'

The group leader is the one responsible for doing all I/O operations and, as you can see, it runs on the "hidden" connector node, not test@localhost.

My understanding is that <9646. is how test@localhost sees the processes in 1HL4LUUGJUDHO@localhost. <8790. is how 1HL4LUUGJUDHO@localhost sees the processes in test@localhost. So it doesn't work because, when you evaluate <8790., you are asking test@localhost to interpret 1HL4LUUGJUDHO@localhost's view of the world.
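To check which node is which from inside the remsh session, something like the following should work. It is only a sketch that uses the node/0, node/1 and pid_to_list/1 BIFs, and `Leader` is an illustrative variable name.

```erlang
%% Typed into the remsh shell, so these expressions are evaluated
%% on the connected node.
Leader = erlang:group_leader().

node().               % the node evaluating the expressions
node(Leader).         % the node that owns the group leader and performs the I/O
pid_to_list(self()).  % the pid as text, produced by the evaluating node
```

If node() and node(Leader) differ, anything the group leader formats itself is printed from the other node's point of view.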

rickard-green commented 1 year ago

@josevalim's explanation is correct. We are not sure why io:format() forwards the formatting information to the group leader instead of formatting it locally. The reason is probably that, at least in some cases (or maybe as preparation for the future), not all information about how to format is available locally, but it is available at the group leader. These parts of the code were written a long time ago by people who no longer work here, so we need to dig into the code in more detail to give a better answer and to decide whether we should make any changes here. Currently we do not have the time to do that, so this will be a job for the future.

robertoaloi commented 1 year ago

I just noticed this behaviour as well. To me the most unexpected part is that a syntax error is returned based on the validity of the PID (or, rather, of the part of the PID that identifies the remote node). While I can produce non-existing local or remote PIDs without issues, I cannot create a fake PID using a non-existing node identifier. In other words:

(bar@mbp)4> <10206.31212121212.0>.
<10206.1147350140.0>
%% But if I kill the remote node
(bar@mbp)6> <10206.31212121212.0>.
* 1:1: syntax error before: '<'

Receiving a syntax error based on the external environment was indeed surprising.