dfskoll / rp-pppoe

Public repository for RP-PPPoE PPPoE client and server software
https://dianne.skoll.ca/projects/rp-pppoe/
54 stars 18 forks source link

syncReadfromPPP: Value too large for defined data type error in user level pppoe #32

Closed mark-dawn closed 5 months ago

mark-dawn commented 5 months ago

Hello, I woild like to report a weird regression/bug (hopefully github is the right place, I saw more activity here than on the debian repo) with recent versions of pppoe in userspace:

For example for my isp connection I have this config:

pty '/usr/sbin/pppoe -s -I enp3s0.835' sync
noauth
# rest of the file

that will produce this result:

$ sudo pppd call myisp
Using interface ppp0
Connect: ppp0 <--> /dev/pts/3
CHAP authentication succeeded
CHAP authentication succeeded
local  IP address -
remote IP address -
primary   DNS address -
secondary DNS address -
pppoe: read (syncReadFromPPP): Session 58587: Value too large for defined data type
Modem hangup
Connect time 0.1 minutes.
Sent 5110 bytes, received 1780 bytes.
Connection terminated.
Using interface ppp0
Connect: ppp0 <--> /dev/pts/3
^CTerminating on signal 2
Modem hangup
Connection terminated.

This seems indipendent from my ISP being stupid or some problem specific of routing through a vlan, as other users have reported the same problem: https://bbs.archlinux.org/viewtopic.php?pid=2155419#p2155419

Incidentally if I switch to the kernel module by doing:

plugin /usr/lib/rp-pppoe/rp-pppoe.so nic-enp3s0.835
noauth
# rest of the file

everything seems to work fine and indeed I get a nice stable connection:

$ sudo pppd call myisp
Plugin /usr/lib/rp-pppoe/rp-pppoe.so loaded.
RP-PPPoE plugin version 4.0 compiled against pppd 2.5.0
Connected to -snip- via interface enp3s0.835
Using interface ppp0
Connect: ppp0 <--> enp3s0.835
CHAP authentication succeeded
CHAP authentication succeeded
peer from calling number -snip- authorized
local  IP address -
remote IP address -
primary   DNS address -
secondary DNS address -
local  LL address -
remote LL address -
[everything fine from here on]

I tried having a look through the code and the error seems unequivocally generated from here and since it's picking up a syscall error over type sizes, perhaps it is some weird interaction with more recent Linux kernels, or possibly a miscompilation on the part of arch maintainers?

dfskoll commented 5 months ago

Huh, this is weird. That error implies that the read system call right here:

https://github.com/dfskoll/rp-pppoe/blob/6a9e87952494040483a3266582c00b759966981f/src/ppp.c#L107

is returning -EOVERFLOW which isn't specifically documented in the read(2) man page.

Is there any way you can run the pppoe program under strace and obtain a system call trace?

dfskoll commented 5 months ago

Ah... hmm... ok, perusing the Linux kernel source, it seems like the kernel wants to return more data than our buffer is asking for. Is your ISP using very large Ethernet packets, larger than the standard 1500 bytes? That could be the problem.

In pppoe.h try changing the line that currently reads:

#define ETH_JUMBO_LEN (ETH_DATA_LEN+8)

to

#define ETH_JUMBO_LEN 9000

and recompiling. This is just a wild untested idea and I don't know if it will have any negative effects elsewhere.

dfskoll commented 5 months ago

Uh... another thing to try, rather than recompiling pppoe. If you put these options in the ppp config file, does it fix the problem?

mru 1500
mtu 1500
mark-dawn commented 5 months ago

Thanks for the quick reply, I will run some tests after work tonight. Curiously I have mru and mtu set in my config (and similarly other people on the forum) but both under 1500 (tested from 1490 down to 1400) to account for the ppp headers, so perhaps that's the problem. I will test both without specifying one and trying the big values then, and then try by strace and recompiling as well.

dfskoll commented 5 months ago

Hmm. The more I look at this, the more I think it might be a kernel bug.

mark-dawn commented 5 months ago

So, quick rundown:

  1. Tried with no mru/mtu option set
  2. Tried with mru/mtu set to 1500
  3. Tried with mru/mtu > 1500

All lead to the same crash. I then straced the process with strace -vT -o strace_pppoe -ff pppd call myisp, and I found the moment in one of the subprocess:

[...more lines...]
read(0, "\377\3\0!E\0\0hN/@\0@\6\3729dB\200\274\300\0M(\262\346\1\273V3\234\307"..., 1504) = 108 <0.000005>
sendto(4, "l1\16\322\361\304D\212[\237R\313\210d\21\0\372\v\0j\0!E\0\0hN/@\0@\6"..., 126, 0, NULL, 0) = 126 <0.000008>
pselect6(6, [0 4 5], NULL, NULL, NULL, NULL) = 1 (in [0]) <0.000264>
read(0, "\377\3\0!E\0\0\322N0@\0@\6\371\316dB\200\274\300\0M(\262\346\1\273V3\235\7"..., 1504) = 214 <0.000009>
sendto(4, "l1\16\322\361\304D\212[\237R\313\210d\21\0\372\v\0\324\0!E\0\0\322N0@\0@\6"..., 232, 0, NULL, 0) = 232 <0.000012>
pselect6(6, [0 4 5], NULL, NULL, NULL, NULL) = 1 (in [0]) <0.000009>
read(0, 0x7ffc1a090a10, 1504)           = -1 EOVERFLOW (Value too large for defined data type) <0.000008>
write(2, "pppoe: read (syncReadFromPPP): S"..., 84) = 84 <0.000011>
getpid()                                = 89259 <0.000006>
sendto(3, "<27>Apr 23 22:56:36 pppoe[89259]"..., 110, MSG_NOSIGNAL, NULL, 0) = 110 <0.000016>
sendto(5, "l1\16\322\361\304D\212[\237R\313\210c\21\247\372\v\0F\2\3\0BRP-PPPoE"..., 90, 0, NULL, 0) = 90 <0.000011>
getpid()                                = 89259 <0.000006>
sendto(3, "<30>Apr 23 22:56:36 pppoe[89259]"..., 43, MSG_NOSIGNAL, NULL, 0) = 43 <0.000009>
exit_group(1)                           = ?
+++ exited with 1 +++

Not entirely sure why that read call doesn't resolve to a char buffer like the others

dfskoll commented 5 months ago

I can't see how a read with a non-zero length is ever suppose to return -EOVERFLOW. Maybe the reason the address of the buffer is printed instead of a string is that the first byte happens to be a zero? Not sure.

I'll keep looking, but I think it might be a bug in the kernel, in the drivers/tty/n_hdlc.c file.

mark-dawn commented 5 months ago

I see an -EOVERFLOW there indeed in the n_hdlc_tty_read function, tho I suppose we're not supposed to ever hit it?

EDIT: upon further staring at the code EOVERFLOW can only be returned if the nr in parameter is 0, so the bug must be in the caller, or maybe I am missing some subtle footgun Digging deeper the line disciple is called by iterate_tty_read

static ssize_t iterate_tty_read(struct tty_ldisc *ld, struct tty_struct *tty,
                struct file *file, struct iov_iter *to)
{
    void *cookie = NULL;
    unsigned long offset = 0;
    ssize_t retval = 0;
    size_t copied, count = iov_iter_count(to);
    u8 kernel_buf[64];

    do {
        ssize_t size = min(count, sizeof(kernel_buf));

        size = ld->ops->read(tty, file, kernel_buf, size, &cookie, offset);

size must be somehow set to 0, since sizeof(kernel_buf) is a fixed positive integer the bug must be in iov_iter_countor theiov_iter is already incorrect when this function is called by tty_read

dfskoll commented 5 months ago

As far as I can tell, nr is passed in as the size argument from read, so it's 1504 in this case. The only way it can be reset to zero AFAIK is in the caller code, in tty_io.c where there's this bit of code inside a loop:

        if (unlikely(copied != size)) {
            count = 0;
            retval = -EFAULT;
        }

and in the next loop iteration, if cookie is non-NULL, n_hdlc_tty_read could get called with an nr of zero. This will free up the cookie but overwrite the -EFAULT error with an -EOVERFLOW return instead.

If this is what's happening, it implies that the buf argument passed to read in the rp-pppoe source is invalid, which I guess is hinted at by your strace output, but I don't see how that's possible. It's a local stack variable declared right before the read.

I wonder if changing the line in rp-ppoe's ppp.c file from:

unsigned char buf[ETH_JUMBO_LEN - PPPOE_OVERHEAD + 2]

to

static unsigned char buf[ETH_JUMBO_LEN - PPPOE_OVERHEAD + 2]

will make a difference? Can't see why it would, but maybe worth a shot?

mark-dawn commented 5 months ago

I noticed as well that the whole keep-cookie/read-more dance in the kernel could get tripped up if nr is 0 but cookie is not set to NULL yet, but I was trying to figure out if the offset math could ever run into an edge case.

I will recompile rp-pppoe with your suggestion and see if something changes, or else it might possibly be a kernel bug in there. I will let you know asap

mark-dawn commented 5 months ago

Something is weird alright for sure:

read(0, "\377\3\0!E\0\0\336\335\354@\0@\6q\330dB\200\274\214Ry\4\257\320\1\273b\21u\305"..., 1504) = 226 <0.000006>
sendto(4, "|!\16\276tCD\212[\237R\313\210d\21\0V\247\0\340\0!E\0\0\336\335\354@\0@\6"..., 244, 0, NULL, 0) = 244 <0.000009>
pselect6(6, [0 4 5], NULL, NULL, NULL, NULL) = 1 (in [0]) <0.000007>
read(0, 0x57986cf1c240, 1504)           = -1 EOVERFLOW (Value too large for defined data type) <0.000007>
write(2, "pppoe: read (syncReadFromPPP): S"..., 84) = 84 <0.000010>
getpid()                                = 145162 <0.000010>
sendto(3, "<27>Apr 24 00:58:04 pppoe[145162"..., 111, MSG_NOSIGNAL, NULL, 0) = 111 <0.000016>
sendto(5, "|!\16\276tCD\212[\237R\313\210c\21\247V\247\0F\2\3\0BRP-PPPoE"..., 90, 0, NULL, 0) = 90 <0.000010>
getpid()                                = 145162 <0.000006>
sendto(3, "<30>Apr 24 00:58:04 pppoe[145162"..., 44, MSG_NOSIGNAL, NULL, 0) = 44 <0.000008>
exit_group(1)                           = ?
+++ exited with 1 +++

added static in fromt buf, same corrupt buffer result. This makes no sense.

dfskoll commented 5 months ago

Huh. Well, the "good" news is I can replicate the bug. I have some LXC machines I use to test rp-pppoe and I see the bug happening.

As a workaround, you can use asynchronous PPP (though if you can use kernel-mode PPP insterad, that's way more efficient.) Just leave out the sync pppd option and the -s pppoe flag.

mark-dawn commented 5 months ago

Oh no problem on that front, I was using the kernel module now anyway, tho I am curious now on what is going on with the bug. Ended up learning something about the kernel of all things. Feel free to ping here when you figure out/release a patch.

Thanks for the great tool and support btw, people like you make free software great

dfskoll commented 5 months ago

I increased the buffer size and also the mtu on my ethernet ports and it started working. Some digging seems to reveal that the read from the TTY is actually trying to return two frames with a single read and that's never supposed to happen. So I think it's a kernel bug, possible in the ppp subsystem, or the tty subsystem, and at this point I'm out of my depth.

I don't know where to report this, but if you find an appropriate place, I'm happy to chime in with my findings.

Regards,

Dianne.

jkroonza commented 5 months ago

@dfskoll can you point at where in the kernel?

read(0, 0x57986cf1c240, 1504) = -1 EOVERFLOW (Value too large for defined data type) <0.000007>

So we're giving a pointer and a buffer size. How do we get to max 1504? I'm having to assume that this value really depends on the negotiated MRU and MTU values? So max write (to ppp) is MRU + PPP_OVERHEAD (4 bytes?) and read (from ppp) should be MTU + PPP_OVERHEAD (usually MTU and MRU should be same but I've seen some strange things over the last two years). If there's JUMBO (>1500 bytes) involved we may run into problems.

Does the read() from the network side - does this include the ethernet header? Because in THAT case we are definitely too small here, we should then have a buffer of at least 1518 bytes (to just cover a single 802.11q tag, in complex networks I've seen this stacked significantly higher, fortunately vxlans are starting to take over those use-cases, but even then our servers are often receiving double-tagged PPPOE frames - and we had to do strange things like manipulating MTU settings on various VLAN layers to get those to successfully come in).

I think (you probably can't assume I recall correctly on this, it was from a crazy busy period in my live) ppp-project have basically said they're abandoning user-space implementation anyway. So I suspect rp-pppoe may just as well follow suite.

mark-dawn commented 5 months ago

One thing to note is that users on the arch forums have reported that old versions of ppp and rp-pppoe work fine on the same connections:

I also tried many settings of mtu also but none of them were stable. I've reverted to using the previous versions. of ppp (2.4.9-3) and rp-pppoe (3.15-2). Eventually it needs fixing since partial upgrades aren't supported and sooner or later some support library will break the packages.

So this might be a change between ppp and rp-pppoe, and since the section of the code that triggers the bug is unchanged from 3 years ago I suppose the bug/change might originate on the ppp side? I wonder if compiling current rp-pppoe against an older version of ppp will produce a working program?

dfskoll commented 5 months ago

One thing to note is that users on the arch forums have reported that old versions of ppp and rp-pppoe work fine on the same connections:

Hmm. I'm testing with an older version of ppp (2.4.5) and the latest rp-pppoe. I'll try with rp-pppoe 3.15 and see if that fixes it, though the ppp.c file in rp-pppoe hasn't changed in ages, probably decades.

Regards,

Dianne.

dfskoll commented 5 months ago

Nope, for me it still breaks with ppp 2.4.5 and rp-pppoe 3.15.

dfskoll commented 5 months ago

@dfskoll can you point at where in the kernel?

read(0, 0x57986cf1c240, 1504) = -1 EOVERFLOW (Value too large for defined data type)

It appears to be here

So we're giving a pointer and a buffer size. How do we get to max 1504?

rp-pppoe does not support an MTU on the PPP interface greater than 1500; that's a hard limit. So yes, it's 1500 + PPP_OVERHEAD.

Does the read() from the network side - does this include the ethernet header?

The read from the network side happens in receivePacket and it reads up to sizeof(PPPoEPacket) bytes, which happens to be 1528 bytes.

Because in THAT case we are definitely too small here, we should then have a buffer of at least 1518 bytes (to just cover a single 802.11q tag,

rp-pppoe knows nothing about VLANs. It assumes it's using a virtual Ethernet interface where the VLAN tags have already been stripped.

I think (you probably can't assume I recall correctly on this, it was from a crazy busy period in my live) ppp-project have basically said they're abandoning user-space implementation anyway. So I suspect rp-pppoe may just as well follow suite.

Yes, ppp-project has abandoned user-space, but there is at least one use-case where it's needed: In a microcontroller without an MMU running ucLinux, dlopen() is not supported, so the rp-pppoe.so plugin cannot be used and you have to use the user-space implementation. I was told this on the mailing list when I proposed abandoning the user-space implementation. (I guess you could probably mangle the ppp code to build rp-pppoe support directly in rather than via a plugin.)

Anyway, since it's broken on kernel 6.8.7 with ppp 2.4.5 and rp-pppoe 3.15, and I'm pretty sure it used to work, my bet is on a kernel bug (or kernel change, at any rate.)

Regards,

Dianne.

mark-dawn commented 5 months ago

Afaik it should be a kernel problem (?) in the sense that in a pty interface reads are not necessarily synchronized with writes right? so if there is any buffering rp-pppoe will receive a bigger (or even smaller but it seems this is not happening) chunk than expected.

If I am reading the code correctly asyncReadFromPPP takes care of doing this piece by piece and this is why it (should, I haven't tested it yet) still works. Considering that the sync option is both passed to user rp-pppoe and pppd I suspect there is some shenanigans to fix the synchronization problem, but it might have been broken in some kernel revision.

Not entirely sure on how the kernel plugin work, I suspect there the synchronization could be a guarantee but that's not broken anyway

Hopefully this is kind of correct because I am not exactly a kernel / tty expert by any means

dfskoll commented 5 months ago

Afaik it should be a kernel problem (?) in the sense that in a pty interface reads are not necessarily synchronized with writes right?

Yes, but the N_HDLC line discipline is supposed to make sure that a given read receives at most one frame. The write calls are considered atomic and boundaries between them are remembered. So if one end writes 1024 bytes and then 900 bytes to the tty, the first read will get at most 1024 bytes even if it specifies a huge buffer size, and the second will get 900.

For some reason, this doesn't seem to be happening and the tty is trying to return two frames in one read.

If I am reading the code correctly asyncReadFromPPP takes care of doing this piece by piece

Correct. That doesn't use the N_HDLC line discipline, so it reads character-by-character and looks for the PPP framing bytes (0x7E) to delimit the frames.

Not entirely sure on how the kernel plugin work, I suspect there the synchronization could be a guarantee but that's not broken anyway

The kernel plugin works completely differently and doesn't involve a tty of any kind, so the code returning -EOVERFLOW isn't involved at all.

Regards,

Dianne.

mark-dawn commented 5 months ago

Thanks for the explanation, I see now why it should be a kernel bug, given the function of the line discipline. I am not sure on how to escalate this to the maintainers. I will look into perhaps making a small test program first to see if it is triggerable outside the rp-pppoe/ppd context.

jkroonza commented 5 months ago

@dfskoll can you point at where in the kernel?

read(0, 0x57986cf1c240, 1504) = -1 EOVERFLOW (Value too large for defined data type)

It appears to be here

commit 3b830a9c34d5897be07176ce4e6f2d75e2c8cfd7 Author: Linus Torvalds torvalds@linux-foundation.org Date: Mon Jan 18 13:31:30 2021 -0800

The relevant commit which last touched that code, what kernel versions are you testing on that provokes this error? Is there any known kernel version that does NOT provoke this?

I can't find that the function will try to pack multiple read buffers into a single buffer, unless the ppp side of the tty will pack multiple IP frames into a single buffer for write(), and that part may well happen in-kernel.

That said, it's possible that the "This is a conversion to use a kernel buffer instead." may have started to coalesce multiple buffers?

Having looked at that commit as a whole, it's clear that nr==0, which I don't get how that can be possible.

Can you verify the "multiple packets/read theory?"

So we're giving a pointer and a buffer size. How do we get to max 1504?

rp-pppoe does not support an MTU on the PPP interface greater than 1500; that's a hard limit. So yes, it's 1500 + PPP_OVERHEAD.

OK perfect.

Does the read() from the network side - does this include the ethernet header?

The read from the network side happens in receivePacket and it reads up to sizeof(PPPoEPacket) bytes, which happens to be 1528 bytes.

The extra 14 bytes is your +8 for ETH_JUMBO_LEN, and the 6 bytes of PPPoE data in PPPoEPacketStruct which really is part of the ethernet payload. I suspect this is a miscalc, but frankly, in the right direction, so this won't cause harm.

Because in THAT case we are definitely too small here, we should then have a buffer of at least 1518 bytes (to just cover a single 802.11q tag,

rp-pppoe knows nothing about VLANs. It assumes it's using a virtual Ethernet interface where the VLAN tags have already been stripped.

OK. So we only ever see the ethernet header (which on the wire includes the VLAN tags @ 4 bytes each, effective), but these should be stripped as the vlan interfaces gets traversed through the kernel. Thus MAX read ever should probably be 1514 based on what you're saying. Plus as you say, we don't ever see those VLAN headers so it doesn't matter. This is in any case on the ethernet side, not the tty side.

I think (you probably can't assume I recall correctly on this, it was from a crazy busy period in my live) ppp-project have basically said they're abandoning user-space implementation anyway. So I suspect rp-pppoe may just as well follow suite.

Yes, ppp-project has abandoned user-space, but there is at least one use-case where it's needed: In a microcontroller without an MMU running ucLinux, dlopen() is not supported, so the rp-pppoe.so plugin cannot be used and you have to use the user-space implementation. I was told this on the mailing list when I proposed abandoning the user-space implementation. (I guess you could probably mangle the ppp code to build rp-pppoe support directly in rather than via a plugin.)

I was not aware of that. Ok, so we need to figure this out.

Anyway, since it's broken on kernel 6.8.7 with ppp 2.4.5 and rp-pppoe 3.15, and I'm pretty sure it used to work, my bet is on a kernel bug (or kernel change, at any rate.)

Ok, I suspect that's POST the 2021 commit, so can we verify that it works with some kernel released prior to that date, and fails with some kernel shortly after?

Then we can take it to the net-dev mailing list and ask for help. So far in my experience the people there have been most helpful.

I just make note that we're seeing sync vs async makes a difference, so does "sync" perhaps imply that all data is serialized into a single stream? Although, this doesn't explain nr==0 either does it? Whereas async still implies separate buffers?

Looking at ppp.c in rp-pppoe, I note there is N_HDLC vs plain read, and to be honest, it looks like possible buffer size mis-calculations here, why not simply:

Can we please wrap-up pppoe binary in a script similar to this?

#! /bin/sh

exec strace -o /tmp/pppoe.txt "$0.real"

Rename existing binary to pppoe.real and create above script in it's place. That way at least we can confirm if we're looking at the N_HDLC or "plain read" variant, and verify return codes coming from the system calls.

What's the purpose of throwing away the first two bytes of the read btw? What are the content of those two bytes?

dfskoll commented 5 months ago

The relevant commit which last touched that code, what kernel versions are you testing on that provokes this error? Is there any known kernel version that does NOT provoke this?

I don't know, and I'm not going to test it, since I no longer use rp-pppoe myself in production. Installing different kernels is just too much of a pain for me.

Having looked at that commit as a whole, it's clear that nr==0, which I don't get how that can be possible.

You have to go back to the caller, which can reset nr to zero but still make another call to the read function, if it fails to copy data to user-space. The line is here

Can you verify the "multiple packets/read theory?"

Yes. I temporarily greatly increased the buffer size in rp-pppoe, and saw that it was reading 1608 bytes. I've truncated the strace output a bit:

read(0, "\xff\x03...\x00\x00", 10504) = 1608

I captured the packet and converted it to pcap format to look at it with Wireshark. As you see from the attached dump, there's another PPP frame at offset 0x5ea (1514). (Note: I hand-edited the dump to "pretend" it is PPPoE so Wireshark can intelligently decode it, so the offset in the actual read data is 1496. capture.zip )

The extra 14 bytes is your +8 for ETH_JUMBO_LEN, and the 6 bytes of PPPoE data in PPPoEPacketStruct which really is part of the ethernet payload. I suspect this is a miscalc, but frankly, in the right direction, so this won't cause harm.

Yup. I knew I was over-estimating.

Ok, I suspect that's POST the 2021 commit, so can we verify that it works with some kernel released prior to that date, and fails with some kernel shortly after?

That's what needs testing... but I can't do it.

What's the purpose of throwing away the first two bytes of the read btw? What are the content of those two bytes?

Those are the PPP address and control bytes, which are always 0xFF, 0x03 and are not transmitted as part of the PPPoE frame. Those are leftovers from the HDLC spec.

dfskoll commented 5 months ago

Here's a hex dump of the actual return from read. The second PPP frame begins at offset 0x5d8 = 1496

That makes sense if the mtu of the PPP interface is 1492, because we have the two PPP frame bytes of 0xff, 0x03 followed by the two-byte PPP protocol for IPv4 (0x00, 0x21).

hexdump.txt

jkroonza commented 5 months ago

Can you confirm if you compile with the N_HDLC path if you observe the same problem? This could be highly problematic for the sync version of the code, but this is almost certainly a kernel bug, seemingly introduced by the big man himself.

dfskoll commented 5 months ago

Can you confirm if you compile with the N_HDLC path if you observe the same problem?

I'm not sure what you mean by "compile with the N_HDLC path". Could you clarify? Or did you mean patch?

Hmm.... n_hdlc is a module so I suppose I could try compiling an earlier version which wouldn't require replacing the entire kernel... let me investigate...

dfskoll commented 5 months ago

I reckon it's either that commit or this one: https://github.com/torvalds/linux/commit/dd78b0c483e33225e0e0782b0ed887129b00f956

mark-dawn commented 5 months ago

One small further confirmation that this is a kernel bug or at least not something on rp-pppoe side: since all this "new"kernel plumbing actually enables readv() to work with the N_HDLC linedisc, I tried to enable the iovec related read code here, without changing the HAVE_N_HDLC def elsewhere, and indeed the code now works...but will still terminate with the same EOVERFLOW:

[...]
readv(0, [{iov_base="\377\3", iov_len=2}, {iov_base="\0!E\0\3w\21\t@\0@\6:3dB\200\274#\272\343\214\226\304\1\273\233;\367j\263\346"..., iov_len=1502}], 2) = 891 <0.000008>
[...]
readv(0, [{iov_base=0x7fff3bbaca76, iov_len=2}, {iov_base=0x7fff3bbacb64, iov_len=1502}], 2) = -1 EOVERFLOW (Value too large for defined data type) <0.000008>
write(2, "pppoe: read (syncReadFromPPP): S"..., 82) = 82 <0.000010>
[...]

not particularly helpful but it really nails down the culprit to the new kernel code, or some mistake on the ppp side. One thing that bugs me is that several reads before go through just fine, and it starts going awry only when the big boy traffic starts, so perhaps it's really a mistake on the ppp, especially since the kernel gives back a packet stuffed full with half of the next packet

jkroonza commented 5 months ago

Can you confirm if you compile with the N_HDLC path if you observe the same problem?

I'm not sure what you mean by "compile with the N_HDLC path". Could you clarify? Or did you mean patch?

Hmm.... n_hdlc is a module so I suppose I could try compiling an earlier version which wouldn't require replacing the entire kernel... let me investigate...

In ppp.c there's two different mechanisms to read() from the kernel, the N_HDLC path uses readv rather than read.

jkroonza commented 5 months ago

One small further confirmation that this is a kernel bug or at least not something on rp-pppoe side: since all this "new"kernel plumbing actually enables readv() to work with the N_HDLC linedisc, I tried to enable the iovec related read code here, without changing the HAVE_N_HDLC def elsewhere, and indeed the code now works...but will still terminate with the same EOVERFLOW:

[...]
readv(0, [{iov_base="\377\3", iov_len=2}, {iov_base="\0!E\0\3w\21\t@\0@\6:3dB\200\274#\272\343\214\226\304\1\273\233;\367j\263\346"..., iov_len=1502}], 2) = 891 <0.000008>
[...]
readv(0, [{iov_base=0x7fff3bbaca76, iov_len=2}, {iov_base=0x7fff3bbacb64, iov_len=1502}], 2) = -1 EOVERFLOW (Value too large for defined data type) <0.000008>
write(2, "pppoe: read (syncReadFromPPP): S"..., 82) = 82 <0.000010>
[...]

not particularly helpful but it really nails down the culprit to the new kernel code, or some mistake on the ppp side. One thing that bugs me is that several reads before go through just fine, and it starts going awry only when the big boy traffic starts, so perhaps it's really a mistake on the ppp, especially since the kernel gives back a packet stuffed full with half of the next packet

I'm thinking kernel, because that's the only common factor. We know older ppp worked with 3.15 on rp-pppoe, however, testing shows that same but on newer kernel fails. I'm thinking it's one of those two patches and reckon that we've got enought to take this to the LKML (specifically the net-dev sub list).

jkroonza commented 5 months ago

I reckon it's either that commit or this one: torvalds/linux@dd78b0c

This one feels less likely IMHO but not impossible either.

I'll draft an email tomorrow and post it here before taking it to the netdev list.

dfskoll commented 5 months ago

What seems to be happening is that the kernel is trying to return multiple frames in one read call when it should only ever return a single frame. If you manually set the MTU of the ppp0 interface to something low, like 640, before you try transmitting any bulk data, it doesn't crash. Somewhere between mtu=640 and mtu=800, I see the crash happening. So the bug might not be in the n_hdlc.c or tty_io.c code... it could be in the PPP subsystem.

mark-dawn commented 5 months ago

I can confirm that this is the case. This really makes me think it's a bug in the ppp subsystem, because I can't really see why would the kernel read_iter facility stop stuffing two packets in one buffer once the buffer is double the length. And it's not doing that because once I set the mtu of the ppp interface to 600 my connection to the ISP works fine. If I was sending out chimeric packets I would expect it to not work.

dfskoll commented 5 months ago

Well... it could work even with chimeric packets because TCP is pretty tolerant of dropped packets, and I think the first packet in the mixture would correctly make it through.

mark-dawn commented 5 months ago

Well would it still be quite sluggish? I felt like the page I loaded loaded in an instant anyway...but you know, not very scientific measure I'll admit. I'll check later with wireshark if I can prove it. After dinner.

dfskoll commented 5 months ago

OK, this is weird. I wrote this little test program that puts a pseuto-tty into N_HDLC mode. I make two writes to one side of the pty, sleep a bit, and then read from the other side. Sometimes I read back the data as it was written, in two separate chunks... and sometimes I get back the whole thing in one chunk. I'm sure this is not the right behaviour.

Some runs:

read(4, buf, 2048) = 6
buf = |Hello |
read(4, buf, 2048) = 5
buf = |world|

Other runs:

read(4, buf, 2048) = 11
buf = |Hello world|

test_n_hdlc.tar.gz

dfskoll commented 5 months ago

All right! Nailed it down to a race condition in the tty handler. See updated attached test file.

If you run: ./test_n_hdlc a few times, you'll see that sometimes it combines the two writes into a single read.

If you run: ./test_n_hdlc foo (any arg can be used in place of foo) then I add a 0.1s sleep between the writes, and I have never seen them combined into a single read.

test_n_hdlc-ver-2.tar.gz

mark-dawn commented 5 months ago

Excellent find! I can confirm similar behavior by playing with the first example, so it's a race condition in the N_HDLC linedisc, probably from the ppp side writing very quickly and coalescing the message in a single buffer. So the bug is probably in the write functions, since in those examples the reads never interleave with the writes, so the bug cannot originate from the interplay between read and writes calls. Luckily because the ppp subsystem is quite large and combing for a bug there would be worse.

This is my test with the whole first ecxample run in a tight 500 iters loop and collecting the number of failures:

$ ./test_n_hdlc 
posix_openpt = 3
grantpt(3)  = 0
unlockpt(3) = 0
ptsname(3) = /dev/pts/4
open(/dev/pts/4) = 4
ioctl(3, TIOCSETD) = 0
ioctl(4, TIOCSETD) = 0
total failures: 193 (96.500000%)

and this with a nanosleep between the two writes:

$ ./test_n_hdlc 
posix_openpt = 3
grantpt(3)  = 0
unlockpt(3) = 0
ptsname(3) = /dev/pts/4
open(/dev/pts/4) = 4
ioctl(3, TIOCSETD) = 0
ioctl(4, TIOCSETD) = 0
total failures: 0 (0.000000%)
dfskoll commented 5 months ago

I've reported this to the TTY subsystem maintainers, who are listed as Greg KH and Jiri Slaby.

dfskoll commented 5 months ago

Final update: I have not been able to narrow down the bad commit, but the kernel bug must be pretty old... my test program passed on Debian 10 (kernel 4.19.67) and failed on Debian 11 (kernel 5.10.209). So the bug was introduced somewhere in between those two versions.

jkroonza commented 5 months ago

Damn! That's crazy. But given that even when the rest of the world was starting to use kernel 2.6 the montavista embedded stuff (which was considered the way to go with embedded systems by commercial teams at the time) kernel was at the time still running 2.2 with no plans to migrate even to 2.4 ... I'm not surprised this hasn't been picked up before.

I have a couple of questions though:

IP world is generally considered async, so why do we care about sync at all?

Does this indicate that no one is using userspace pppoe (I think we already confirmed this as a "no userspace pppoe is still in use") or does it indicate that (almost) no one use using sync-mode pppoe?

I can obviously raise my suspicions, but I've got no sensible way to validate that.

Does it imply we have to somehow find packet starts in a single stream?

If I had any chance of booting an old enough kernel (NVMe code was added more recently) I'd happily go git bisect on this but I'm afraid I don't, @dfskoll do you have a test environment that you're using which I can clone (perhaps just a disk image and kvm settings?) so that we can go git bisect? Alterantively, might be worthwhile to just add your findings for Greg and Jiri? Btw, I would add an appropriate ML from https://lore.kernel.org/ too - I just can't figure out which one (perhaps https://lore.kernel.org/linux-serial/ - def seeing tty related mail on that one)

Alternatively https://bugzilla.kernel.org/ seems fairly active too.

dfskoll commented 5 months ago

IP world is generally considered async, so why do we care about sync at all?

N_HDLC is a TTY line discipline that is supposed to make sure that the data written by each write system call is read by exactly one read system call, rather than merging data from multiple write calls into one. It's done this way so that PPP frame boundaries are preserved and we don't need to expensively scan the data we get from the TTY looking for PPP frame delimiters.

Does this indicate that no one is using userspace pppoe (I think we already confirmed this as a "no userspace pppoe is still in use") or does it indicate that (almost) no one use using sync-mode pppoe?

Probably both?

Does it imply we have to somehow find packet starts in a single stream?

There's already code in the async functions that scan for frame delimiters. N_HDLC was supposed to obviate the need for that.

If I had any chance of booting an old enough kernel (NVMe code was added more recently) I'd happily go git bisect on this but I'm afraid I don't, @dfskoll do you have a test environment that you're using which I can clone (perhaps just a disk image and kvm settings?) so that we can go git bisect?

I don't have a test environment I can share, unfortunately. I tried git bisect but the really old kernels wouldn't compile on my system and I gave up because (1) I'm not a kernel developer and (2) I don't care all that much about this bug because there are two workarounds (kernel-mode PPPoE or the async user mode.)

Alterantively, might be worthwhile to just add your findings for Greg and Jiri? Btw, I would add an appropriate ML from https://lore.kernel.org/ too - I just can't figure out which one (perhaps https://lore.kernel.org/linux-serial/ - def seeing tty related mail on that one)

I reported at https://lore.kernel.org/linux-serial/20240424173114.035ddd7b@gato.skoll.ca/T/#u and have had no replies, so it seems this isn't too much of a priority.

jkroonza commented 5 months ago

Ok then. I'm going to close this on the basis that we can't fix it, kernel bug. There are two ways around this, in order of preference:

  1. If you can (and I believe this applies to all Linux environments) use kernel-mode pppoe.
  2. If you can't, use async userspace, not sure if it's faster/slower compared to sync, but if you're using userspace pppoe you're already down the poor performance route so I can't say I particularly care that much.

If you do care, drive the bug at https://lore.kernel.org/linux-serial/20240424173114.035ddd7b@gato.skoll.ca/T/#u.