ext4: Infinite wait on 'sync_page'

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Trying to reboot router with USB boot device connected.
2. Lauching a scripts with cron daemon.
3.

What is the expected output? What do you see instead?

What version of the product are you using?
r4783

Please provide any additional information below.
I don't have problems at all with r4603 and Busybox is the same version as 
r4783, something is wrong with this last builds. I tested r4770, r4772, r4783, 
same problem exist.

BusyBox v1.20.2 (2012-09-18 02:10:56 MSK) multi-call binary.

I have 3 hdds connected, the problem only exist when i have connected the fist 
hdd (boot hdd), or if i dont boot it (boot as 2nd or 3nd disc / no boot hdd)

Original issue reported on code.google.com by hggo...@gmail.com on 4 Jan 2013 at 2:01

GoogleCodeExporter commented 9 years ago

I will try r4667 next, if it works ok i will try > r4667 until the problem 
exist again, that way will be easier to identify the issue based on version 
build changelog.

Thanks.

Original comment by hggo...@gmail.com on 4 Jan 2013 at 3:13

GoogleCodeExporter commented 9 years ago

I can't reproduce such behavior, so will wait until you able to provide console 
logs and/or revision bisect.

Original comment by lly.dev on 4 Jan 2013 at 4:57

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

r4667 works perfectly aswell, next try will be r470x.

Original comment by hggo...@gmail.com on 5 Jan 2013 at 6:03

GoogleCodeExporter commented 9 years ago

I'll suggest revision bisect - i.e. r4725, after that r4752(up) or r4695(down), 
etc...

Original comment by lly.dev on 6 Jan 2013 at 3:32

GoogleCodeExporter commented 9 years ago

Roger that, i will let you know soon, thanks. ;-)

Original comment by hggo...@gmail.com on 6 Jan 2013 at 11:37

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

I decided testing r4705 and it works just fine :-) that means the issue exists 
between r4705 and r4770 version, tomorrow i will try r4725. Btw, in case i 
found the version where the problem shows what console logs you want/need? 
Syslog? Problem is when i reboot router log will not show. ;-)

Original comment by hggo...@gmail.com on 7 Jan 2013 at 4:43

GoogleCodeExporter commented 9 years ago

Due to syslog turned off on reboot, serial console logs can be useful only.

Original comment by lly.dev on 7 Jan 2013 at 7:40

GoogleCodeExporter commented 9 years ago

Typo: I installed r4709 not r4705.

I decided testing r4709 and it works just fine :-) that means the issue exists 
between r4709 and r4770 version, tomorrow i will try r4725. Btw, in case i 
found the version where the problem shows what console logs you want/need? 
Syslog? Problem is when i reboot router log will not show. ;-)

Original comment by hggo...@gmail.com on 7 Jan 2013 at 2:42

GoogleCodeExporter commented 9 years ago

Tested r4725 and r4717 same situation. Problem exists between r4709 and r4717, 
i believe it's related to ext4 filesystem (based on changelogs), that's the 
filesystem im using.

I need r4710=>r4716 to be sure which one is causing it.

Thanks

Original comment by hggo...@gmail.com on 8 Jan 2013 at 4:35

GoogleCodeExporter commented 9 years ago

I need your router model to prepare test build.

Please try to manually unmount ext4 volumes before reboot. Will umount 
succeeded?

Original comment by lly.dev on 9 Jan 2013 at 12:29

GoogleCodeExporter commented 9 years ago

Model is RT-N16. I think i cannot umount it even if i kill all services running 
on the boot HDD.

Original comment by hggo...@gmail.com on 9 Jan 2013 at 3:17

GoogleCodeExporter commented 9 years ago

Yes, you have to stop external services. Unfortunately, I can't reproduce the 
problem with ext4 fs. Are you sure that jbd2/ext4 modules corresponds kernel 
version?

Original comment by lly.dev on 10 Jan 2013 at 7:11

GoogleCodeExporter commented 9 years ago

Sure i have, modules are fine, they always match with the fw version installed. 
I cannot understand the reason why you cannot reproduce it. I dont have serial 
console on this router so i guess it will be hard to identity the problem...

Original comment by hggo...@gmail.com on 10 Jan 2013 at 8:32

GoogleCodeExporter commented 9 years ago

What version have you tested? Have you used a boot script? If i connect HDDS 
only (without services booting on that device) i can reboot also, the problem 
exists only when the HDD is running a boot script / launching services.

Original comment by hggo...@gmail.com on 10 Jan 2013 at 8:47

GoogleCodeExporter commented 9 years ago

Could it be related with swap partition on the HDD? On my boot script i have 
#swapon /dev/sda2

Original comment by hggo...@gmail.com on 10 Jan 2013 at 9:05

GoogleCodeExporter commented 9 years ago

I should test r4710=>r4716 to be sure which one is causing it, can you supply 
those versions to me? Thanks.

Original comment by hggo...@gmail.com on 10 Jan 2013 at 9:14

GoogleCodeExporter commented 9 years ago

Please be patient, we unable to provide 24h support.

Original comment by lly.dev on 10 Jan 2013 at 3:59

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

Im pacient, no hurry's. Why you say that? Just because i posted 4 times? Take 
your time...

Original comment by hggo...@gmail.com on 10 Jan 2013 at 4:34

GoogleCodeExporter commented 9 years ago

I apologize that most of answers are trivial:
1) All of us, except special cases, always test current trunk
2) I use fstab & trivial post-* scripts. No large SysV5 rc startup scripts
3) I don't think that problem related with swap, but can't guarantee it
4) Since you hasn't console, you have to test unmount procedure - it can be 
done without stop of system services like syslog etc.
5) My test-case is simple - mount ext4 volume, write+read some data, unmount, 
reboot

r4714 for test:
http://wl500g.googlecode.com/files/RT-N16-1.9.2.7-rtn-r4714.trx
http://wl500g.googlecode.com/files/modules-1.9.2.7-rtn-r4714-MIPS32r2.tgz

Original comment by lly.dev on 12 Jan 2013 at 9:33

GoogleCodeExporter commented 9 years ago

No problem ;)

Probably r4714 is ok, the problem could start at r4710,11,12,13,14,15,16. Your 
test on ext4 filesystem should be on higher versions like i reported before. 
But im going to test r4714 and get back to you ;-)

Thanks

Original comment by hggo...@gmail.com on 12 Jan 2013 at 4:06

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

I guess i was right, r4714 works just fine, it means its r4715 or r4716 causing 
the problem. I need both to test them and see which one is creating this 
situation.

I've tested last FW (r4788) and the problem exists again, so the conclusion is 
this:

< or = r4714 => ok
> or = r4717 => does not reboot / shutdown

Thanks once again.

Original comment by hggo...@gmail.com on 13 Jan 2013 at 4:08

GoogleCodeExporter commented 9 years ago

From what i've seen on changelogs (r4715 and r4716) i bet it's the last one:

kernel: ext4: wait for writeback to complete while making pages writable

Answering your statements:

1) So if is tested how no one noticed this problem? Not using ext4 filesystem?
2) Also.
3) Neither do i.
4) I could never umount the boot drive (drive or resource busy) i cannot find 
what is using the drive, i killed all processes...
5) Answered on 4.

Original comment by hggo...@gmail.com on 13 Jan 2013 at 4:32

GoogleCodeExporter commented 9 years ago

Depends on issue. For this issue I test USB flash with ext4 filesystem, as I 
wrote it before.

You use external USB drive as root fs? It will be "boot drive" in such case 
only. Otherwise, it is ordinary fs. Please show your mount list.

You can find process which uses mount point with "fuser" utility, "fuser -c 
/mnt" for example.

For r4717 you should refresh modules on top of r4714 - 
http://wl500g.googlecode.com/files/modules-1.9.2.7-rtn-r4717-MIPS32r2.tgz

Original comment by lly.dev on 13 Jan 2013 at 2:42

GoogleCodeExporter commented 9 years ago

Yes i'm using USB drive as root fs.

/dev/sda1             56467292  13991048  39607872  26% /opt => root fs
/dev/sdb1            960643212 838166384  73678992  92% /tmp/mnt => ordinary fs
/dev/sdc1            960649188 891470768  20376556  98% /tmp/mnt2 => ordinary fs

I already tested r4717 as you can see on my previous posts, it does not reboot.

The versions i need for testing are r4715 and r4716.

After killing all process i see this with fuser:

 USER        PID ACCESS COMMAND
/opt:                root     kernel mount /opt

Original comment by hggo...@gmail.com on 13 Jan 2013 at 7:15

GoogleCodeExporter commented 9 years ago

With 99% probability it is due to r4716

Are you able to remount rootfs read-only?

External rootfs can uncover a lot of problems. I will try to reproduce it...

P.S. Anyway, you should write about USB rootfs at issue creation.

Original comment by lly.dev on 14 Jan 2013 at 8:42

GoogleCodeExporter commented 9 years ago

I'm pretty sure it's r4716 too.

Never try that, but i'm sure i can.

Why you say that? I always used external rootfs and never seen nothing so far. 
Ok good luck, you will reproduce it for sure...

Original comment by hggo...@gmail.com on 14 Jan 2013 at 2:57

GoogleCodeExporter commented 9 years ago

Due to ext4 can't be used as external rootfs without modifying "linuxrc" script 
and/or firmware image or special custom scripts. Since you use unmodified FW, 
seems to be you use custom script, please provide it (by pastebin.com, email, 
etc. - depends on it's size).

Original comment by lly.dev on 15 Jan 2013 at 8:09

GoogleCodeExporter commented 9 years ago

Well that's really a surprise for me, since i ALWAYS used ext4 and never got a 
problem with it...

Original comment by hggo...@gmail.com on 15 Jan 2013 at 11:09

GoogleCodeExporter commented 9 years ago

I can now umount /opt manually,  it's not a problem anymore. But r4716 or 
greater has the same issue, no reboot, the issue continues, that means it's not 
related. Syslog does not shows nothing, can you suspend r4716?

Original comment by hggo...@gmail.com on 19 Jan 2013 at 3:57

GoogleCodeExporter commented 9 years ago

You are the single who experience such problem. Unless we can determine root of 
the problem, I don't want to revert r4716. Additionally, it has positive 
comments from some customers.

The next step, I guess, is try to unmount all ext4 volumes with reboot 
following.

Original comment by lly.dev on 19 Jan 2013 at 6:56

GoogleCodeExporter commented 9 years ago

I can i try that and let you know, but i cannot see any positive comments on 
r4716. Tks

Original comment by hggo...@gmail.com on 19 Jan 2013 at 12:40

GoogleCodeExporter commented 9 years ago

Finally i found the problem, samba service(smbd and nmbd)get stucked on >r4616, 
the process cannot be killed that's why reboot doesn't work:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       691  0.0  0.8  10332  1048 ?        Ds   04:25   0:00 /opt/sbin/nmbd 
-D
root       693  0.0  1.3  18396  1664 ?        Ds   04:25   0:00 /opt/sbin/smbd 
-D

< r4616:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       694  0.0  0.7  10327  1048 ?        S   03:21   0:00 /opt/sbin/nmbd 
-D
root       696  0.0  1.2  17315  1664 ?        S   03:21   0:00 /opt/sbin/smbd 
-D

STAT Ds => Cannot be killed then no reboot is possible
STAT S => Works just fine

Question here is why >r4616 does samba service work on STAT Ds and not on S?

Original comment by hggo...@gmail.com on 20 Jan 2013 at 4:46

GoogleCodeExporter commented 9 years ago

The major question I forgot to ask you - you are using Optware or Entware?

Original comment by lly.dev on 20 Jan 2013 at 6:14

GoogleCodeExporter commented 9 years ago

I'm using Optware.

Original comment by hggo...@gmail.com on 20 Jan 2013 at 2:36

GoogleCodeExporter commented 9 years ago

It might be artifacts of ancient uClibc used in Optware.
1) check proper usage of LD_LIBRARY_PATH (unset it best of all)
2) find on which event samba is waiting waiting - "cat /proc/<pid>/wchan"

Original comment by lly.dev on 20 Jan 2013 at 3:45

GoogleCodeExporter commented 9 years ago

1) I never set LD_LIBRARY_PATH
2) [root@proxy /]$ cat /proc/852/wchan
sync_page
[root@proxy /]$

Original comment by hggo...@gmail.com on 20 Jan 2013 at 8:35

GoogleCodeExporter commented 9 years ago

So far i cannot get rid of this Samba problem on >r4716, the only way was using 
<r4716.

I need to understand what's inside of r4716 that's causing this on Samba.

Original comment by hggo...@gmail.com on 21 Jan 2013 at 2:44

GoogleCodeExporter commented 9 years ago

Investigations will take some time, I can't answer to your question at now.

Original comment by lly.dev on 21 Jan 2013 at 4:02

GoogleCodeExporter commented 9 years ago

Sure no problem, when you've got the time :-) i will stay with r4714 in the 
meantime.

Thanks.

Original comment by hggo...@gmail.com on 21 Jan 2013 at 5:36

GoogleCodeExporter commented 9 years ago

Another soluction to this problem would be using EXT3 filesystem instead of 
EXT4...

Original comment by hggo...@gmail.com on 21 Jan 2013 at 5:44

GoogleCodeExporter commented 9 years ago

Which version of samba your use? In case of samba 3, is "use sendfile" 
parameter set?
To achieve problem, is it enough to start samba without accessing network 
shares?

Original comment by lly.dev on 22 Jan 2013 at 11:46

GoogleCodeExporter commented 9 years ago

I always used version 3, 3.5 and now 3.6. On < r4716 it always worked well (and 
it's working) after >r4716 the problem appeared on ANY version of Samba or 
config setting, even with or without sendfile option. 

Well, i think so, never try that, but i bet the same will happen i can try and 
then get back to you, i really don't know if it justify's the time "wasted" 
trying to solve this knowing that's r4716 causing it, but we can try exaust all 
options to achieve the problem, you can count me in on this.

Thanks once again lly

Original comment by hggo...@gmail.com on 22 Jan 2013 at 2:28

GoogleCodeExporter commented 9 years ago

Forget to tell you, if you use "smbd -i (interactive mode) -d 9 (debug level 
9)" you will see everything what's happening and i cannot see any kind of 
problems/error, it's going normal, what i can see it's the process getting 
stucked.

Original comment by hggo...@gmail.com on 22 Jan 2013 at 2:34

GoogleCodeExporter commented 9 years ago

Another important one, all services related with Samba (smbd, nmbd, swat) when 
executed  go to STAT D and get stucked, so i guess it's not any kind of 
specific config on smb.conf.

Original comment by hggo...@gmail.com on 22 Jan 2013 at 3:11

GoogleCodeExporter commented 9 years ago

I'm not familiar with Samba, so, to reproduce the problem at home, I have to 
consider all choices.

Original comment by lly.dev on 22 Jan 2013 at 3:52

GoogleCodeExporter commented 9 years ago

That's correct lly, good luck :-)

Original comment by hggo...@gmail.com on 22 Jan 2013 at 4:29

BackupGGCode / wl500g

ext4: Infinite wait on 'sync_page' #373