Open GoogleCodeExporter opened 9 years ago
I will try r4667 next, if it works ok i will try > r4667 until the problem
exist again, that way will be easier to identify the issue based on version
build changelog.
Thanks.
Original comment by hggo...@gmail.com
on 4 Jan 2013 at 3:13
I can't reproduce such behavior, so will wait until you able to provide console
logs and/or revision bisect.
Original comment by lly.dev
on 4 Jan 2013 at 4:57
[deleted comment]
r4667 works perfectly aswell, next try will be r470x.
Original comment by hggo...@gmail.com
on 5 Jan 2013 at 6:03
I'll suggest revision bisect - i.e. r4725, after that r4752(up) or r4695(down),
etc...
Original comment by lly.dev
on 6 Jan 2013 at 3:32
Roger that, i will let you know soon, thanks. ;-)
Original comment by hggo...@gmail.com
on 6 Jan 2013 at 11:37
[deleted comment]
I decided testing r4705 and it works just fine :-) that means the issue exists
between r4705 and r4770 version, tomorrow i will try r4725. Btw, in case i
found the version where the problem shows what console logs you want/need?
Syslog? Problem is when i reboot router log will not show. ;-)
Original comment by hggo...@gmail.com
on 7 Jan 2013 at 4:43
Due to syslog turned off on reboot, serial console logs can be useful only.
Original comment by lly.dev
on 7 Jan 2013 at 7:40
Typo: I installed r4709 not r4705.
I decided testing r4709 and it works just fine :-) that means the issue exists
between r4709 and r4770 version, tomorrow i will try r4725. Btw, in case i
found the version where the problem shows what console logs you want/need?
Syslog? Problem is when i reboot router log will not show. ;-)
Original comment by hggo...@gmail.com
on 7 Jan 2013 at 2:42
Tested r4725 and r4717 same situation. Problem exists between r4709 and r4717,
i believe it's related to ext4 filesystem (based on changelogs), that's the
filesystem im using.
I need r4710=>r4716 to be sure which one is causing it.
Thanks
Original comment by hggo...@gmail.com
on 8 Jan 2013 at 4:35
I need your router model to prepare test build.
Please try to manually unmount ext4 volumes before reboot. Will umount
succeeded?
Original comment by lly.dev
on 9 Jan 2013 at 12:29
Model is RT-N16. I think i cannot umount it even if i kill all services running
on the boot HDD.
Original comment by hggo...@gmail.com
on 9 Jan 2013 at 3:17
Yes, you have to stop external services. Unfortunately, I can't reproduce the
problem with ext4 fs. Are you sure that jbd2/ext4 modules corresponds kernel
version?
Original comment by lly.dev
on 10 Jan 2013 at 7:11
Sure i have, modules are fine, they always match with the fw version installed.
I cannot understand the reason why you cannot reproduce it. I dont have serial
console on this router so i guess it will be hard to identity the problem...
Original comment by hggo...@gmail.com
on 10 Jan 2013 at 8:32
What version have you tested? Have you used a boot script? If i connect HDDS
only (without services booting on that device) i can reboot also, the problem
exists only when the HDD is running a boot script / launching services.
Original comment by hggo...@gmail.com
on 10 Jan 2013 at 8:47
Could it be related with swap partition on the HDD? On my boot script i have
#swapon /dev/sda2
Original comment by hggo...@gmail.com
on 10 Jan 2013 at 9:05
I should test r4710=>r4716 to be sure which one is causing it, can you supply
those versions to me? Thanks.
Original comment by hggo...@gmail.com
on 10 Jan 2013 at 9:14
Please be patient, we unable to provide 24h support.
Original comment by lly.dev
on 10 Jan 2013 at 3:59
[deleted comment]
Im pacient, no hurry's. Why you say that? Just because i posted 4 times? Take
your time...
Original comment by hggo...@gmail.com
on 10 Jan 2013 at 4:34
I apologize that most of answers are trivial:
1) All of us, except special cases, always test current trunk
2) I use fstab & trivial post-* scripts. No large SysV5 rc startup scripts
3) I don't think that problem related with swap, but can't guarantee it
4) Since you hasn't console, you have to test unmount procedure - it can be
done without stop of system services like syslog etc.
5) My test-case is simple - mount ext4 volume, write+read some data, unmount,
reboot
r4714 for test:
http://wl500g.googlecode.com/files/RT-N16-1.9.2.7-rtn-r4714.trx
http://wl500g.googlecode.com/files/modules-1.9.2.7-rtn-r4714-MIPS32r2.tgz
Original comment by lly.dev
on 12 Jan 2013 at 9:33
No problem ;)
Probably r4714 is ok, the problem could start at r4710,11,12,13,14,15,16. Your
test on ext4 filesystem should be on higher versions like i reported before.
But im going to test r4714 and get back to you ;-)
Thanks
Original comment by hggo...@gmail.com
on 12 Jan 2013 at 4:06
[deleted comment]
I guess i was right, r4714 works just fine, it means its r4715 or r4716 causing
the problem. I need both to test them and see which one is creating this
situation.
I've tested last FW (r4788) and the problem exists again, so the conclusion is
this:
< or = r4714 => ok
> or = r4717 => does not reboot / shutdown
Thanks once again.
Original comment by hggo...@gmail.com
on 13 Jan 2013 at 4:08
From what i've seen on changelogs (r4715 and r4716) i bet it's the last one:
kernel: ext4: wait for writeback to complete while making pages writable
Answering your statements:
1) So if is tested how no one noticed this problem? Not using ext4 filesystem?
2) Also.
3) Neither do i.
4) I could never umount the boot drive (drive or resource busy) i cannot find
what is using the drive, i killed all processes...
5) Answered on 4.
Original comment by hggo...@gmail.com
on 13 Jan 2013 at 4:32
Depends on issue. For this issue I test USB flash with ext4 filesystem, as I
wrote it before.
You use external USB drive as root fs? It will be "boot drive" in such case
only. Otherwise, it is ordinary fs. Please show your mount list.
You can find process which uses mount point with "fuser" utility, "fuser -c
/mnt" for example.
For r4717 you should refresh modules on top of r4714 -
http://wl500g.googlecode.com/files/modules-1.9.2.7-rtn-r4717-MIPS32r2.tgz
Original comment by lly.dev
on 13 Jan 2013 at 2:42
Yes i'm using USB drive as root fs.
/dev/sda1 56467292 13991048 39607872 26% /opt => root fs
/dev/sdb1 960643212 838166384 73678992 92% /tmp/mnt => ordinary fs
/dev/sdc1 960649188 891470768 20376556 98% /tmp/mnt2 => ordinary fs
I already tested r4717 as you can see on my previous posts, it does not reboot.
The versions i need for testing are r4715 and r4716.
After killing all process i see this with fuser:
USER PID ACCESS COMMAND
/opt: root kernel mount /opt
Original comment by hggo...@gmail.com
on 13 Jan 2013 at 7:15
With 99% probability it is due to r4716
Are you able to remount rootfs read-only?
External rootfs can uncover a lot of problems. I will try to reproduce it...
P.S. Anyway, you should write about USB rootfs at issue creation.
Original comment by lly.dev
on 14 Jan 2013 at 8:42
I'm pretty sure it's r4716 too.
Never try that, but i'm sure i can.
Why you say that? I always used external rootfs and never seen nothing so far.
Ok good luck, you will reproduce it for sure...
Original comment by hggo...@gmail.com
on 14 Jan 2013 at 2:57
Due to ext4 can't be used as external rootfs without modifying "linuxrc" script
and/or firmware image or special custom scripts. Since you use unmodified FW,
seems to be you use custom script, please provide it (by pastebin.com, email,
etc. - depends on it's size).
Original comment by lly.dev
on 15 Jan 2013 at 8:09
Well that's really a surprise for me, since i ALWAYS used ext4 and never got a
problem with it...
Original comment by hggo...@gmail.com
on 15 Jan 2013 at 11:09
I can now umount /opt manually, it's not a problem anymore. But r4716 or
greater has the same issue, no reboot, the issue continues, that means it's not
related. Syslog does not shows nothing, can you suspend r4716?
Original comment by hggo...@gmail.com
on 19 Jan 2013 at 3:57
You are the single who experience such problem. Unless we can determine root of
the problem, I don't want to revert r4716. Additionally, it has positive
comments from some customers.
The next step, I guess, is try to unmount all ext4 volumes with reboot
following.
Original comment by lly.dev
on 19 Jan 2013 at 6:56
I can i try that and let you know, but i cannot see any positive comments on
r4716. Tks
Original comment by hggo...@gmail.com
on 19 Jan 2013 at 12:40
Finally i found the problem, samba service(smbd and nmbd)get stucked on >r4616,
the process cannot be killed that's why reboot doesn't work:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 691 0.0 0.8 10332 1048 ? Ds 04:25 0:00 /opt/sbin/nmbd
-D
root 693 0.0 1.3 18396 1664 ? Ds 04:25 0:00 /opt/sbin/smbd
-D
< r4616:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 694 0.0 0.7 10327 1048 ? S 03:21 0:00 /opt/sbin/nmbd
-D
root 696 0.0 1.2 17315 1664 ? S 03:21 0:00 /opt/sbin/smbd
-D
STAT Ds => Cannot be killed then no reboot is possible
STAT S => Works just fine
Question here is why >r4616 does samba service work on STAT Ds and not on S?
Original comment by hggo...@gmail.com
on 20 Jan 2013 at 4:46
The major question I forgot to ask you - you are using Optware or Entware?
Original comment by lly.dev
on 20 Jan 2013 at 6:14
I'm using Optware.
Original comment by hggo...@gmail.com
on 20 Jan 2013 at 2:36
It might be artifacts of ancient uClibc used in Optware.
1) check proper usage of LD_LIBRARY_PATH (unset it best of all)
2) find on which event samba is waiting waiting - "cat /proc/<pid>/wchan"
Original comment by lly.dev
on 20 Jan 2013 at 3:45
1) I never set LD_LIBRARY_PATH
2) [root@proxy /]$ cat /proc/852/wchan
sync_page
[root@proxy /]$
Original comment by hggo...@gmail.com
on 20 Jan 2013 at 8:35
So far i cannot get rid of this Samba problem on >r4716, the only way was using
<r4716.
I need to understand what's inside of r4716 that's causing this on Samba.
Original comment by hggo...@gmail.com
on 21 Jan 2013 at 2:44
Investigations will take some time, I can't answer to your question at now.
Original comment by lly.dev
on 21 Jan 2013 at 4:02
Sure no problem, when you've got the time :-) i will stay with r4714 in the
meantime.
Thanks.
Original comment by hggo...@gmail.com
on 21 Jan 2013 at 5:36
Another soluction to this problem would be using EXT3 filesystem instead of
EXT4...
Original comment by hggo...@gmail.com
on 21 Jan 2013 at 5:44
Which version of samba your use? In case of samba 3, is "use sendfile"
parameter set?
To achieve problem, is it enough to start samba without accessing network
shares?
Original comment by lly.dev
on 22 Jan 2013 at 11:46
I always used version 3, 3.5 and now 3.6. On < r4716 it always worked well (and
it's working) after >r4716 the problem appeared on ANY version of Samba or
config setting, even with or without sendfile option.
Well, i think so, never try that, but i bet the same will happen i can try and
then get back to you, i really don't know if it justify's the time "wasted"
trying to solve this knowing that's r4716 causing it, but we can try exaust all
options to achieve the problem, you can count me in on this.
Thanks once again lly
Original comment by hggo...@gmail.com
on 22 Jan 2013 at 2:28
Forget to tell you, if you use "smbd -i (interactive mode) -d 9 (debug level
9)" you will see everything what's happening and i cannot see any kind of
problems/error, it's going normal, what i can see it's the process getting
stucked.
Original comment by hggo...@gmail.com
on 22 Jan 2013 at 2:34
Another important one, all services related with Samba (smbd, nmbd, swat) when
executed go to STAT D and get stucked, so i guess it's not any kind of
specific config on smb.conf.
Original comment by hggo...@gmail.com
on 22 Jan 2013 at 3:11
I'm not familiar with Samba, so, to reproduce the problem at home, I have to
consider all choices.
Original comment by lly.dev
on 22 Jan 2013 at 3:52
That's correct lly, good luck :-)
Original comment by hggo...@gmail.com
on 22 Jan 2013 at 4:29
Original issue reported on code.google.com by
hggo...@gmail.com
on 4 Jan 2013 at 2:01