berkeli / immersive-go

Troubleshooting Project 2 #26

Open berkeli opened 2 years ago

berkeli commented 2 years ago

https://docs.google.com/document/d/1V6HEu_OcJ3MHH-aHzUfANf06VJa1rPcGHcpBwql7QLA/edit#heading=h.h9hu29mv2qa1

berkeli commented 2 years ago
  1. Identify which partition on the disk is full

To check disks and partitions I ran the command df -h which gave me the following table:

[berkeli@ip-172-31-81-17 ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        474M     0  474M   0% /dev
tmpfs           483M     0  483M   0% /dev/shm
tmpfs           483M  464K  483M   1% /run
tmpfs           483M     0  483M   0% /sys/fs/cgroup
/dev/xvda1      8.0G  8.0G  7.9M 100% /
tmpfs            97M     0   97M   0% /run/user/1000
tmpfs            97M     0   97M   0% /run/user/1002

The table shows that the full partition is /dev/xvda1, mounted at /, so that is where I need to free up space.
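The same check can be scripted rather than read off the table by eye. The following is a minimal sketch, not part of the original exercise: over_threshold is a hypothetical helper name, and the 90% limit is an arbitrary assumption. It reads df -P output (the POSIX format, one line per filesystem) and prints every mount point at or above the limit. Mount points containing spaces would confuse the field splitting.

```shell
# Hypothetical helper: reads `df -P` output on stdin and prints the mount
# points whose usage is at or above the given percentage.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)                     # "100%" -> "100"
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Usage: list filesystems that are at least 90% full (90 is an assumed limit).
df -P | over_threshold 90
```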

  1. Identify the biggest file on the system if any
    • To find the biggest file, I went to the root of the filesystem (cd /) and checked the disk usage of each top-level directory: du -sh /*
    • -s summarizes each argument as a single total; for now I only want per-directory totals to see which directory might hold the largest file
    • -h displays sizes in human-readable units (K, M, G)
    • /* expands to every entry in the root directory (/)
    • This gave me a lot of "Permission denied" errors, so I ran it again with the errors suppressed: du -sh /* 2>/dev/null
    • 2>/dev/null redirects file descriptor 2 (standard error) to /dev/null, which discards the error messages. The result was a bit surprising: the directory sizes do not add up to the ~8GB of disk usage I saw earlier:
      [berkeli@ip-172-31-81-17 /]$ du -sh /* 2>/dev/null
      0   /bin
      26M /boot
      0   /dev
      18M /etc
      20K /home
      0   /lib
      0   /lib64
      0   /local
      0   /media
      0   /mnt
      112K    /opt
      0   /proc
      0   /root
      456K    /run
      0   /sbin
      0   /srv
      0   /sys
      4.0K    /tmp
      1.1G    /usr
      503M    /var
    • To verify, I checked the disk usage for the root directory du -sh / 2>/dev/null which showed me that disk usage is only 1.6GB
      [berkeli@ip-172-31-81-17 /]$ du -sh / 2>/dev/null
      1.6G    /
    • To make sure I'm not excluding any files, I ran the command again with sudo access
      [berkeli@ip-172-31-81-17 /]$ sudo du -sh / 2>/dev/null
      [sudo] password for berkeli: 
      1.8G    /

      This showed slightly higher disk usage, but nowhere near enough to explain 100% usage. I stopped looking for a large file at this point, as the requirements suggest the space might be held by a process.
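For the directory survey above, sorting the totals makes the largest entries stand out. This is a sketch under assumptions: biggest_dirs is a hypothetical helper name, and the flags used (-x to stay on one filesystem, -k for kilobytes, -d 1 for one level of depth) are taken from GNU du rather than from the original commands. Like any du run, it can only count files that are still reachable through a path.

```shell
# Hypothetical helper: print the largest entries directly under a directory.
#   $1 = directory to inspect, $2 = number of lines to show (default 10)
biggest_dirs() {
    dir=$1
    count=${2:-10}
    # -x: do not cross into other filesystems (skips /proc, /sys, ...)
    # -k: sizes in KiB so that a plain numeric sort works
    # -d 1: report the directory itself plus its immediate children
    # 2>/dev/null: discard "Permission denied" messages on stderr
    du -xk -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Example (run with sudo to include directories my user cannot read):
#   biggest_dirs / 10
```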

  2. File may be held open by a process. To check for this I used lsof (list open files). Initially this gave me a huge list of files with no obvious clues. After reading through the lsof options and searching online, I learned that a deleted file keeps occupying disk space for as long as some process still has it open.

To verify this, I found the +|-L [l] option, which filters open files by link count. In my case: lsof +L1, which lists open files with a link count below 1, i.e. files that were unlinked (deleted) but are still held open.

This seemed promising: the size matches the disk usage, and the process is even called findme :)
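The effect is easy to reproduce on a small scale. The sketch below is my own illustration, not the findme script: it assumes Linux with /proc mounted, uses file descriptor 3 and a 1 MiB file as arbitrary choices, and shows that after rm the directory is empty while the data is still reachable through the open descriptor.

```shell
# Reproduce a "deleted but still open" file in miniature (Linux, needs /proc).
work=$(mktemp -d)
exec 3>"$work/held"                       # open fd 3 for writing
head -c 1048576 /dev/zero >&3             # write 1 MiB through it
rm -f "$work/held"                        # remove the name; fd 3 keeps the data

visible=$(ls -A "$work" | wc -l)          # entries left in the directory
link_target=$(readlink "/proc/$$/fd/3")   # path followed by " (deleted)"
held_bytes=$(wc -c < "/proc/$$/fd/3")     # size still held by the descriptor

echo "visible entries: $visible"
echo "fd 3 -> $link_target"
echo "bytes still held: $held_bytes"

exec 3>&-                                 # closing the fd releases the space
rmdir "$work"
```

This is the same situation du and df disagreed about: du walks directory entries and finds nothing, while df counts the blocks that are still allocated.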

I decided to understand what it is before killing the process:

  1. I ran ps -p 3483 to get a bit more details about the process

    [berkeli@ip-172-31-81-17 sbin]$ ps -p 3483
    PID TTY          TIME CMD
    3483 ?        00:00:01 findme

    Nothing unusual or interesting here, except that it was launched with the command findme.

  2. In Linux, the which command shows where a command's executable lives on the PATH:

    [berkeli@ip-172-31-81-17 sbin]$ which findme
    /usr/sbin/findme

    This showed me the location of the executable.

  3. I checked the type of the file with the file command:

    [berkeli@ip-172-31-81-17 sbin]$ file findme
    findme: POSIX shell script, ASCII text executable
  4. It's a shell script! Let's check the source code:

    [berkeli@ip-172-31-81-17 sbin]$ cat findme
    #!/bin/sh
    set -e
    TMP="$(mktemp)"
    exec 3>"\$TMP"
    dd bs="1M" count="9000" if="/dev/zero" of="\$TMP" || :
    rm -f "\$TMP"
    while true; do sleep 10; done
  5. It seems to be a script that ends in an infinite loop, so it should be safe to kill the process with sudo kill -9 3483

    • -9 is the signal number, in this case SIGKILL
    • 3483 is the process ID of findme which we found out earlier.
  6. I then verified disk usage again:

    [berkeli@ip-172-31-81-17 sbin]$ df -h
    Filesystem      Size  Used Avail Use% Mounted on
    devtmpfs        474M     0  474M   0% /dev
    tmpfs           483M     0  483M   0% /dev/shm
    tmpfs           483M  408K  483M   1% /run
    tmpfs           483M     0  483M   0% /sys/fs/cgroup
    /dev/xvda1      8.0G  1.9G  6.2G  24% /
    tmpfs            97M     0   97M   0% /run/user/1002

    We now have 6.2GB of free space, yay!
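A note on the kill step: SIGKILL cannot be caught, so the target gets no chance to clean up. A gentler habit is to send SIGTERM first and only escalate if the process is still there. The following is a minimal sketch of that idea, assuming a hypothetical helper name (stop_process) and an arbitrary grace period of about 5 seconds; it is not what I ran above.

```shell
# Hypothetical helper: ask a process to exit, escalate to SIGKILL if needed.
#   $1 = PID to stop
stop_process() {
    pid=$1
    kill -TERM "$pid" 2>/dev/null || return 0    # already gone (or not ours)
    tries=0
    while [ "$tries" -lt 5 ]; do                 # assumed grace period: ~5 s
        kill -0 "$pid" 2>/dev/null || return 0   # signal 0 only tests existence
        sleep 1
        tries=$((tries + 1))
    done
    kill -KILL "$pid" 2>/dev/null || :           # last resort, cannot be caught
}

# Example (with the PID found via lsof):
#   stop_process 3483
```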

berkeli commented 2 years ago

Additional notes:

  1. Based on notes from Radha, I decided to relaunch findme and try to free up the space without killing it. After launching the program in detached mode, the disk was full again:

    [berkeli@ip-172-31-81-17 8308]$  sudo lsof +L1
    COMMAND    PID USER   FD   TYPE DEVICE   SIZE/OFF NLINK    NODE NAME
    systemd-j 1740 root  txt    REG  202,1     325536     0  199663 /usr/lib/systemd/systemd-journald (deleted)
    systemd-l 2579 root  txt    REG  202,1     606928     0  199665 /usr/lib/systemd/systemd-logind (deleted)
    sh        8308 root    3w   REG  202,1 6594494464     0 8409400 /usr/bin/$TMP (deleted)
    sleep     9072 root    3w   REG  202,1 6594494464     0 8409400 /usr/bin/$TMP (deleted)
  2. Linux exposes each process's open file descriptors as symbolic links under /proc/<PID>/fd, so they can be inspected with ordinary commands.

    • I ran sudo ls -lh /proc/8308/fd which showed me the files related to this process.
    • There's only 1 deleted file in the list:
      [berkeli@ip-172-31-81-17 /]$ sudo ls -lh /proc/8308/fd
      total 0
      lrwx------ 1 root root 64 Nov  9 14:45 0 -> /dev/pts/0
      lrwx------ 1 root root 64 Nov  9 14:45 1 -> /dev/pts/0
      lrwx------ 1 root root 64 Nov  9 14:45 2 -> /dev/pts/0
      lr-x------ 1 root root 64 Nov  9 14:45 255 -> /usr/sbin/findme
      l-wx------ 1 root root 64 Nov  9 14:45 3 -> /usr/bin/$TMP (deleted)
  3. To free up the space we can truncate the file to 0 bytes through its file descriptor with the following command: :>/proc/8308/fd/3

  4. Unfortunately that gave me a permission denied error, and prefixing it with sudo didn't help: the redirection is performed by my own unprivileged shell, not by the command that sudo runs.

  5. To resolve this, I ran the whole command, redirection included, in a root shell via sudo sh -c ':>/proc/8308/fd/3', which freed the space:

[berkeli@ip-172-31-81-17 /]$ sudo sh -c ':>/proc/8308/fd/3'
[berkeli@ip-172-31-81-17 /]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        474M     0  474M   0% /dev
tmpfs           483M     0  483M   0% /dev/shm
tmpfs           483M  408K  483M   1% /run
tmpfs           483M     0  483M   0% /sys/fs/cgroup
/dev/xvda1      8.0G  1.9G  6.2G  24% /
tmpfs            97M     0   97M   0% /run/user/1002
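The truncation trick can be tried safely without root on a file of my own. This is a sketch under assumptions (Linux with /proc, file descriptor 3 and a 1 MiB file chosen arbitrarily), not the exact commands from the exercise: it holds a deleted file open, truncates it through /proc, and compares the size before and after while the descriptor stays open.

```shell
# Truncate a deleted-but-open file through /proc without closing it.
work=$(mktemp -d)
exec 3>"$work/held"                    # open fd 3 for writing
head -c 1048576 /dev/zero >&3          # fill it with 1 MiB
rm -f "$work/held"                     # delete the name; data is still held

before=$(wc -c < "/proc/$$/fd/3")      # size while held open
: > "/proc/$$/fd/3"                    # reopen via /proc with truncation
after=$(wc -c < "/proc/$$/fd/3")       # size after truncation

echo "before=$before after=$after"

exec 3>&-
rmdir "$work"
```

One caveat: the owning process keeps its old write offset, so if it writes again later the file becomes sparse instead of growing from zero. findme only sleeps after its dd, so that does not matter here.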

Let's take a look at the findme script and what it does:

#!/bin/sh
set -e

The line above instructs the shell to exit as soon as a command fails (returns a non-zero exit status).
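A quick way to see the difference is to run the same two commands with and without set -e. This is just an illustrative sketch; the variable names are my own.

```shell
# Without set -e the shell carries on after a failing command.
without=$(sh -c 'false; echo "still running"')

# With set -e the shell exits at the first failure, so echo never runs.
with=$(sh -c 'set -e; false; echo "still running"') || with_status=$?

echo "without set -e: '$without'"
echo "with set -e:    '$with' (exit status ${with_status:-0})"
```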

TMP="$(mktemp)"

Here we create a variable TMP and assign it the output of the mktemp command, which creates an empty temporary file and prints its path.
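To check what mktemp actually creates, here is a small sketch (variable names are my own): plain mktemp gives a regular file, and mktemp -d is the variant that gives a directory.

```shell
tmp_file=$(mktemp)        # creates an empty regular file and prints its path
tmp_dir=$(mktemp -d)      # -d asks for a directory instead

[ -f "$tmp_file" ] && file_kind="regular file"
[ -d "$tmp_dir" ] && dir_kind="directory"
file_size=$(wc -c < "$tmp_file")

echo "$tmp_file is a $file_kind of $file_size bytes"
echo "$tmp_dir is a $dir_kind"

rm -f "$tmp_file"
rmdir "$tmp_dir"
```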

exec 3>"\$TMP"

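Here we open file descriptor 3 for writing and keep it open for the rest of the script. Because the dollar sign is escaped (\$TMP), the shell does not expand the variable: the file is literally named $TMP and is created in the current directory, which matches the /usr/bin/$TMP (deleted) entry in the lsof output above.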
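The effect of the escaped dollar sign can be checked in a scratch directory. This is my own sketch of the behaviour, not a copy of findme: it opens fd 3 the same way, writes one line through it, and then compares the file literally named $TMP with the file that mktemp created.

```shell
work=$(mktemp -d)
cd "$work" || exit 1

TMP=$(mktemp)                            # real temporary file, stays empty
exec 3>"\$TMP"                           # escaped: opens ./$TMP, not "$TMP"
echo hello >&3                           # 6 bytes go to the literal file
exec 3>&-

literal_bytes=$(wc -c < './$TMP')        # file named $TMP in this directory
real_bytes=$(wc -c < "$TMP")             # file created by mktemp

echo "literal \$TMP file: $literal_bytes bytes"
echo "mktemp file:        $real_bytes bytes"

rm -f './$TMP' "$TMP"
cd / && rmdir "$work"
```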

dd bs="1M" count="9000" if="/dev/zero" of="\$TMP" || :

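Here we call dd, which copies data in blocks: it tries to write 9000 blocks of 1M of zeros from /dev/zero into the file. That is more than the 8.0G disk can hold, so dd fails once the disk is full, and || : ignores that failure so that set -e does not stop the script.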
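Both halves of that line can be tried harmlessly. The sketch below is an illustration under assumptions: it writes only 4 KiB instead of 9000 MiB, and /nonexistent-dir is a made-up path used to force dd to fail.

```shell
# dd copies count blocks of bs bytes from if= to of=.
out=$(mktemp)
dd bs=1024 count=4 if=/dev/zero of="$out" 2>/dev/null   # 4 x 1 KiB of zeros
copied=$(wc -c < "$out")
rm -f "$out"
echo "dd wrote $copied bytes"

# `|| :` turns a failing dd into a success, so `set -e` does not abort.
# /nonexistent-dir is a hypothetical path chosen so that dd cannot open it.
survived=$(sh -c 'set -e
dd bs=1024 count=1 if=/dev/zero of=/nonexistent-dir/x 2>/dev/null || :
echo "kept going"')
echo "after failing dd: $survived"
```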