Azure / azure-storage-fuse

A virtual file system adapter for Azure Blob storage

No space left on device #1560

Open · sandip094 opened this issue 2 weeks ago

sandip094 commented 2 weeks ago

Which version of blobfuse was used?

Blobfuse2 2.3.2 (per the mount log later in this thread).

Which OS distribution and version are you using?

Oracle Linux Server 8.7 (per the same log).

What was the issue encountered?

Getting the below error after running for a few minutes on the RMAN backup:

```
released channel: C1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup plus archivelog command at 11/07/2024 13:09:32
ORA-19502: write error on file "/rman-backup/step/1/2024-10-20_0115/STEP_1736_1_m839hd9q_20241107.incr1c", block number 91441152 (block size=8192)
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 28: No space left on device
Additional information: 4294967295
Additional information: 1048576
```

Configuration file (/etc/blobfuse/blobfuseconfig.yaml):

```yaml
logging:
  type: base
  level: log_info
  max-file-size-mb: 32
  file-count: 10
  track-time: true

max-concurrency: 40

components:
  - libfuse
  - file_cache
  - azstorage

libfuse:
  default-permission: 0644
  attribute-expiration-sec: 120
  entry-expiration-sec: 120
  negative-entry-expiration-sec: 240
  ignore-open-flags: true

file_cache:
  path: /mnt/blobfusetmp
  timeout-sec: 20
  max-size-mb: 30720
  allow-non-empty-temp: true
  cleanup-on-start: true

azstorage:
  type: block
  account-name: xxxxx
  account-key: xxxxx
  mode: key
  container: xxxxx
```

Service file (/etc/systemd/system/blobfuse2.service):

```ini
[Unit]
Description=A virtual file system adapter for Azure Blob storage.
After=network-online.target
Requires=network-online.target

[Service]
User=oracle
Group=dba
Environment=BlobMountingPoint=/rman-backup
Environment=BlobConfigFile=/etc/blobfuse/blobfuseconfig.yaml
Environment=BlobCacheTmpPath=/mnt/blobfusetmp
Environment=BlobLogPath=/var/log/blobfuse
Type=forking
ExecStart=/usr/bin/blobfuse2 mount ${BlobMountingPoint} --config-file=${BlobConfigFile}
ExecStop=/usr/bin/blobfuse2 unmount ${BlobMountingPoint}
ExecStartPre=+/usr/bin/install -d -o oracle -g dba ${BlobCacheTmpPath}
ExecStartPre=+/usr/bin/install -d -o oracle -g dba ${BlobLogPath}
ExecStartPre=+/usr/bin/install -d -o oracle -g dba ${BlobMountingPoint}

[Install]
WantedBy=multi-user.target
```

Backup file sizes are as follows:

```
 28M  control01.ctl
8.1G  stepsysblob_step_1.dbf
743G  stepsysdata_step_1.dbf
4.6G  sysaux_step_1.dbf
801M  system_step_1.dbf
 20G  temp_step_1.dbf
 80G  undo_t1_step_1.dbf
101M  users_step_1.dbf
```

vibhansa-msft commented 2 weeks ago

"ORA-27061: waiting for async I/Os failed Linux-x86_64 Error: 28: No space left on device Additional information: 4294967295 Additional information: 1048576" : Kindly check the disk usage of "/mnt/blobfusetmp". Logs indicate the disk might be running out of space. I see you have kept 20 seconds as disk timeout and ~30GB disk space. If your application (RMAN in your case) generates more data than this limit in the given time frame the disk might just exhaust.

sandip094 commented 2 weeks ago

Hello @vibhansa-msft, I have this much temp space available (screenshot attached). So what is your recommendation? How does this calculation work, so I can change these settings ("20 seconds as disk timeout and ~30GB disk space")?

vibhansa-msft commented 2 weeks ago

```yaml
timeout-sec: 20
max-size-mb: 30720
```

The 30 GB space and 20-second timeout are what you have configured in the .yaml file. If you have 600+ GB of disk space available, you can increase the limit from 30 GB to, say, 100 GB, and also reduce the timeout from 20 to 0 or 2 seconds. The timeout is useful only when your application reads the same file again and again; if a process reads a file only once, keeping the timeout at 0 saves disk usage.

Also, Blobfuse deletes a file from the local cache only once all open handles for that file are closed. If your application does not close its handles, the file will remain in the cache until you unmount, and in such cases as well you will observe the disk filling up. If you suspect this, you can enforce a hard limit so that file-open calls start to fail once the disk reaches the configured capacity; see the sketch below.
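For illustration, a `file_cache` section along these lines reflects both suggestions (the values are examples to tune, and the `hard-limit` key is taken from blobfuse2's baseConfig.yaml; verify it against your version):

```yaml
file_cache:
  path: /mnt/blobfusetmp
  timeout-sec: 0        # evict as soon as all handles close; files read once don't linger
  max-size-mb: 102400   # ~100 GB cap, assuming 600+ GB free on the temp disk
  hard-limit: true      # fail opens/writes instead of growing past max-size-mb
  allow-non-empty-temp: true
  cleanup-on-start: true
```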

sandip094 commented 2 weeks ago

Hello @vibhansa-msft, after changing the mentioned values the backup still failed with a no-space error. Observations:

  1. /mnt becomes 100% full in no time
  2. /rman-backup shows 100G used, which it shouldn't (screenshots attached)

vibhansa-msft commented 2 weeks ago

How big is the backup you are trying to take? The 'df' command showing 100G in /rman-backup is not your container or data-upload size; it just shows the configured size of your temp-cache disk and its usage. As per this, your temp cache is 100% full, which means either the files are not being closed by RMAN or it is generating too much data in a short span of time. Can you enable debug logs and share the log file with us? That will make it easier to rule out the possibility that files are not being closed.

sandip094 commented 2 weeks ago

For some reason the debug log file is not getting generated:

```
[root@asose2e798c623453573167ad8162-db-1 bin]# cd /var/log/blobfuse/
[root@asose2e798c6273167ad8162-db-1 blobfuse]# ls -ltr
total 0
[root@asose2e798c623453573167ad8162-db-1 blobfuse]# cat /etc/blobfuse/blobfuseconfig.yaml | grep level
level: LOG_DEBUG
```

vibhansa-msft commented 1 week ago

If you have syslog filters installed, the logs will be in the '/var/log/blobfuse2.log' file; otherwise, by default they go to '/var/log/messages'. If you are using AKS, the logs might be directed to the pod directory created on the node.
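If a dedicated log file is preferred regardless of syslog filters, the base logger can be pointed straight at a file. A minimal sketch; the `file-path` key comes from baseConfig.yaml and the path shown is only an example (the directory must exist and be writable by the mounting user):

```yaml
logging:
  type: base                                   # write directly to a file, bypassing syslog
  level: log_debug
  file-path: /var/log/blobfuse/blobfuse2.log   # example path; without it a default location is used
```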

sandip094 commented 1 week ago

Hello @vibhansa-msft, please find the attached logs.

blobfuse.log

Regards Sandeep

vibhansa-msft commented 1 week ago

```
Nov 12 11:36:15 asose2e798c6273167ad8162-db-1 blobfuse2[3964550]: Error: fusermount3: entry for /rman-backup not found in /etc/mtab
Nov 12 11:36:15 asose2e798c6273167ad8162-db-1 blobfuse2[3964550]: exit status 1
Nov 12 11:36:31 asose2e798c6273167ad8162-db-1 blobfuse2[3964568]: [/rman-backup] LOG_CRIT [mount.go (432)]: Starting Blobfuse2 Mount : 2.3.2 on [Oracle Linux Server 8.7]
Nov 12 11:36:31 asose2e798c6273167ad8162-db-1 blobfuse2[3964568]: [/rman-backup] LOG_CRIT [mount.go (434)]: Logging level set to : LOG_WARNING
Nov 12 11:36:31 asose2e798c6273167ad8162-db-1 blobfuse2[3964568]: [/rman-backup] LOG_ERR [file_cache.go (239)]: FileCache: config error [tmp-path not set]
Nov 12 11:36:31 asose2e798c6273167ad8162-db-1 blobfuse2[3964568]: [/rman-backup] LOG_ERR [pipeline.go (69)]: Pipeline: error creating pipeline component file_cache [config error in file_cache error [tmp-path not set]]
Nov 12 11:36:31 asose2e798c6273167ad8162-db-1 blobfuse2[3964568]: [/rman-backup] LOG_ERR [mount.go (442)]: mount : failed to initialize new pipeline [config error in file_cache error [tmp-path not set]]
Nov 12 11:39:38 asose2e798c6273167ad8162-db-1 blobfuse2[3964744]: Error: directory is already mounted
```

This is the syslog file and it contains many logs besides blobfuse. The last few blobfuse entries I can see here are just about failing to mount due to an invalid path.

sandip094 commented 6 days ago

Hello @vibhansa-msft, my bad, I attached the wrong file earlier.

Got some more information around this one:

sandip094 commented 6 days ago

Attached the latest logs blobfuse3.zip

vibhansa-msft commented 5 days ago

If you are dealing with files as large as 800 GB then file-cache is not advised. Kindly migrate to the block-cache model and then try your workflow again.
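In config terms the migration essentially swaps `file_cache` out of the pipeline for `block_cache` (a minimal sketch; the sizes shown are placeholders to tune for the workload):

```yaml
components:
  - libfuse
  - block_cache
  - attr_cache
  - azstorage

block_cache:
  block-size-mb: 32    # size of each block staged in memory
  mem-size-mb: 4096    # total RAM budget for cached blocks
  prefetch: 80         # blocks to read ahead per file
  parallelism: 128     # worker threads for uploads/downloads
```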

sandip094 commented 5 days ago

Hello @vibhansa-msft, thanks for the suggestion. I have switched to the block-cache model and am getting a different error now. Attached the debug log: blobfuse2-block.log

Here is my config file:

```yaml
# Refer ./setup/baseConfig.yaml for full set of config parameters

allow-other: false

logging:
  type: base
  level: log_debug

components:
  - libfuse
  - block_cache
  - attr_cache
  - azstorage

libfuse:
  attribute-expiration-sec: 120
  entry-expiration-sec: 120
  negative-entry-expiration-sec: 240

block_cache:
  block-size-mb: 32
  mem-size-mb: 4096
  prefetch: 80
  parallelism: 128

attr_cache:
  timeout-sec: 7200

azstorage:
  type: block
  account-name: xx
  account-key: xx
  mode: key
  container: xx
```

```
[oracle@asose2e798c6273167ad8162-db-1 .blobfuse2]$ nproc
20
```

vibhansa-msft commented 4 days ago

How did you upload the files to your storage account?

```
Thu Nov 21 07:07:42 UTC 2024 : blobfuse2[288824] : [/rman-backup] LOG_ERR [block_cache.go (384)]: BlockCache::validateBlockList : Block size mismatch for step/archivelog/2024-11-21_0707_STEP_1788_1_ns3alqpj_20241121.arc [block: KWq2zj+jR0BVj1jYTi4BwQ==, size: 512]
Thu Nov 21 07:07:42 UTC 2024 : blobfuse2[288824] : [/rman-backup] LOG_ERR [libfuse_handler.go (712)]: Libfuse::libfuse_open : Failed to open step/archivelog/2024-11-21_0707_STEP_1788_1_ns3alqpj_20241121.arc [block size mismatch for step/archivelog/2024-11-21_0707_STEP_1788_1_ns3alqpj_20241121.arc]
```

I see block-cache is not able to open this file because the block-size in your config file is set to 32 MB while this particular file has a smaller block size. As of now, block-cache only works for files whose blocks on the backend are exactly the configured block size. If the objective of your workflow is just to read the file, mount blobfuse in read-only mode and it will skip this strict check. If you wish to overwrite the file, this might not work with block-cache for now, unless the file was created through block-cache in the first place.
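For the read-only case mentioned above, a minimal sketch (the `read-only` option appears in blobfuse2's baseConfig.yaml; verify against your version):

```yaml
# at the top level of the config file
read-only: true   # mount read-only so block-cache relaxes the strict block-size check
```

The same can also be requested with the `--read-only` flag on the `blobfuse2 mount` command line.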

sandip094 commented 1 day ago

Hello @vibhansa-msft, currently the block size for my Oracle files is 8192:

```
SQL> SELECT TABLESPACE_NAME, BLOCK_SIZE FROM DBA_TABLESPACES;

TABLESPACE_NAME   BLOCK_SIZE
---------------   ----------
STEPSYSBLOB             8192
STEPSYSDATA             8192
SYSAUX                  8192
SYSTEM                  8192
TEMP                    8192
UNDO_T1                 8192
USERS                   8192
```

And the CPU count is:

```
[oracle@asose2e798c6273167ad8162-db-1 .blobfuse2]$ nproc
20
```

```
[oracle@asose2e798c6273167ad8162-db-1 .blobfuse2]$ free -h
              total        used        free      shared  buff/cache   available
Mem:          157Gi        60Gi        92Gi       121Mi       4.5Gi        95Gi
```

So, with this, what should my config look like?

vibhansa-msft commented 1 day ago

As per the log below, there is a block in your file which is 512 bytes in size. If that were the last block, Blobfuse2 would have allowed it and the file open would have succeeded. But either it is an in-between block, or all the blocks in the file after it are of the smaller size, hence the open fails. You need to validate how this file was created in the first place.

```
[block: KWq2zj+jR0BVj1jYTi4BwQ==, size: 512]
```

mortenjoenby commented 1 day ago

@vibhansa-msft, what if different files use different block sizes? We are using Oracle RMAN to do the backups, and I believe the block size of the archived redo logs is 512 bytes, while for other files it is 8 KB.

mortenjoenby commented 1 day ago

We (I am working with @sandip094) have been using file-cache for quite some time now, but I am wondering when you would suggest using the streaming block-cache mode? We would like to use the same mode for ALL setups (we have quite a few), no matter the size of the database. I was looking at your "decision tree" here - https://github.com/Azure/azure-storage-fuse?tab=readme-ov-file#config-guide - and it seems that with very large files block-cache mode is the right choice, but I am not sure ...