Azure / azure-storage-fuse

A virtual file system adapter for Azure Blob storage
Other
659 stars 207 forks source link

Connection reset by peers when running Oracle RMAN backups #1306

Closed mortenjoenby closed 8 months ago

mortenjoenby commented 9 months ago

Which version of blobfuse was used?

2.1.2

Which OS distribution and version are you using?

CentOS 7.9 - 3.10.0-1160.36.2.el7.x86_64

If relevant, please share your mount command.

Mounted using a service script and config file. /usr/bin/blobfuse2 mount /rman-backup --config-file=/etc/blobfuse/blobfuseconfig.yaml

/etc/systemd/system/blobfuse2.service:

[Unit]
Description=A virtual file system adapter for Azure Blob storage.
After=network-online.target
Requires=network-online.target

[Service]
User=oracle
Group=dba
Environment=BlobMountingPoint=/rman-backup
Environment=BlobConfigFile=/etc/blobfuse/blobfuseconfig.yaml
Environment=BlobCacheTmpPath=/mnt/blobfusetmp
Environment=BlobLogPath=/var/log/blobfuse
Type=forking
ExecStart=/usr/bin/blobfuse2 mount ${BlobMountingPoint} --config-file=${BlobConfigFile}
ExecStop=/usr/bin/blobfuse2 unmount ${BlobMountingPoint}
ExecStartPre=+/usr/bin/install -d -o oracle -g dba ${BlobCacheTmpPath}
ExecStartPre=+/usr/bin/install -d -o oracle -g dba ${BlobLogPath}
ExecStartPre=+/usr/bin/install -d -o oracle -g dba ${BlobMountingPoint}

[Install]
WantedBy=multi-user.target

What was the issue encountered?

RMAN throws this error:

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on C1 channel at 01/04/2024 01:00:47
ORA-19504: failed to create file "/rman-backup/step/archivelog/2024-01-04_0100_STEP_145677_1_8d2fmukr_20240104.arc"
ORA-27041: unable to open file
Linux-x86_64 Error: 5: Input/output error
Additional information: 9

Checking the blobfuse2.log this happens when seeing this error:

Thu Jan  4 01:00:11 EST 2024 : blobfuse2[7728] : LOG_ERR [block_blob.go (524)]: BlockBlob::getAttrUsingList : blob step/archivelog/2024-01-04_0100_STEP_145677_1_8d2fmukr_20240104.arc does not exist
Thu Jan  4 01:00:11 EST 2024 : blobfuse2[7728] : LOG_INFO [file_cache.go (684)]: FileCache::CreateFile : file=step/archivelog/2024-01-04_0100_STEP_145677_1_8d2fmukr_20240104.arc, fd=13
Thu Jan  4 01:00:11 EST 2024 : blobfuse2[7728] : LOG_ERR [block_blob.go (524)]: BlockBlob::getAttrUsingList : blob step/archivelog/2024-01-04_0100_STEP_145677_1_8d2fmukr_20240104.arc does not exist
Thu Jan  4 01:00:11 EST 2024 : blobfuse2[7728] : LOG_INFO [block_blob.go (827)]: BlockBlob::calculateBlockSize : step/archivelog/2024-01-04_0100_STEP_145677_1_8d2fmukr_20240104.arc size 0, blockSize 268435456
Thu Jan  4 01:00:11 EST 2024 : blobfuse2[7728] : LOG_CRIT [block_blob.go (922)]: TimeTracker :: [BlockBlob::WriteFromFile] step/archivelog/2024-01-04_0100_STEP_145677_1_8d2fmukr_20240104.arc => 7.552239ms
Thu Jan  4 01:00:11 EST 2024 : blobfuse2[7728] : LOG_ERR [libfuse_handler.go (673)]: Libfuse::libfuse_open : Reset flags for open step/archivelog/2024-01-04_0100_STEP_145677_1_8d2fmukr_20240104.arc, fi.flags 10D002
Thu Jan  4 01:00:11 EST 2024 : blobfuse2[7728] : LOG_INFO [file_cache.go (978)]: FileCache::OpenFile : file=step/archivelog/2024-01-04_0100_STEP_145677_1_8d2fmukr_20240104.arc, fd=13
Thu Jan  4 01:00:12 EST 2024 : blobfuse2[7728] : LOG_INFO [file_cache.go (804)]: FileCache::isDownloadRequired : Need to re-download step/archivelog/2024-01-04_0100_STEP_145677_1_8d2fmukr_20240104.arc, but skipping as handle is already open
Thu Jan  4 01:00:12 EST 2024 : blobfuse2[7728] : LOG_INFO [file_cache.go (978)]: FileCache::OpenFile : file=step/archivelog/2024-01-04_0100_STEP_145677_1_8d2fmukr_20240104.arc, fd=14
Thu Jan  4 01:00:12 EST 2024 : blobfuse2[7728] : LOG_INFO [block_blob.go (827)]: BlockBlob::calculateBlockSize : step/archivelog/2024-01-04_0100_STEP_145677_1_8d2fmukr_20240104.arc size 2541146112, blockSize 4194304
Thu Jan  4 01:00:13 EST 2024 : blobfuse2[7728] : LOG_CRIT [block_blob.go (922)]: TimeTracker :: [BlockBlob::WriteFromFile] step/archivelog/2024-01-04_0100_STEP_145677_1_8d2fmukr_20240104.arc => 1.642222144s
Thu Jan  4 01:00:13 EST 2024 : blobfuse2[7728] : LOG_INFO [block_blob.go (827)]: BlockBlob::calculateBlockSize : step/archivelog/2024-01-04_0100_STEP_145677_1_8d2fmukr_20240104.arc size 2541146112, blockSize 4194304
Thu Jan  4 01:00:15 EST 2024 : blobfuse2[7728] : LOG_CRIT [block_blob.go (922)]: TimeTracker :: [BlockBlob::WriteFromFile] step/archivelog/2024-01-04_0100_STEP_145677_1_8d2fmukr_20240104.arc => 1.686913872s
Thu Jan  4 01:00:15 EST 2024 : blobfuse2[7728] : LOG_WARNING [lru_policy.go (441)]: lruPolicy::DeleteItem : File in under download step/archivelog/2024-01-04_0100_STEP_145677_1_8d2fmukr_20240104.arc
Thu Jan  4 01:00:37 EST 2024 : blobfuse2[7728] : LOG_WARNING [lru_policy.go (441)]: lruPolicy::DeleteItem : File in under download step/archivelog/2024-01-04_0100_STEP_145677_1_8d2fmukr_20240104.arc
Thu Jan  4 01:00:38 EST 2024 : blobfuse2[7728] : LOG_ERR [block_blob.go (700)]: BlockBlob::ReadToFile : Failed to download blob step/archivelog/2024-01-04_0100_STEP_145677_1_8d2fmukr_20240104.arc [read tcp 10.100.60.198:50323->20.209.163.1:443: read: connection reset by peer]
Thu Jan  4 01:00:38 EST 2024 : blobfuse2[7728] : LOG_CRIT [block_blob.go (701)]: TimeTracker :: [BlockBlob::ReadToFile] step/archivelog/2024-01-04_0100_STEP_145677_1_8d2fmukr_20240104.arc => 23.381677706s
Thu Jan  4 01:00:38 EST 2024 : blobfuse2[7728] : LOG_ERR [file_cache.go (917)]: FileCache::OpenFile : error downloading file from storage step/archivelog/2024-01-04_0100_STEP_145677_1_8d2fmukr_20240104.arc [read tcp 10.100.60.198:50323->20.209.163.1:443: read: connection reset by peer]
Thu Jan  4 01:00:39 EST 2024 : blobfuse2[7728] : LOG_ERR [libfuse_handler.go (703)]: Libfuse::libfuse_open : Failed to open step/archivelog/2024-01-04_0100_STEP_145677_1_8d2fmukr_20240104.arc [read tcp 10.100.60.198:50323->20.209.163.1:443: read: connection reset by peer]
Thu Jan  4 01:00:47 EST 2024 : blobfuse2[7728] : LOG_ERR [block_blob.go (524)]: BlockBlob::getAttrUsingList : blob step/archivelog/signet-prod-1-step-db-1.backend.mdm.stibosystems.com_step_archivelog_2024-01-04_0100_2244.log does notexist
Thu Jan  4 01:00:47 EST 2024 : blobfuse2[7728] : LOG_INFO [file_cache.go (684)]: FileCache::CreateFile : file=step/archivelog/signet-prod-1-step-db-1.backend.mdm.stibosystems.com_step_archivelog_2024-01-04_0100_2244.log, fd=12
Thu Jan  4 01:00:47 EST 2024 : blobfuse2[7728] : LOG_ERR [block_blob.go (524)]: BlockBlob::getAttrUsingList : blob step/archivelog/signet-prod-1-step-db-1.backend.mdm.stibosystems.com_step_archivelog_2024-01-04_0100_2244.log does not exist
Thu Jan  4 01:00:47 EST 2024 : blobfuse2[7728] : LOG_ERR [block_blob.go (524)]: BlockBlob::getAttrUsingList : blob step/archivelog/signet-prod-1-step-db-1.backend.mdm.stibosystems.com_step_archivelog_2024-01-04_0100_2244.log does not exist
Thu Jan  4 01:00:47 EST 2024 : blobfuse2[7728] : LOG_INFO [block_blob.go (827)]: BlockBlob::calculateBlockSize : step/archivelog/signet-prod-1-step-db-1.backend.mdm.stibosystems.com_step_archivelog_2024-01-04_0100_2244.log size 3217, blockSize 268435456
Thu Jan  4 01:00:47 EST 2024 : blobfuse2[7728] : LOG_CRIT [block_blob.go (922)]: TimeTracker :: [BlockBlob::WriteFromFile] step/archivelog/signet-prod-1-step-db-1.backend.mdm.stibosystems.com_step_archivelog_2024-01-04_0100_2244.log => 7.928641ms

Have you found a mitigation/solution?

No. Problem is intermittent, but the file in question is then not backed up which can be critical as we might not be able to restore without it.

Please share logs if available.

Snippet above is from the logs.

vibhansa-msft commented 9 months ago
Thu Jan  4 01:00:38 EST 2024 : blobfuse2[7728] : LOG_ERR [file_cache.go (917)]: FileCache::OpenFile : error downloading file from storage step/archivelog/2024-01-04_0100_STEP_145677_1_8d2fmukr_20240104.arc [read tcp 10.100.60.198:50323->20.209.163.1:443: read: connection reset by peer]

As per this connection is being reset from the backend server. Can you check on your storage account if there were any server errors or throttling happening during this period.

mortenjoenby commented 9 months ago

Hi @vibhansa-msft . I will of course check, but I think you remember the long "discussion" we had with the backend team as well back in September when we last had issues. They did not see any problems. But I will check. As mentioned we just switched a lot of VM's from using Azure Files to Blobfuse, and we then started having these issues.

mortenjoenby commented 9 months ago

@vibhansa-msft, I don't seem to see any retries in the logs for this. Shouldn't I be seeing this?

vibhansa-msft commented 9 months ago

If you enable the rest-logs then you will see all requests and responses going to storage. Here you will see logs saying "Try=1". If this number is other than 1 then you can figure out how many retries were done. After all retries if the operation still fails then only the failure will be returned back to caller where you see the logs like "failed to upload" or "failed to download" based on your application workflow. There will not be any retries at this stage, once REST calls have been declared to fail even after retries at their end.

mortenjoenby commented 8 months ago

Hi @vibhansa-msft . I am working with MS support as well on this. We see the problem several times a day which is really frustrating. Another example:

Wed Jan 17 17:00:07 UTC 2024 : blobfuse2[15320] : LOG_ERR [block_blob.go (516)]: BlockBlob::getAttrUsingList : blob step/controlfile/STEP_c-4255059984-20240117-23 does not exist
Wed Jan 17 17:00:07 UTC 2024 : blobfuse2[15320] : LOG_INFO [file_cache.go (684)]: FileCache::CreateFile : file=step/controlfile/STEP_c-4255059984-20240117-23, fd=13
Wed Jan 17 17:00:07 UTC 2024 : blobfuse2[15320] : LOG_ERR [block_blob.go (516)]: BlockBlob::getAttrUsingList : blob step/controlfile/STEP_c-4255059984-20240117-23 does not exist
Wed Jan 17 17:00:07 UTC 2024 : blobfuse2[15320] : LOG_INFO [block_blob.go (821)]: BlockBlob::calculateBlockSize : step/controlfile/STEP_c-4255059984-20240117-23 size 0, blockSize 268435456
Wed Jan 17 17:00:07 UTC 2024 : blobfuse2[15320] : LOG_CRIT [block_blob.go (916)]: TimeTracker :: [BlockBlob::WriteFromFile] step/controlfile/STEP_c-4255059984-20240117-23 => 8.263487ms
Wed Jan 17 17:00:07 UTC 2024 : blobfuse2[15320] : LOG_ERR [libfuse_handler.go (673)]: Libfuse::libfuse_open : Reset flags for open step/controlfile/STEP_c-4255059984-20240117-23, fi.flags 10D002
Wed Jan 17 17:00:07 UTC 2024 : blobfuse2[15320] : LOG_INFO [file_cache.go (978)]: FileCache::OpenFile : file=step/controlfile/STEP_c-4255059984-20240117-23, fd=13
Wed Jan 17 17:00:07 UTC 2024 : blobfuse2[15320] : LOG_INFO [file_cache.go (804)]: FileCache::isDownloadRequired : Need to re-download step/controlfile/STEP_c-4255059984-20240117-23, but skipping as handle is already open
Wed Jan 17 17:00:07 UTC 2024 : blobfuse2[15320] : LOG_INFO [file_cache.go (978)]: FileCache::OpenFile : file=step/controlfile/STEP_c-4255059984-20240117-23, fd=18
Wed Jan 17 17:00:07 UTC 2024 : blobfuse2[15320] : LOG_INFO [file_cache.go (991)]: FileCache::CloseFile : name=step/controlfile/STEP_c-4255059984-20240117-23, handle=0 dirty. Flushing the file.
Wed Jan 17 17:00:07 UTC 2024 : blobfuse2[15320] : LOG_INFO [block_blob.go (821)]: BlockBlob::calculateBlockSize : step/controlfile/STEP_c-4255059984-20240117-23 size 39321600, blockSize 268435456
Wed Jan 17 17:00:07 UTC 2024 : blobfuse2[15320] : LOG_CRIT [block_blob.go (916)]: TimeTracker :: [BlockBlob::WriteFromFile] step/controlfile/STEP_c-4255059984-20240117-23 => 368.606797ms
Wed Jan 17 17:00:07 UTC 2024 : blobfuse2[15320] : LOG_INFO [block_blob.go (821)]: BlockBlob::calculateBlockSize : step/controlfile/STEP_c-4255059984-20240117-23 size 39321600, blockSize 268435456
Wed Jan 17 17:00:08 UTC 2024 : blobfuse2[15320] : LOG_CRIT [block_blob.go (916)]: TimeTracker :: [BlockBlob::WriteFromFile] step/controlfile/STEP_c-4255059984-20240117-23 => 354.717419ms
Wed Jan 17 17:00:37 UTC 2024 : blobfuse2[15320] : LOG_WARNING [lru_policy.go (441)]: lruPolicy::DeleteItem : File in under download step/controlfile/STEP_c-4255059984-20240117-23
Wed Jan 17 17:01:00 UTC 2024 : blobfuse2[15320] : LOG_ERR [block_blob.go (692)]: BlockBlob::ReadToFile : Failed to download blob step/controlfile/STEP_c-4255059984-20240117-23 [read tcp 10.100.60.139:58822->20.60.7.36:443: read: connection reset by peer]
Wed Jan 17 17:01:00 UTC 2024 : blobfuse2[15320] : LOG_CRIT [block_blob.go (693)]: TimeTracker :: [BlockBlob::ReadToFile] step/controlfile/STEP_c-4255059984-20240117-23 => 52.03417987s
Wed Jan 17 17:01:00 UTC 2024 : blobfuse2[15320] : LOG_ERR [file_cache.go (917)]: FileCache::OpenFile : error downloading file from storage step/controlfile/STEP_c-4255059984-20240117-23 [read tcp 10.100.60.139:58822->20.60.7.36:443: read: connection reset by peer]
Wed Jan 17 17:01:00 UTC 2024 : blobfuse2[15320] : LOG_ERR [libfuse_handler.go (703)]: Libfuse::libfuse_open : Failed to open step/controlfile/STEP_c-4255059984-20240117-23 [read tcp 10.100.60.139:58822->20.60.7.36:443: read: connection reset by peer]
Wed Jan 17 17:01:02 UTC 2024 : blobfuse2[15320] : LOG_ERR [block_blob.go (516)]: BlockBlob::getAttrUsingList : blob step/archivelog/apple-qa-step-db-1.backend.mdm.stibosystems.com_step_archivelog_2024-01-17_1700_12969.log does not exist
Wed Jan 17 17:01:02 UTC 2024 : blobfuse2[15320] : LOG_INFO [file_cache.go (684)]: FileCache::CreateFile : file=step/archivelog/apple-qa-step-db-1.backend.mdm.stibosystems.com_step_archivelog_2024-01-17_1700_12969.log, fd=12
Wed Jan 17 17:01:02 UTC 2024 : blobfuse2[15320] : LOG_ERR [block_blob.go (516)]: BlockBlob::getAttrUsingList : blob step/archivelog/apple-qa-step-db-1.backend.mdm.stibosystems.com_step_archivelog_2024-01-17_1700_12969.log does not exist
Wed Jan 17 17:01:02 UTC 2024 : blobfuse2[15320] : LOG_ERR [block_blob.go (516)]: BlockBlob::getAttrUsingList : blob step/archivelog/apple-qa-step-db-1.backend.mdm.stibosystems.com_step_archivelog_2024-01-17_1700_12969.log does not exist
Wed Jan 17 17:01:02 UTC 2024 : blobfuse2[15320] : LOG_INFO [block_blob.go (821)]: BlockBlob::calculateBlockSize : step/archivelog/apple-qa-step-db-1.backend.mdm.stibosystems.com_step_archivelog_2024-01-17_1700_12969.log size 3587, blockSize 268435456
Wed Jan 17 17:01:02 UTC 2024 : blobfuse2[15320] : LOG_CRIT [block_blob.go (916)]: TimeTracker :: [BlockBlob::WriteFromFile] step/archivelog/apple-qa-step-db-1.backend.mdm.stibosystems.com_step_archivelog_2024-01-17_1700_12969.log => 8.459487ms

I was thinking if it might help to increase the file_cache timeout, which is currently set to 20 secs (default I think)?

Here's our config used on all deployments:

logging:
  type: base
  level: log_info
  max-file-size-mb: 32
  file-count: 10
  track-time: true
  max-concurrency: 16
components:
  - libfuse
  - file_cache
  - azstorage

libfuse:
  default-permission: 0644
  attribute-expiration-sec: 120
  entry-expiration-sec: 120
  negative-entry-expiration-sec: 240
  ignore-open-flags: true

file_cache:
  path: /mnt/blobfusetmp
  timeout-sec: 20
  max-size-mb: 30720
  allow-non-empty-temp: true
  cleanup-on-start: true

azstorage:
  type: block
  account-name: *******
  account-key: ******
  mode: key
  container: *****

Thanks.

mortenjoenby commented 8 months ago

Can you please explain the "writeFromFile"? Logically "write from" does not make sense :)

Tue Jan 16 17:30:12 UTC 2024 : blobfuse2[29528] : LOG_CRIT [block_blob.go (916)]: TimeTracker :: [BlockBlob::WriteFromFile] step/controlfile/STEP_c-4255136787-20240116-24 => 501.514435ms
Tue Jan 16 17:30:31 UTC 2024 : blobfuse2[29528] : LOG_ERR [block_blob.go (692)]: BlockBlob::ReadToFile : Failed to download blob step/controlfile/STEP_c-4255136787-20240116-24 [read tcp 10.100.4.137:26574->20.209.109.161:443: read: connection reset by peer]
Tue Jan 16 17:30:31 UTC 2024 : blobfuse2[29528] : LOG_CRIT [block_blob.go (693)]: TimeTracker :: [BlockBlob::ReadToFile] step/controlfile/STEP_c-4255136787-20240116-24 => 19.120678215s
Tue Jan 16 17:30:31 UTC 2024 : blobfuse2[29528] : LOG_ERR [file_cache.go (917)]: FileCache::OpenFile : error downloading file from storage step/controlfile/STEP_c-4255136787-20240116-24 [read tcp 10.100.4.137:26574->20.209.109.161:443: read: connection reset by peer]
Tue Jan 16 17:30:31 UTC 2024 : blobfuse2[29528] : LOG_ERR [libfuse_handler.go (703)]: Libfuse::libfuse_open : Failed to open step/controlfile/STEP_c-4255136787-20240116-24 [read tcp 10.100.4.137:26574->20.209.109.161:443: read: connection reset by peer]
vibhansa-msft commented 8 months ago

When application is writing to a file, blobfuse persist those changes to a local file. When application closes the handle of the file or call a flush() data will be synced to the container. At this point we have all the data in a local file and objective is to upload this file to container. WriteFromFile just means "write to storage container from a local file" which is equal to "upload local file".

mortenjoenby commented 8 months ago

So "readToFile" is of course the opposite then. Reading the file from the container and writing it as a local file - in the file_cache. Is that correct? So if the file had not been removed from cache - after 20 seconds - we might not see the issue? That was my thinking when asking about the timeout value. Still download shouldn't fail though ...

vibhansa-msft commented 8 months ago

so till the timeout is valid, file remains in cache and it will not be downloaded. Once the timeout expires and file is removed from cache (you will see a log in blobfuse logs about deleting the file from cache) then next open to that file will trigger the download. If one of the file handles for a given file is not closed then file will never be deleted from cache so application has to take that responsibility of closing all the handles it open for any file.

mortenjoenby commented 8 months ago

So, I am 100% sure that Oracle RMAN is handling this just fine. This is rock-solid technology that's been working for many years. You said previously that I should enable the rest logs. Is that just using "log_debug"? Or something else? With debug it's difficult to find what you really need.

mortenjoenby commented 8 months ago

Is it only the log-level that's dynamic? Or is that applicable to ALL config options when using the config file? I have set "fuse-trace: true" under libfuse section and "sdk-trace: true" under azstorage section as well, but does it take effect dynamically?

vibhansa-msft commented 8 months ago

This one is little tricky to understand. Things which are fundamental or supplied to other dependencies like libfuse or SDK can not be changed dynamically. Things that are well within the scope of blobfuse and are not used in calculations at the beginning are something dynamic in nature. To answer your question sdk-track and libfuse-trace are config that are supplied to other libs and not handled by blobfuse natively while log level of blobfuse is something handled by blobfuse natively and can be modified dynamically.

vibhansa-msft commented 8 months ago

rest logs are sdk-trace only.

mortenjoenby commented 8 months ago

Thanks @vibhansa-msft . I will restart the service which will then unmount and mount again. For reference the ongoing SR is 2401080050003148.

vibhansa-msft commented 8 months ago

Closing this here as there is no action item on blobfuse as of now, Feel free to reopen once we have some inputs from the backend team and you need any help from blobfuse team.