Azure / azure-storage-fuse

A virtual file system adapter for Azure Blob storage

Problems with multithreaded reading/writing via block_cache #1426

Open olegshv opened 1 month ago

olegshv commented 1 month ago

Which version of blobfuse was used?

blobfuse2 version 2.3.0

Which OS distribution and version are you using?

Debian GNU/Linux 11 (bullseye)

If relevant, please share your mount command.

blobfuse2 mount /opt/blobfuse/fuse_container/ --config-file=/opt/blobfuse/fuse_connection.yaml

My fuse_connection.yaml file

(screenshot attached)

What was the issue encountered?

A brief description of what we do with blobfuse2:

We are currently testing the transition to stream or block_cache. After a few days of testing we abandoned stream altogether, as we never managed to complete a backup successfully. block_cache gives us hope in our hearts :)

For example, here is a 3.6 TB database backup running in 5 parallel threads. It's OK (screenshots attached).

But here's an example with 15 parallel jobs, where there are several small databases and one large one. The load is not high, yet it failed (screenshot attached). There is no load, but we get an error.

  1. From Windows (screenshot attached)
  2. An error from SQL Server Management Studio: 1117(The request could not be performed because of an I/O device error.)

Unfortunately, after reading a lot of information and documentation, I did not find an answer on scaling and configuration for large loads. We ran tests even with large VMs (32 CPU, 256 GB RAM), but it didn't work. I would like to get some recommendations on how to solve this case in the most reliable way. Which block_cache settings are better for such tasks?

Have you found a mitigation/solution?

No

Please share logs if available.

  1. In log_warning we can see the following (screenshot attached)
  2. In log_debug:

    RESPONSE 404: 404 The specified blob does not exist.
    ERROR CODE: BlobNotFound

    Response contained no body


    <?xml version="1.0" encoding="utf-8"?><Error><Code>BlockListTooLong</Code><Message>The block list may not contain more than 50,000 blocks.
    RequestId:3a61884d-201e-0005-6461-b863c1000000
    Time:2024-06-06T22:31:44.8071145Z</Message></Error>

vibhansa-msft commented 1 month ago

How big is the file that you are trying to upload (the .bak file which failed)? The Azure Storage backend supports a max of 50K blocks per blob, while each individual block can range from a few bytes to 4 GB in size. As per your config, you have set block-size to 64 MB. You can do the math here: if the file size divided by 64 MB goes beyond 50K, it's bound to fail, and the only workaround is to increase the block size to a point where your file fits in 50K blocks.
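The arithmetic above can be sketched as follows (an illustrative helper, not part of blobfuse2; the 50,000-block limit and 3.6 TB file size come from this thread):

```python
import math

# Azure block blobs allow at most 50,000 blocks per blob.
MAX_BLOCKS = 50_000

def max_file_size_bytes(block_size_mb: int) -> int:
    """Largest blob that fits in 50,000 blocks at the given block size."""
    return block_size_mb * 1024 * 1024 * MAX_BLOCKS

def min_block_size_mb(file_size_bytes: int) -> int:
    """Smallest whole-MB block size that keeps the blob within 50,000 blocks."""
    return math.ceil(file_size_bytes / (MAX_BLOCKS * 1024 * 1024))

backup = int(3.6 * 1024**4)                 # a 3.6 TiB backup, in bytes
print(max_file_size_bytes(64) / 1024**4)    # ceiling with 64 MB blocks: ~3.05 TiB
print(min_block_size_mb(backup))            # 76 MB minimum, so round up to e.g. 128 MB
```

This shows why a 3.6 TB file fails with a 64 MB block size: the blob would need more than 50,000 blocks.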

vibhansa-msft commented 1 month ago

Also, you have enabled disk-based caching of blocks. Do you really need this, given that it will consume space on your local disk? If your workflow is to upload the data and then read it back, you should keep it; but if the job is only to upload data, you can get rid of disk-based caching completely.

olegshv commented 1 month ago

How big is the file that you are trying to upload (the .bak file which failed)? The Azure Storage backend supports a max of 50K blocks per blob, while each individual block can range from a few bytes to 4 GB in size. As per your config, you have set block-size to 64 MB. You can do the math here: if the file size divided by 64 MB goes beyond 50K, it's bound to fail, and the only workaround is to increase the block size to a point where your file fits in 50K blocks.

Hello! Thank you for your reply. This .bak file is about 3.6 TB, but I have already transferred 5 files of 3.6 TB before and it was successful. Did I understand correctly that I need to change the block-size to more than 64 MB?

Regarding the second question, our process is that we make database backups directly to the storage via blobfuse2 and restore them from there. We are still testing the best and fastest way to do this, so I would be very grateful for your advice on how best to proceed in this case. Let me remind you that files can be larger than 4 TB.

vibhansa-msft commented 1 month ago

With a 64 MB block size you can upload at most a 3.05 TB file. If your file sizes go beyond this, then you need to bump up the block-size. You can try 128 MB, which allows you to go beyond 4 TB. If backup and restore are done on the same box, then the disk cache will help. But considering the size of your files, I believe it's best to disable the disk cache completely (assuming you do not have 4 TB of free disk space to store all of the contents).

olegshv commented 1 month ago

With a 64 MB block size you can upload at most a 3.05 TB file. If your file sizes go beyond this, then you need to bump up the block-size. You can try 128 MB, which allows you to go beyond 4 TB. If backup and restore are done on the same box, then the disk cache will help. But considering the size of your files, I believe it's best to disable the disk cache completely (assuming you do not have 4 TB of free disk space to store all of the contents).

Understood, thank you! Unfortunately, the tests still fail with 128 MB, but I found the article ‘About block blobs’ and now I understand that there is still room for improvement. I will test with a 256+ MB block size and report back with the results; maybe someone from the community will be interested.

https://learn.microsoft.com/en-us/rest/api/storageservices/understanding-block-blobs--append-blobs--and-page-blobs#about-block-blobs (screenshot attached)

Could you please clarify the following questions?

  1. Do I need to add another parameter "block-size-mb:" under "azstorage:" in my case?
  2. If I understood and calculated correctly, then in our case, with a block size of 256 MB and simultaneous reading/writing of 15 files, the prefetch parameter should be set to a smaller value, for example 12: 256 × 15 × 12 = 46,080 MB (which is less than our allocated RAM). How exactly does this parameter affect the speed and reliability of data transfer?
  3. Would it be possible to leave the parallelism parameter at its standard value of 128? If my logic is correct, increasing this parameter simply increases the load on the CPU and potentially increases the data transfer rate over the network interface.
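The memory estimate in question 2 can be written as a back-of-the-envelope check (illustrative only; blobfuse2's internal cache accounting may differ from this simple product):

```python
def cache_budget_mb(block_size_mb: int, prefetch: int, parallel_files: int) -> int:
    """Rough upper bound on block_cache RAM use: each open file may hold
    up to `prefetch` blocks of `block_size_mb` in memory at once."""
    return block_size_mb * prefetch * parallel_files

# 256 MB blocks, prefetch 12, 15 concurrent files:
print(cache_budget_mb(256, 12, 15))  # 46080 MB; should stay below mem-size-mb
```

The result should stay below both the configured mem-size-mb and the VM's physical RAM.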
vibhansa-msft commented 1 month ago

Did the above config changes help in uploading the file?

olegshv commented 1 month ago

Hello, Vikas! Over these days I ran many experiments and found a stable configuration for our workloads.

Here is my configuration (it is used for 15-18 parallel write/read operations with files from 250 MB to 2.7 TB):

block_cache:
  block-size-mb: 256
  mem-size-mb: 56320
  prefetch: 20
  parallelism: 128 

Here is a graph visualising the test with these settings (screenshot attached).

I also ran into a strange problem: my jobs failed with the error "The block list may not contain more than 50,000 blocks." This is even though, by the formula "block-size-mb * 50,000", I didn't come close to the limit. I would like to ask: after restarting the service and starting all the backup jobs again, could that block list have continued to be filled and reached the 50,000-block limit? (Or could the lists have remained in RAM and simply continued to be written to, with new ones created; or, if the file name is the same, does the list continue to be filled?) Please help me clarify the reason for this.

Without going off topic, I would like to ask you if I understand architecture correctly.

If the structure I described above is correct, then the question arises again: how do we get the error "The block list may not contain more than 50,000 blocks." when transferring a file that is not even larger than 64 MB × 50,000 = 3.2 TB (screenshot attached)?

There is also a separate, very important question. When scaling horizontally (adding more VMs with blobfuse2), can we hit any limits on data transfer to the same Storage Account? From what I found in the Storage account limits article, there are default maximum ingress/egress limits (60 Gbps ingress / 120 Gbps egress) per Storage Account. At first glance, I don't see any more critical restrictions. I would be truly grateful for any additional information.
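As a rough way to reason about that ingress limit (a sketch; the 60 Gbps figure comes from the article mentioned above, while the per-VM upload rate is an assumption you would measure for your own workload):

```python
def max_vms_under_ingress(ingress_gbps: float, per_vm_gbps: float) -> int:
    """How many VMs can upload at full speed before the storage-account
    ingress limit becomes the bottleneck (throttling risk beyond this)."""
    return int(ingress_gbps // per_vm_gbps)

# Assumed sustained upload rate of ~12.5 Gbps per blobfuse2 VM:
print(max_vms_under_ingress(60, 12.5))  # 4
```

Past that point, adding VMs only divides the same aggregate bandwidth rather than adding throughput.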

I apologize for the possibly obvious questions, but it's critical for me to understand this topic for proper use. Thank you in advance!

vibhansa-msft commented 1 month ago
olegshv commented 1 month ago

Thank you for your answers! Today I set up blobfuse2 on one of the production servers. This time the behavior of blobfuse2 was interesting: errors occurred, but the network share continued to work. After a while, it dropped off. Please tell me what could be the reason and how to fix it? Blobfuse logs:

Wed Jun 12 22:54:51 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [azstorage.go (458)]: AzStorage::ReadInBuffer : Failed to read storage/backup/site1-Dump.backup.backup [numerical result out of range]
Wed Jun 12 22:54:51 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [block_cache.go (847)]: BlockCache::download : 3 attempts to download a block have failed 378=>storage/backup/site1-Dump.backup.backup (index 0, offset 0)
Wed Jun 12 22:54:51 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [block_cache.go (580)]: BlockCache::getBlock : Failed to download block 378=>storage/backup/site1-Dump.backup.backup (offset 0, index 0)
Wed Jun 12 22:54:51 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [block_cache.go (493)]: BlockCache::ReadInBuffer : Failed to get Block 378=>storage/backup/site1-Dump.backup.backup offset 0 [failed to download block]
Wed Jun 12 22:54:51 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [libfuse_handler.go (764)]: Libfuse::libfuse_read : error reading file storage/backup/site1-Dump.backup.backup, handle: 378 [failed to download block]
Wed Jun 12 22:56:36 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [block_blob.go (1462)]: BlockBlob::CommitBlocks : Failed to commit block list to blob storage/backup/mssql-site-diff.bak [PUT https://backupStorage.blob.core.windows.net/storage/backup/mssql-site-diff.bak
--------------------------------------------------------------------------------
RESPONSE 400: 400 One of the request inputs is not valid.
ERROR CODE: InvalidInput
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidInput</Code><Message>One of the request inputs is not valid.
RequestId:e9f99714-001e-0050-3502-bd194d000000
Time:2024-06-12T19:56:36.1653073Z</Message></Error>
--------------------------------------------------------------------------------
Wed Jun 12 22:56:55 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [block_blob.go (1462)]: BlockBlob::CommitBlocks : Failed to commit block list to blob storage/backup/mssql-site-diff.bak [PUT https://backupStorage.blob.core.windows.net/storage/backup/mssql-site-diff.bak
--------------------------------------------------------------------------------
RESPONSE 400: 400 One of the request inputs is not valid.
ERROR CODE: InvalidInput
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidInput</Code><Message>One of the request inputs is not valid.
RequestId:c077b8a3-001e-0032-5902-bddb6a000000
Time:2024-06-12T19:56:55.9598214Z</Message></Error>
--------------------------------------------------------------------------------

Wed Jun 12 22:57:19 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [block_cache.go (439)]: BlockCache::CloseFile : failed to flush file storage/backup/mssql-site-diff.bak
Wed Jun 12 22:57:19 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [libfuse_handler.go (867)]: Libfuse::libfuse_release : error closing file storage/backup/mssql-site-diff.bak, handle: 2036 [PUT https://backupStorage.blob.core.windows.net/storage/backup/mssql-site-diff.bak
--------------------------------------------------------------------------------
RESPONSE 400: 400 One of the request inputs is not valid.
ERROR CODE: InvalidInput
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidInput</Code><Message>One of the request inputs is not valid.
RequestId:bf187da8-301e-0074-4b02-bdefed000000
Time:2024-06-12T19:57:19.7806094Z</Message></Error>
--------------------------------------------------------------------------------
]
Wed Jun 12 22:57:31 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [block_blob.go (1462)]: BlockBlob::CommitBlocks : Failed to commit block list to blob Cstorage/backup/mssql-site2-diff.bak [PUT https://backupStorage.blob.core.windows.net/storage/backup/mssql-site2-diff.bak
--------------------------------------------------------------------------------
]
Wed Jun 12 22:57:31 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [block_cache.go (409)]: BlockCache::FlushFile : Failed to commit blocks for storage/backup/mssql-site2-diff.bak [PUT https://backupStorage.blob.core.windows.net/storage/backup/mssql-site2-diff.bak
--------------------------------------------------------------------------------
RESPONSE 400: 400 One of the request inputs is not valid.
ERROR CODE: InvalidInput
--------------------------------------------------------------------------------
]
Wed Jun 12 22:57:31 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [block_cache.go (1386)]: BlockCache::SyncFile : failed to flush file storage/backup/mssql-site2-diff.bak
Wed Jun 12 22:57:31 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [libfuse_handler.go (1062)]: Libfuse::libfuse_fsync : error syncing file storage/backup/mssql-site2-diff.bak [PUT https://backupStorage.blob.core.windows.net/storage/backup/mssql-site2-diff.bak
--------------------------------------------------------------------------------
]
Wed Jun 12 22:57:54 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [libfuse_handler.go (815)]: Libfuse::libfuse_flush : error flushing file /storage/backup/mssql-site2-diff.bak, handle: 2066 [PUT https://backupStorage.blob.core.windows.net/storage/backup/mssql-site2-diff.bak
--------------------------------------------------------------------------------
RESPONSE 400: 400 One of the request inputs is not valid.
ERROR CODE: InvalidInput
--------------------------------------------------------------------------------
]
Wed Jun 12 23:00:12 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [block_cache.go (1229)]: BlockCache::commitBlocks : Failed to commit blocks for storage/backup/mssql-site2-diff.bak [PUT https://backupStorage.blob.core.windows.net/storage/backup/mssql-site2-diff.bak
--------------------------------------------------------------------------------

TeamCity logs when it happened:

MSSQL error: The operating system returned the error '1117(The request could not be performed because of an I/O device error.)' while attempting 'FlushFileBuffers' on '\\my-blobfuse\backup\mssql-site2-diff.bak'.
22:56:55               BACKUP DATABASE is terminating abnormally.
23:06:27           pg_dump: error: could not write to output file: Input/output error
23:06:31   Error message: pg_dump: error: could not write to output file: Input/output error

My config (VM 16CPU 128RAM):

allow-other: true

logging:
    type: base
    level: log_warning
    file-path: '/var/log/blobfuse2/blobfuse2.log'

components:
    - libfuse
    - block_cache
    - attr_cache
    - azstorage

libfuse:
   default-permission: 0777
   attribute-expiration-sec: 240
   entry-expiration-sec: 240
   negative-entry-expiration-sec: 240

attr_cache:
   timeout-sec: 3600

azstorage:
    type: adls
    account-name: *****
    account-key: *****
    mode: key
    container: ******
    tier: cool

block_cache:
  block-size-mb: 256
  mem-size-mb: 112640
  prefetch: 20
  parallelism: 128

I'm not sure, but it might be useful to see the load graphs (screenshot attached). After the failed backup, I interrupted the process and repeated it with different configurations.

Update: I have added a new debug log; maybe it can help: blobfuse2-new.zip

olegshv commented 1 month ago

Hello! I've spent all these days testing the problem. Unfortunately, the issues do not disappear with different configurations. To structure everything better, I captured debug logs with a "minimal" configuration in the hope that this will help you track down the problem.

Three cases were recorded, with block-size-mb of 64, 128 and 256. We ran 8 database backups at the same time. With block-size-mb 64 and 128, the network share crashed. With block-size-mb 256 the network share kept working, but the MSSQL database backups failed.

The operating system returned the error '1117(The request could not be performed because of an I/O device error.)' while attempting 'FlushFileBuffers' on '\\mystorage\mystorage\Client\BackupFolder\site2\site2-db-20240617-175314-812-full.bak'.
BACKUP DATABASE is terminating abnormally.
Sizes of database backups:
site1 2.3Tb
site2 166Gb
site3 1.3Gb
site4 24Gb
site5 642Mb
site6 246Gb
site7 374Mb
site8 236Mb
site9 253Gb

My config (Standard E8d v5 (8 vcpus, 64 GiB memory)):

logging:
    type: base
    level: log_debug
    file-path: '/var/log/blobfuse2/blobfuse2.log'

allow-other: true

azstorage:
    type: adls
    account-name: ******
    account-key: ******
    mode: key
    container: ******
    tier: cool
    max-retries: 10

components:
    - libfuse
    - block_cache
    - attr_cache
    - azstorage

health_monitor:
    enable-monitoring: true
    stats-poll-interval-sec: 30
    output-path: /var/log/blobfuse2

monitor-disable-list:
    - memory_profiler
    - cpu_profiler
    - file_cache_monitor

libfuse:
  attribute-expiration-sec: 240
  entry-expiration-sec: 240
  negative-entry-expiration-sec: 240
block_cache:
  block-size-mb: 64 / 128 / 256   # one value per test run
  mem-size-mb: 56320
  prefetch: 12

attr_cache:
  timeout-sec: 7200

I am attaching the logs of three scenarios. If you need to test other scenarios, we are ready to help in any way we can. Thank you!

  1. blobfuse2-test (64-12).zip
  2. blobfuse2-test (128-12).zip
  3. blobfuse2-test (256-12)2.zip
vibhansa-msft commented 1 month ago
Wed Jun 12 22:57:54 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [libfuse_handler.go (815)]: Libfuse::libfuse_flush : error flushing file /storage/backup/mssql-site2-diff.bak, handle: 2066 [PUT https://backupStorage.blob.core.windows.net/storage/backup/mssql-site2-diff.bak
--------------------------------------------------------------------------------
RESPONSE 400: 400 One of the request inputs is not valid.
ERROR CODE: InvalidInput
--------------------------------------------------------------------------------

This is not due to a large block size or 50K-block exhaustion. It sounds like something different; the backend may be failing for some other reason. Are there any storage account metrics pointing to a specific error?

vibhansa-msft commented 1 month ago

Is this the only error you are hitting, or is there something else? Due to the long messages I might have overlooked some part, so I just want to understand the real issue you are facing here.

vibhansa-msft commented 1 month ago

Any updates on this issue ?

olegshv commented 4 weeks ago

Hello! I apologise for the long silence; I was on holiday. Unfortunately, we didn't have Storage Account logging set up.

Today I created a Log Analytics workspace and ran backups for a new test with the configuration (block-size-mb: 128, mem-size-mb: 56320, prefetch: 12). We see the following picture (screenshots attached):

Please tell me what kind of logs you need for analysis.

As for your question about the types of errors in the logs.

Wed Jun 12 22:57:54 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [libfuse_handler.go (815)]: Libfuse::libfuse_flush : error flushing file /storage/backup/mssql-site2-diff.bak, handle: 2066 [PUT https://backupStorage.blob.core.windows.net/storage/backup/mssql-site2-diff.bak
--------------------------------------------------------------------------------
RESPONSE 400: 400 One of the request inputs is not valid.
ERROR CODE: InvalidInput
--------------------------------------------------------------------------------

This is not due to a large block size or 50K-block exhaustion. It sounds like something different; the backend may be failing for some other reason. Are there any storage account metrics pointing to a specific error?

Here are more examples of errors that we see in the logs:

Mon Jun 17 13:46:43 EEST 2024 : blobfuse2[5173] : [/opt/blobfuse/fuse_container] LOG_ERR [block_blob.go (1331)]: BlockBlob::StageAndCommit : Failed to stage to blob Client/BackupFolder/site5/site5-db-20240617-104510-809-full.bak with ID OTVhYTgxOGItZjAxMS00YjQ0LTcyYWYtMTcwYzBiYWM3MGE3 at block 654311424 [PUT https://mystorage.blob.core.windows.net/mystorage/Client/BackupFolder/site5/site5-db-20240617-104510-809-full.bak
--------------------------------------------------------------------------------
RESPONSE 400: 400 The specified blob or block content is invalid.
ERROR CODE: InvalidBlobOrBlock
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidBlobOrBlock</Code><Message>The specified blob or block content is invalid.
RequestId:c4597294-801e-003c-43a3-c0f2da000000
Time:2024-06-17T10:46:43.8223452Z</Message></Error>
--------------------------------------------------------------------------------
]
Mon Jun 17 13:46:43 EEST 2024 : blobfuse2[5173] : [/opt/blobfuse/fuse_container] LOG_ERR [block_blob.go (1160)]: BlockBlob::TruncateFile : Failed to stage and commit file Client/BackupFolder/site5/site5-db-20240617-104510-809-full.bak%!(EXTRA string=PUT https://mystorage.blob.core.windows.net/mystorage/Client/BackupFolder/site5/site5-db-20240617-104510-809-full.bak
--------------------------------------------------------------------------------
RESPONSE 400: 400 The specified blob or block content is invalid.
ERROR CODE: InvalidBlobOrBlock
--------------------------------------------------------------------------------
Mon Jun 17 13:45:18 EEST 2024 : blobfuse2[5173] : [/opt/blobfuse/fuse_container] LOG_DEBUG [utils.go (151)]: SDK(ResponseError) : HEAD https://mystorage.blob.core.windows.net/mystorage/Client/BackupFolder/site7
--------------------------------------------------------------------------------
RESPONSE 404: 404 The specified blob does not exist.
ERROR CODE: BlobNotFound
--------------------------------------------------------------------------------
Mon Jun 17 16:35:39 EEST 2024 : blobfuse2[17024] : [/opt/blobfuse/fuse_container] LOG_DEBUG [utils.go (151)]: SDK(ResponseError) : GET https://mystorage.blob.core.windows.net/mystorage/Client/BackupFolder/site9/site9-db-20240617-104615-7185-v16-Dump.backup
--------------------------------------------------------------------------------
RESPONSE 416: 416 The range specified is invalid for the current size of the resource.
ERROR CODE: InvalidRange
--------------------------------------------------------------------------------
olegshv commented 4 weeks ago

Hi! Finally, we got to the bottom of the "FlushFileBuffers" error: it is caused by MSSQL backup COMPRESSION. As far as I understand, it is impossible to combine block_cache with backup compression. We will continue testing. Thank you!

(screenshots attached)

vibhansa-msft commented 4 weeks ago

Can you share more details on why backup compression and block-cache cannot be combined? If the data is compressed and then stored as a file in storage using block-cache, it should work just fine. If the data is written to a file and then compression is applied to that data (which is now residing in the cloud), then there is a huge cost in downloading the data again, compressing it, and re-uploading the compressed data. What is the exact use-case here?

olegshv commented 4 weeks ago

I'm not ruling out that I could be wrong, but it looks like this: https://learn.microsoft.com/en-us/sql/relational-databases/backup-restore/backup-compression-sql-server?view=sql-server-2016#Allocation (screenshot attached)

vibhansa-msft commented 4 weeks ago

Truncating a file to a larger or smaller offset is something we support, provided the file was previously closed and uploaded. If an application tries to truncate a file that is still in transit, it may fail.

olegshv commented 3 weeks ago

Unfortunately, tests strongly suggest that it does not work with MSSQL compression.

I turned off the compression and ran the tests, and again there are problems. MSSQL is OK so far, but PostgreSQL is not :( pg_dump: error: could not open file "/mnt/DailyClientBackup/site1-20240627-132357/site1/site1-dump.backup": Input/output error

Thu Jun 27 16:25:57 EEST 2024 : blobfuse2[798] : [/opt/blobfuse/fuse_container] LOG_ERR [block_cache.go (357)]: BlockCache::OpenFile : Block size mismatch for Client/SQLDailyBackup/site1/site1-db-20240627-132357-813-v16-Dump.backup [block: JeHoTH2cS8dr294cbrYiIQ==, size: 24395776]
Thu Jun 27 16:25:57 EEST 2024 : blobfuse2[798] : [/opt/blobfuse/fuse_container] LOG_ERR [libfuse_handler.go (710)]: Libfuse::libfuse_open : Failed to open Client/SQLDailyBackup/site1/site1-db-20240627-132357-813-v16-Dump.backup [block size mismatch for Client/SQLDailyBackup/site1/site1-db-20240627-132357-813-v16-Dump.backup]

Debug logs: blobfuse2log.txt. And now we have some logs from Azure (screenshot attached).

olegshv commented 3 weeks ago

Update. Yesterday's issue was resolved after changing the ACL rules on a directory in the storage via Azure Storage Explorer. This is very strange, since the rules were already set correctly; we will continue to monitor.

Another critical issue is the network share crash. This problem has been haunting us since the very beginning of testing, and so far, I have not been able to understand what is causing it.

We were running 6 backups in parallel. After about 5 minutes, the network share crashed.

(screenshot attached)

These are the logs we see in the last seconds before the network share became unavailable.

(screenshots attached)

Fri Jun 28 01:56:11 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [utils.go (151)]: SDK(Retry) : response 201
Fri Jun 28 01:56:11 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [utils.go (151)]: SDK(Retry) : exit due to non-retriable status code
Fri Jun 28 01:56:11 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:11 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:11 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [block_cache.go (1095)]: BlockCache::waitAndFreeUploadedBlocks : Block cleanup for block 731=>Client/BackupFolder/site1/site1-db-diff.bak (index 69, offset 9261023232)
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [block_cache.go (1049)]: BlockCache::lineupUpload : Upload block 731=>Client/BackupFolder/site1/site1-db-diff.bak (index 86, offset 11542724608, data 134217728)
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [block_blob.go (1422)]: BlockBlob::StageBlock : name Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [utils.go (151)]: SDK(Retry) : =====> Try=1
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [utils.go (151)]: SDK(Request) : ==> OUTGOING REQUEST (Try=1)
   PUT https://mystore.blob.core.windows.net/mycontainer/Client%2FBackupFolder%2Fsite1%2Fsite1-db-diff.bak?blockid=iQxbQx0HQPtPoJxTDstH%2BQ%3D%3D&comp=block
   Accept: application/xml
   Authorization: REDACTED
   Content-Length: 134217728
   Content-Type: application/octet-stream
   User-Agent: Azure-Storage-Fuse/2.3.0 (Debian GNU/Linux 11 (bullseye)) azsdk-go-azblob/v1.3.2 (go1.22.1; linux)
   X-Ms-Date: Thu, 27 Jun 2024 22:56:12 GMT
   x-ms-version: 2023-11-03

Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [block_cache.go (1095)]: BlockCache::waitAndFreeUploadedBlocks : Block cleanup for block 731=>Client/BackupFolder/site1/site1-db-diff.bak (index 70, offset 9395240960)
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [block_cache.go (1049)]: BlockCache::lineupUpload : Upload block 731=>Client/BackupFolder/site1/site1-db-diff.bak (index 87, offset 11676942336, data 134217728)
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [block_blob.go (1422)]: BlockBlob::StageBlock : name Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [utils.go (151)]: SDK(Retry) : =====> Try=1
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [utils.go (151)]: SDK(Request) : ==> OUTGOING REQUEST (Try=1)
   PUT https://mystore.blob.core.windows.net/mycontainer/Client%2FBackupFolder%2Fsite1%2Fsite1-db-diff.bak?blockid=T%2Ff10O7YSj1hwuzlWbIZFQ%3D%3D&comp=block
   Accept: application/xml
   Authorization: REDACTED
   Content-Length: 134217728
   Content-Type: application/octet-stream
   User-Agent: Azure-Storage-Fuse/2.3.0 (Debian GNU/Linux 11 (bullseye)) azsdk-go-azblob/v1.3.2 (go1.22.1; linux)
   X-Ms-Date: Thu, 27 Jun 2024 22:56:12 GMT
   x-ms-version: 2023-11-03

Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [utils.go (151)]: SDK(Response) : ==> REQUEST/RESPONSE (Try=1/1.989390259s, OpTime=1.989417219s) -- RESPONSE RECEIVED
   PUT https://mystore.blob.core.windows.net/mycontainer/Client%2FBackupFolder%2Fsite1%2Fsite1-db-diff.bak?blockid=M4kx5pwnSG1%2BhhGDzHZ7JQ%3D%3D&comp=block
   Accept: application/xml
   Authorization: REDACTED
   Content-Length: 134217728
   Content-Type: application/octet-stream
   User-Agent: Azure-Storage-Fuse/2.3.0 (Debian GNU/Linux 11 (bullseye)) azsdk-go-azblob/v1.3.2 (go1.22.1; linux)
   X-Ms-Date: Thu, 27 Jun 2024 22:56:10 GMT
   x-ms-version: 2023-11-03
   --------------------------------------------------------------------------------
   RESPONSE Status: 201 Created
   Content-Length: 0
   Date: Thu, 27 Jun 2024 22:56:11 GMT
   Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
   X-Ms-Content-Crc64: iS+krsn+pYw=
   X-Ms-Request-Id: 13793667-001e-0040-24e5-c8dc25000000
   X-Ms-Request-Server-Encrypted: REDACTED
   X-Ms-Version: 2023-11-03
vibhansa-msft commented 3 weeks ago

By "network share crash", do you mean the blobfuse mount path becomes unstable after some time? When you say 5 backups run in parallel, are they writing to the same path, or do you have 5 different mounts? Also, if the crash is consistent, you can use the "--foreground" CLI option; when it crashes it will dump some logs to the console, so kindly share those. In foreground mode the mount blocks the console, so you have to run the backup and anything else from a different console.
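For reference, combining the mount command from the top of this issue with that option would look like this (a CLI fragment only; the paths are the ones already used in this issue):

```shell
# Run blobfuse2 in the foreground so crash output is printed to this console.
blobfuse2 mount /opt/blobfuse/fuse_container/ \
    --config-file=/opt/blobfuse/fuse_connection.yaml \
    --foreground
```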

olegshv commented 3 weeks ago

Yes, I mean that for some reason the mount point disappears, and at the same time we can see the RAM being freed on the graph. image

By "run in parallel" I mean that all backups are written to a single mount point.

I added the --foreground option, and this is what we get after the crash: image

olegshv commented 3 weeks ago

Hi, just wondering whether these logs helped you in any way? I'm worried about whether this can be fixed quickly at all. If there's anything else you need, I'm ready for shamanic dances and prayers :) Thank you!

souravgupta-msft commented 3 weeks ago

Hi @olegshv. Can you also please share the debug logs for the above crash? We are trying to replicate this error. Is this error happening frequently for your use case?

olegshv commented 3 weeks ago

@souravgupta-msft hi! Yes, of course. Here are the logs: blobfuse2_updated.log

I would say that with this configuration, every launch will crash. Depending on the configuration parameters, the crash can occur at different points in time. I'll record some more logs for you.
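For context, the shape of the block_cache section we have been varying between runs looks roughly like this (key names as in the blobfuse2 block_cache documentation; the values below are illustrative, not the exact ones from the screenshots):

```yaml
components:
  - libfuse
  - block_cache
  - attr_cache
  - azstorage

block_cache:
  block-size-mb: 128    # keep file-size / block-size under the 50,000-block limit
  mem-size-mb: 8192     # RAM budget for in-flight blocks
  prefetch: 12          # read-ahead block count
  parallelism: 32       # number of upload/download workers
```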

Update: it crashed again, with the following configuration: image

image