aws / aws-fpga

Official repository of the AWS EC2 FPGA Hardware and Software Development Kit
Other
1.5k stars 513 forks source link

fpga_mgmt_load_local_image_sync() failures #634

Closed ljp0101 closed 6 months ago

ljp0101 commented 7 months ago

We've been experiencing agfi loading errors starting from two days ago. This has happened 4x, 3x in one EKS cluster and 1x in a second, and we do ~30x scaling events per day so failure rate is 5-10%. Our code hasn't changed recently, which suggests to me that a change on the aws side might be causing this.

Error:

Fpga::Load: fpga_mgmt_load_local_image_sync() failed; AWS error code: -110; AWS error message: A time out error is usually spurious and the request can be retried. If

Code:

   // load bitfile
   {
      int result_check = fpga_mgmt_load_local_image_sync(0, const_cast<char*>(bitfile_name.c_str()), 1000, 30, &fpga_mgmt_image_info);
      if (result_check != 0) {
         b_error = true;
         error_message << "Fpga::Load: fpga_mgmt_load_local_image_sync() failed; ";
         error_message << "AWS error code: " << result_check << "; ";
         error_message << "AWS error message: " << fpga_mgmt_strerror_long(result_check);
         return;
      }
   }

My understanding of the function call is it's trying 1000 x 30ms so for 30s. I will try wrapping fpga_mgmt_load_local_image_sync() with retry functionality within my drivers to mitigate but 30s is a timescale that should be beyond failure.

jacobmgn commented 7 months ago

Hello, We are investigating the issue that you've reported. In order to troubleshoot this issue further, we will need you to open a support case from your AWS account. Thanks!

ljp0101 commented 7 months ago

No support plan so can't create a ticket.

These are UTC times and IPs for instances where we had problems:

Jan 31 00:19:31.771 ip-12-63-201-12.ec2.internal Jan 31 00:19:28.497 ip-12-63-14-64.ec2.internal Jan 29 23:07:06.589 ip-12-63-11-159.ec2.internal

Jan 31 17:21:55.295 ip-11-63-1-53.ec2.internal Jan 30 19:31:36.423 ip-11-63-11-107.ec2.internal

Happy to share account numbers if there's a private way to do so.

jacobmgn commented 7 months ago

The easiest path forward for us to support you on this would be for you to open an AWS support case under the AWS account where you are experiencing this issue.

That will allow us to provide support privately

jacobmgn commented 7 months ago

If you can also provide

that will assist us in identifying the issue that you are reporting

ljp0101 commented 7 months ago

Us-east-1 and all incidents & times as per prior post. I have no good way to correlate the historic IPs with instance IDs.

I supplemented our logging with instance IDs so will update if and when we see again (but I also changed the retry strategy so may have quashed our problem).

AWSjoeluc commented 7 months ago

Hello!

We have been investigating this issue. Were unable to resolve this issue with your new retry strategy, or have there been further failures? Is there anything else we can help with on this issue?

ljp0101 commented 7 months ago

The bitfile loading problem resolved with the code below. I may be seeing the same underlying problem manifest elsewhere though as I've also been having occasional communication problems with the fpga starting around the same time. Will close this issue and reopen if I can collect actionable data.

   // load bitfile
   for (uint i = 0; i < 10; ++i) {
      const int result_check = fpga_mgmt_load_local_image_sync(0, const_cast<char*>(bitfile_name.c_str()), 100, 30, &fpga_mgmt_image_info);
      if (result_check == 0) {
         break;      
      }    
      if (i == 10) {
         b_error = true;
         error_message << "Fpga::Load: fpga_mgmt_load_local_image_sync() failed; ";
         error_message << "AWS error code: " << result_check << "; ";
         error_message << "AWS error message: " << fpga_mgmt_strerror_long(result_check);
         return;
      }
      usleep(10000);
   }
ljp0101 commented 6 months ago

Checked my logs and I have an instance ID & time (UTC).

You'll note I'm an idiot and did bad error checking (retry checking should be against 9, not 10). This was a lucky mistake though as I know the FPGA worked properly despite the error response to the load function.

The left timestamps are from datadog so not completely accurate and right timestamps where available are system clock. I see the time spent in the bitfile loading function is 1708469173623 - 1708469153936 = 19.7 seconds, which is odd because I expect the elapsed time should be 10 attempts 100 sub attempts (function argument) 30ms wait (function argument) = 30 seconds.

Feb 20 22:46:03.015 [fpga-xxx] 1708469153936 InstanceId: i-0919988b3ae3b675d Feb 20 22:46:03.015 [fpga-xxx] 1708469154006 State: BOOT Feb 20 22:46:03.015 Loading fpga with bitfile named 'agfi-xxx' Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 1 Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 2 Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 3 Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 4 Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 5 Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 6 Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 7 Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 8 Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 9 Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 10 Feb 20 22:46:14.022 [fpga-xxx] 1708469173623 State: LOAD_READY Feb 20 22:46:16.022 [fpga-xxx] 1708469175447 Command: LOAD

   // load bitfile
   for (uint i = 0; i < 10; ++i) {
      const int result_check = fpga_mgmt_load_local_image_sync(0, const_cast<char*>(bitfile_name.c_str()), 100, 30, &fpga_mgmt_image_info);
      if (result_check == 0) {
         break;      
      }
      std::cout << "WARNING: fpga_mgmt_load_local_image_sync() failed, attempt " << i + 1 << std::endl;
      if (i == 10) {
         b_error = true;
         error_message << "Fpga::Load | fpga_mgmt_load_local_image_sync() failed; ";
         error_message << "AWS error code: " << result_check << "; ";
         error_message << "AWS error message: " << fpga_mgmt_strerror_long(result_check);
         return;
      }
      usleep(10000);
   }
czfpga commented 6 months ago

Hi,

I'm glad to hear that the FPGA functioned properly in this case. I'm wondering what's the returned status when the load command does not return as loaded in your retry loop. As I noticed that the function might skip the built-in retry and retry delay if the return status is neither succeed or busy (https://github.com/aws/aws-fpga/blob/863d963308231d0789a48f8840ceb1141368b34a/sdk/userspace/fpga_libs/fpga_mgmt/fpga_mgmt.c#L538-L564) assuming you're the function coming from the fpga_mgmt library.

ljp0101 commented 6 months ago

I initially observed -110, which is the stderr code for a connection timeout.

Fpga::Load: fpga_mgmt_load_local_image_sync() failed; AWS error code: -110; AWS error message: A time out error is usually spurious and the request can be retried. If

czfpga commented 6 months ago

Thanks, I think that's reason the retry time observed didn't match to the calculation. But timeout can be ignored and request should be retried. Please let me know if you have any other questions.

ljp0101 commented 6 months ago

I did retry the fpga_mgmt function call 10x with 10ms separation between function calls and it kept failing so it doesn't look like I can handle the error appropriately in any way besides terminating the instance and starting fresh.

I'll close the issue again because terminating the instance is good enough for my use case if the incident rate doesn't go above a few times per week but it seems to me like there's some more fundamental issue to be addressed as the underlying connection timeout error doesn't make much sense to me while nothing else is happening except loading the bitfile.