bcgov / MFIN-Data-Catalogue

The Finance Data Catalogue enables users to discover data holdings at the BC Ministry of Finance and offers information and functionality that benefit consumers of data for business purposes. The product is built using Drupal and adheres to the Government of BC's Core Administrative and Descriptive Metadata Standard.

Time out error when uploading replacement data dictionary #481

Closed ChristaBull closed 4 months ago

ChristaBull commented 4 months ago

OP timer: https://openplus.monday.com/boards/4092908516/pulses/6665272272


Describe the bug

When uploading the attached data dictionary file (ptt-transaction_ID_87_for_upload.csv), the preview is displayed, but the site times out during the import.

To reproduce

Steps to reproduce the behaviour:

  1. Go to https://cat.data.fin.gov.bc.ca/node/87/add-columns?destination=/node/87/build
  2. Select a downloaded copy of the attached file as the 'Import file'
  3. Click on 'Upload'
  4. Scroll down and click 'Import'
  5. See error

Expected behaviour

The data should be loaded and replace the existing data dictionary.

Screenshots


[screenshot attached]

Additional context


File: ptt-transaction_ID_87_for_upload.csv

CraigClark commented 4 months ago

I tested this and was able to duplicate the issue. On https://dev.cat.data.fin.gov.bc.ca/ I was able to upload the same .csv file successfully, so this is most likely a resource issue. @NicoledeGreef, this is likely something for App Ops to look into.

NicoledeGreef commented 4 months ago

Agreed; it was successful in dv14 as well, so this is likely due to resource constraints in Production.

One thing to check could be the PHP limits (max_execution_time, max_input_time, etc.), but getting a larger OCP resource allocation quota would also help.
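For reference, a quick way to dump the php.ini directives most likely to matter for a long-running import (a sketch only; the actual values in this environment are unknown):

```php
<?php
// check-limits.php — print the PHP limits that commonly affect long imports.
// Directive names are standard php.ini settings; values depend on the pod's config.
$directives = [
  'max_execution_time',
  'max_input_time',
  'max_input_vars',
  'memory_limit',
  'post_max_size',
  'upload_max_filesize',
];
foreach ($directives as $directive) {
  printf("%-20s = %s\n", $directive, ini_get($directive));
}
```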

NicoledeGreef commented 4 months ago

Email sent to App Ops/devs re: Tech Lead contacts for the project namespace in OCP, requesting assistance with submitting the storage increase request.

chrislaick commented 4 months ago

@NicoledeGreef I don't think this is a storage issue. ea352d-prod has a 16GB storage quota, of which 9GB has been allocated as shown below, along with the utilization of that 9GB: [Screenshot 2024-05-24 121219]

This seems to be a PHP or server configuration issue causing the timeout.
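For anyone wanting to confirm the storage picture from the CLI, something like this should show the quota and the PVC allocations (assuming access to the ea352d-prod namespace):

```sh
# Show the namespace's resource quotas, including storage
oc describe resourcequota -n ea352d-prod

# List persistent volume claims and their requested sizes
oc get pvc -n ea352d-prod
```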

CraigClark commented 4 months ago

@NicoledeGreef I was talking to @sylus about this. He mentioned that the last time we were having timeouts, the solution was to increase the CPU so the job executed faster; however, there isn't much room left to do that again.

Another thing App Ops can investigate is whether the problem is on the PHP side or the database side.
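One quick, non-invasive check (assuming cluster metrics are available) is to watch per-pod CPU while reproducing the import; that at least shows whether the PHP pod or the database pod is the one saturating:

```sh
# Live CPU/memory per pod while the import runs
oc adm top pods -n ea352d-prod
```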

chrislaick commented 4 months ago

@NicoledeGreef I'm going to put in a quota increase request for TEST and PROD to the next CPU bump, citing the need for a minimum of 3 replicas in TEST and PROD. Once that is approved, maybe we can increase the CPU limits on the relevant pods to speed up the job. In the meantime, @sylus, which pod/container would that be for us to increase CPU?
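Once we know the target, the bump itself would look roughly like this (deployment name below is a placeholder, not the real one in the namespace):

```sh
# Hypothetical name: replace "drupal" with the actual deployment/container.
oc set resources deployment/drupal -n ea352d-prod \
  --requests=cpu=500m --limits=cpu=2
```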

NicoledeGreef commented 4 months ago

Platform Services sent this message in response to CPU increase request:

Thank you for your recent request to increase the CPU quotas for your project. After a detailed review of the current resource utilization metrics in ea352d-test namespace, we have some important updates and recommendations.

Decision on the Quota Increase Request

Based on our analysis, we have observed that the overall CPU utilization across your namespace is currently very low at 1.59%. Although there is significant CPU throttling in the PostgreSQL cluster nodes, this alone does not justify a broad increase in CPU quotas for all components of your application.

Recommendations for Resource Optimization

1. Resource Optimization: Before considering an increase in CPU quotas, we strongly recommend conducting a thorough tuning of your application's resource usage. This includes optimizing the configuration and performance of the components that are experiencing high throttling, specifically the PostgreSQL nodes.
2. Reduce Unused CPU Limits: We recommend adjusting the CPU limits downwards for pods that are consistently using far less CPU than allocated. This will not only reduce resource wastage but also allow your Kubernetes cluster to allocate resources more efficiently. Lowering the CPU limits where possible will enable you to spin up more pods within the same resource quota, enhancing the scalability and fault tolerance of your application without additional resource requests.
3. Detailed Monitoring: Please continue to monitor the CPU usage and throttling metrics closely. This will help you identify specific areas where adjustments are necessary and can lead to more efficient use of resources.
4. Incremental Adjustments: Consider making incremental adjustments to the CPU limits and requests for the specific pods that show high throttling rates.

Further Guidance and Resources

We encourage you to review the guidelines in the following documentation to better understand the process and best practices for requesting quota increases and managing resources effectively: Request Quota Increase for OpenShift Project Set

Conclusion

At this time, we will hold off on approving an increase in CPU quotas. We believe that by implementing these optimizations and adjustments, you can achieve better performance and resource utilization within the current quotas. Please implement these recommendations and monitor the changes. If performance issues persist after these adjustments, we can revisit the discussion regarding additional resources.
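Acting on recommendation 2 would mean right-sizing the resources block in the relevant Deployment specs, along these lines (values illustrative only, to be sized from actual usage metrics):

```yaml
# Illustrative only: size requests/limits from observed usage, not these numbers.
resources:
  requests:
    cpu: 100m       # close to what the pod actually uses
    memory: 256Mi
  limits:
    cpu: 500m       # headroom without hoarding quota
    memory: 512Mi
```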

sylus commented 4 months ago

So the problem here was ultimately OpenShift's router: whenever a request takes more than 30 seconds, HAProxy returns that timeout, not Nginx/Varnish. It took a little while to figure out, but I increased the timeout at that level using an annotation.
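For anyone hitting this later, the annotation in question is presumably OpenShift's standard route timeout; applying it looks like this (route name is a placeholder — see the MR for the actual change):

```sh
# Hypothetical route name; haproxy.router.openshift.io/timeout is the standard annotation.
oc annotate route/drupal -n ea352d-prod \
  haproxy.router.openshift.io/timeout=60s --overwrite
```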

See the associated MR. :)

CraigClark commented 4 months ago

I tested this. I was able to upload the .csv @ChristaBull provided to https://cat.data.fin.gov.bc.ca/node/87 without issue.

When I upload a .csv with 1000 rows, I get a timeout error on screen, but when I look at the node, all the rows are there: https://cat.data.fin.gov.bc.ca/node/205/edit?display=section_6&destination=/node/205/build. This is similar to the behaviour we saw when importing the information schedule.

@NicoledeGreef, I left the test record on prod for you to take a look at; it should be deleted afterwards.

I have attached a couple of large files you can test with: extended_test_data_500.csv and extended_test_data_1000.csv.

CraigClark commented 4 months ago

Some more information on this

Will has set the HAProxy timeout to 60 seconds. What happens is that HAProxy times out, but Drupal keeps working in the background; that's why we see a 504 but the import still completes.

Ideally, the import should be optimized to run in batches of 50 or so. This is a change to the application, not the environment it runs on.
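A rough sketch of what that could look like with Drupal's Batch API (module, form, and helper names here are hypothetical, not the catalogue's actual code):

```php
<?php
// Hypothetical sketch: chunk the parsed CSV rows and let Drupal's Batch API
// process them 50 at a time, so no single request outlives the proxy timeout.

use Drupal\Core\Form\FormStateInterface;

function mymodule_import_submit(array &$form, FormStateInterface $form_state) {
  $rows = $form_state->get('parsed_rows');  // rows parsed from the uploaded CSV
  $batch_size = 50;                         // could be read from admin/config/data-catalogue

  $operations = [];
  foreach (array_chunk($rows, $batch_size) as $chunk) {
    $operations[] = ['mymodule_import_chunk', [$chunk]];
  }

  batch_set([
    'title' => t('Importing data dictionary'),
    'operations' => $operations,
    'finished' => 'mymodule_import_finished',
  ]);
}

function mymodule_import_chunk(array $chunk, array &$context) {
  foreach ($chunk as $row) {
    // Create or update the column entity for this row (hypothetical helper).
    mymodule_save_column($row);
  }
  $context['message'] = t('Imported @n more columns…', ['@n' => count($chunk)]);
}
```

Each operation runs in its own request, so the browser gets regular progress responses instead of one long request that HAProxy kills.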

CraigClark commented 4 months ago

Assigning to @lkmorlan to see if we can get this to go in batches.

Liam, please document where Chris and/or Dan can change the batch size. They may want to adjust that.

lkmorlan commented 4 months ago

The batch size can be adjusted at admin/config/data-catalogue.
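If it ever needs to be scripted, the same setting should be reachable with Drush config commands, though the config object and key names below are guesses, not confirmed — check the form at admin/config/data-catalogue for the real ones:

```sh
# Config object/key names are guesses; verify against the actual settings form.
drush config:get bcbb_data_catalogue.settings
drush config:set bcbb_data_catalogue.settings import_batch_size 50
```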

NicoledeGreef commented 4 months ago

Thanks @sylus @CraigClark @lkmorlan @chrislaick .

@ChristaBull - imports with larger column counts should be OK now. Please try your imports and get in touch if there are any issues.