akash-network / support

Akash Support and Issue Tracking

Provider stops bidding after < 9 hours running. #92

Closed 88plug closed 1 year ago

88plug commented 1 year ago

Noticed that bdl.computer stopped bidding after less than 9 hours of running. Here are the logs from an attempt to create a new deployment:

I[2022-11-28|17:21:41.597] order detected                               module=bidengine-service cmp=provider order=order/akash1q0m0kz83qwpuc5ss39y8sf25mq85a43ffjfd3v/8676889/1/1
I[2022-11-28|17:21:41.601] group fetched                                module=bidengine-order cmp=provider order=akash1q0m0kz83qwpuc5ss39y8sf25mq85a43ffjfd3v/8676889/1/1
I[2022-11-28|17:21:41.601] requesting reservation                       module=bidengine-order cmp=provider order=akash1q0m0kz83qwpuc5ss39y8sf25mq85a43ffjfd3v/8676889/1/1
D[2022-11-28|17:21:41.602] reservation requested                        module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash1q0m0kz83qwpuc5ss39y8sf25mq85a43ffjfd3v/8676889/1/1 resources="group_id:<owner:\"akash1q0m0kz83qwpuc5ss39y8sf25mq85a43ffjfd3v\" dseq:8676889 gseq:1 > state:open group_spec:<name:\"akash\" requirements:<signed_by:<> > resources:<resources:<cpu:<units:<val:\"1000\" > > memory:<quantity:<val:\"536870912\" > > storage:<name:\"default\" quantity:<val:\"536870912\" > > endpoints:<> > count:1 price:<denom:\"uakt\" amount:\"10000000000000000000000\" > > > created_at:8676891 "
D[2022-11-28|17:21:41.602] reservation count                            module=provider-cluster cmp=provider cmp=service cmp=inventory-service cnt=2
I[2022-11-28|17:21:41.603] Reservation fulfilled                        module=bidengine-order cmp=provider order=akash1q0m0kz83qwpuc5ss39y8sf25mq85a43ffjfd3v/8676889/1/1
D[2022-11-28|17:21:42.844] running check                                module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash18u6zew2sg9kfwlk4rurh3m8vxlylp9jmzr25ah/8669866/1/1/akash19yhu3jgw8h0320av98h8n5qczje3pj3u9u2amp manifest-group=akash cmp=deployment-monitor attempt=1
I[2022-11-28|17:21:42.889] check result                                 module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash18u6zew2sg9kfwlk4rurh3m8vxlylp9jmzr25ah/8669866/1/1/akash19yhu3jgw8h0320av98h8n5qczje3pj3u9u2amp manifest-group=akash cmp=deployment-monitor ok=true attempt=1
D[2022-11-28|17:21:43.878] submitting fulfillment                       module=bidengine-order cmp=provider order=akash1q0m0kz83qwpuc5ss39y8sf25mq85a43ffjfd3v/8676889/1/1 price=11.000000000000000000uakt

The bidengine-order module never gets past "submitting fulfillment"; that is the last line logged for module=bidengine-order.
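To spot this condition without reading the log by eye, a small script can track the last bidengine-order message seen for each order and flag the ones that end on "submitting fulfillment". This is only a sketch that assumes the log line layout shown above; `stuck_orders` and `LINE_RE` are names made up for this example and are not part of the provider.

```python
import re
from typing import Iterable, List, Tuple

# Assumed layout, taken from the log excerpt above:
#   <level>[<timestamp>] <message>   module=bidengine-order ... order=<id> ...
LINE_RE = re.compile(
    r"^[A-Z]\[(?P<ts>[^\]]+)\]\s+(?P<msg>.+?)\s+module=bidengine-order\b"
    r".*?\border=(?P<order>\S+)"
)


def stuck_orders(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (order, timestamp) pairs whose last bidengine-order
    message is 'submitting fulfillment'."""
    last = {}  # order id -> (timestamp, message) of the most recent line
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            last[m.group("order")] = (m.group("ts"), m.group("msg"))
    return [
        (order, ts)
        for order, (ts, msg) in last.items()
        if msg == "submitting fulfillment"
    ]


if __name__ == "__main__":
    # Two lines copied from the log excerpt above.
    sample = [
        "I[2022-11-28|17:21:41.603] Reservation fulfilled                        module=bidengine-order cmp=provider order=akash1q0m0kz83qwpuc5ss39y8sf25mq85a43ffjfd3v/8676889/1/1",
        "D[2022-11-28|17:21:43.878] submitting fulfillment                       module=bidengine-order cmp=provider order=akash1q0m0kz83qwpuc5ss39y8sf25mq85a43ffjfd3v/8676889/1/1 price=11.000000000000000000uakt",
    ]
    for order, ts in stuck_orders(sample):
        print(f"{order} last seen at {ts}: submitting fulfillment")
```

An order only counts as stuck if nothing follows "submitting fulfillment" for it, so run this against a log window that ends well after the order was detected.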

88plug commented 1 year ago

This bug persists. Today I woke up to half of my providers showing the exact same error and not bidding. The only fix is to manually restart the provider.

I[2023-01-06|20:34:25.996] order detected                               module=bidengine-service cmp=provider order=order/akash1q0m0kz83qwpuc5ss39y8sf25mq85a43ffjfd3v/9226782/1/1
I[2023-01-06|20:34:26.004] group fetched                                module=bidengine-order cmp=provider order=akash1q0m0kz83qwpuc5ss39y8sf25mq85a43ffjfd3v/9226782/1/1
I[2023-01-06|20:34:26.005] requesting reservation                       module=bidengine-order cmp=provider order=akash1q0m0kz83qwpuc5ss39y8sf25mq85a43ffjfd3v/9226782/1/1
D[2023-01-06|20:34:26.005] reservation requested                        module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash1q0m0kz83qwpuc5ss39y8sf25mq85a43ffjfd3v/9226782/1/1 resources="group_id:<owner:\"akash1q0m0kz83qwpuc5ss39y8sf25mq85a43ffjfd3v\" dseq:9226782 gseq:1 > state:open group_spec:<name:\"akash\" requirements:<signed_by:<> > resources:<resources:<cpu:<units:<val:\"1000\" > > memory:<quantity:<val:\"2147483648\" > > storage:<name:\"default\" quantity:<val:\"1073741824\" > > endpoints:<> > count:1 price:<denom:\"uakt\" amount:\"10000000000000000000000\" > > > created_at:9226784 "
D[2023-01-06|20:34:26.005] reservation count                            module=provider-cluster cmp=provider cmp=service cmp=inventory-service cnt=1
I[2023-01-06|20:34:26.005] Reservation fulfilled                        module=bidengine-order cmp=provider order=akash1q0m0kz83qwpuc5ss39y8sf25mq85a43ffjfd3v/9226782/1/1
D[2023-01-06|20:34:26.827] submitting fulfillment                       module=bidengine-order cmp=provider order=akash1q0m0kz83qwpuc5ss39y8sf25mq85a43ffjfd3v/9226782/1/1 price=34.000000000000000000uakt

andy108369 commented 1 year ago

Likely related to https://github.com/ovrclk/engineering/issues/673 (internal link)

andy108369 commented 1 year ago

I can't reproduce this issue, nor can I see it anywhere. I deploy almost daily and usually see ~20 providers bid on my requests. The providers we manage are also at 98% of capacity.

andy108369 commented 1 year ago

@88plug what's your bid timeout value in the provider? I've noticed your providers aren't expiring their bids after the default 5 minutes. Most likely you have set this to a higher value.

While your provider is holding on to bids that the tenant hasn't accepted, it won't bid on new orders if those "pending" deployments are holding up all of its resources.
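To illustrate the effect, here is a toy model of how unexpired bids can block new ones. It is not the provider's actual inventory code: `Reservation`, `can_bid`, and the capacity numbers are all made up for this sketch, and it only assumes that a pending bid keeps its resources reserved until the bid timeout elapses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reservation:
    cpu_millis: int    # CPU reserved for a pending (unaccepted) bid
    created_at: float  # time the bid was placed, in seconds


def can_bid(
    total_cpu: int,
    reservations: List[Reservation],
    request_cpu: int,
    now: float,
    bid_timeout: float,
) -> bool:
    """Toy check: a reservation is held until bid_timeout seconds pass."""
    held = sum(
        r.cpu_millis for r in reservations if now - r.created_at < bid_timeout
    )
    return held + request_cpu <= total_cpu


if __name__ == "__main__":
    # Hypothetical provider with 4000 millicores, fully reserved by four
    # pending bids placed at t=0. A new 1000-millicore order arrives at t=600s.
    pending = [Reservation(1000, 0.0) for _ in range(4)]
    print(can_bid(4000, pending, 1000, now=600.0, bid_timeout=300.0))   # True
    print(can_bid(4000, pending, 1000, now=600.0, bid_timeout=3600.0))  # False
```

With a 5-minute timeout the old bids have already expired by the time the new order arrives; with a much longer timeout they still hold all of the capacity, so the provider has nothing left to offer.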