dart-lang / pub-dev

The pub.dev website
https://pub.dev

pub.dev is down (global outage) #4663

Closed timsneath closed 3 years ago

timsneath commented 3 years ago

Overview

At approximately 7pm Pacific, the pub infrastructure began to respond with an HTTP 502 Site Error message. The site infra is deployed to the us-central GCP region. This impacts core Flutter services.

Please do not reply to this bug with "me too" or +1 messages; they make it harder for folk to track. Thanks!

Updates

7:10pm Pacific We're currently experiencing an outage on pub.dev, which appears to be related to a load balancer issue. We don't have an ETA for a resolution at this time; we're currently working to understand the issue.

7:59pm Pacific The pub.dev site is still down. We have a Google on-call engineer currently investigating. We have not yet identified a root cause.

8:15pm Pacific We apologize for the inconvenience. We're seeing load balancer errors and are escalating to the appropriate team. Still no ETA, unfortunately, since we still haven't determined the root cause.

8:27pm Pacific We have multiple Google Cloud engineers on-call investigating, but I'm sorry to report that we still don't have a root cause. We'll continue to post updates regularly. Thank you for your patience.

9:00pm Pacific We are continuing to debug the problem. We have declared a Google escalated outage while we attempt to identify the root cause. Some folk have been successful using the Chinese mirror site at https://pub.flutter-io.cn.

9:20pm Pacific Again, apologies.

9:35pm Pacific We are currently exploring the theory that we have exceeded a quota, but that the error didn't show in the log. We are paging an on-call team to increase the quota and see whether this resolves the issue. Again, this really sucks -- we recognize that it's a major inconvenience to you all, and we're feeling sick that we're down. Thank you for being patient with us :(

9:45pm Pacific We have updated the quota and are resetting the VM instances, to see if we have successfully identified the root cause.

9:51pm Pacific We are seeing evidence of partially restored service.

9:55pm Pacific The pub service appears to be fully restored.

10:15pm Pacific Here's what we think we know at this point in time. At some point within the last day or two, a change was made to the pub.dev landing page that includes a call to the YouTube API. There is a quota limit for YouTube calls that we didn't hit over the last few days, but today we hit it. Compounding the issue, the code was missing exception handling, and the logging was inadequate or obfuscated enough that we were unable to spot the problem immediately. The immediate resolution was to raise the quota temporarily to give us time to revert the original change.

At this time we think the issue is resolved, but we'll obviously be monitoring closely. Again, apologies on behalf of the Flutter & Dart teams for the disruption. We take this very seriously, and we will perform a full post-mortem and share the learnings and actions we'll take as a result of this.
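
As a rough illustration of the failure mode described above (not the actual pub.dev code, which is written in Dart), the sketch below wraps a non-critical third-party call so that a quota error is logged by name and the page falls back to empty content instead of failing. The names `QuotaExceededError`, `fetch_videos`, and `landing_page_videos` are hypothetical and exist only for this example.

```python
import logging

logger = logging.getLogger("landing_page")


class QuotaExceededError(Exception):
    """Hypothetical error raised when a third-party API quota is exhausted."""


def landing_page_videos(fetch_videos):
    """Return videos for the landing page, or an empty list on failure.

    `fetch_videos` is any zero-argument callable that talks to the
    third-party API; it is injected so the sketch stays self-contained.
    """
    try:
        return list(fetch_videos())
    except QuotaExceededError as err:
        # Log the quota failure explicitly so it is easy to spot in the logs.
        logger.error("video fetch failed: quota exceeded: %s", err)
    except Exception:
        # Any other failure is also non-fatal for an optional page section.
        logger.exception("video fetch failed unexpectedly")
    return []
```

With this shape, an exhausted quota costs the page one optional section rather than the whole request.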

huynhchicuong commented 3 years ago

gz team! 🎉🎉

vinothvino42 commented 3 years ago

Congratulations Team 👏

ensaryusuf commented 3 years ago

Everyone back to coding. 😂 Pub.dev is now stable.

H-Zaman commented 3 years ago

GGWP

HasanAlyazidi commented 3 years ago

Up and running, thank you

goose-intestine commented 3 years ago

Congrats Team

hienlh commented 3 years ago

Many thanks to all Flutter team

arteevraina commented 3 years ago

You are superheroes.

daiki1003 commented 3 years ago

Thanks all :)

DevKhalyd commented 3 years ago

Working 👍

bhtri commented 3 years ago

Thank you very much

3lVv0w commented 3 years ago

Working 🙏

doyle-flutter commented 3 years ago

Thanks !!! 🙏 'pub get' working ! / Pub.dev is accessible in KO.

saswatsaubhagya commented 3 years ago

Working , Thank you.

luoxufeiyan commented 3 years ago

Pub.dev is now accessible in Australia.

HTTP/1.1 303 See Other
Date: Thu, 25 Mar 2021 04:57:07 GMT
Content-Type: text/plain; charset=utf-8
Content-Length: 0
location: https://pub.dev/
x-frame-options: SAMEORIGIN
x-xss-protection: 1; mode=block
x-content-type-options: nosniff
server: dart:io with Shelf
Via: 1.1 google
wwtssu commented 3 years ago

Working!!! 🙏

sudoaccess commented 3 years ago

uuuuuuuppppppp

ensaryusuf commented 3 years ago

@timsneath What was the source of the problem?

kxviel commented 3 years ago

pub.dev seems to be back, but pub get isn't working. UPDATE: pub get works now (India)

hjleesm commented 3 years ago

👍

Nanra commented 3 years ago

pub.dev is now accessible from Indonesia. Great 👍🏻✨

timsneath commented 3 years ago

Thank you all. I've posted a quick update at the top of this bug, but in summary services should now be restored. We've identified the root cause and increased the quota as a short-term measure until we roll back the offending code.

timnew commented 3 years ago

@timsneath is there a status page for pub.dev, like https://status.cloud.google.com for GCP? And will there be a public incident report for today's issue?

timsneath commented 3 years ago

Can't speak to the status page yet; we'll figure out the right mitigations during the post-mortem. I'm not sure it would have helped us much: the issue page seemed fairly effective at communicating status. But interested to hear from others.

Yes, we'll share the post-mortem summary. It should make for fun reading :) We operate a blameless post-mortem policy at Google; it's all about learning lessons rather than finding scapegoats. Any failure is a system failure, and we try and learn how we can address the system causes.

lookiestudio commented 3 years ago

pub.dev is now accessible from Vietnam.

vgsrivathsan commented 3 years ago

Socket error: pub get is showing a socket error

BytesZero commented 3 years ago

You're being blocked by the firewall.

vgsrivathsan commented 3 years ago

thanks!

themisir commented 3 years ago

Seriously, why didn't you cache YouTube calls for some period of time (e.g. release the cache a few times a day) from the beginning? Just wondering, seriously.

isoos commented 3 years ago

We do cache them, here is the related code with history: https://github.com/dart-lang/pub-dev/blob/master/app/lib/service/youtube/backend.dart

However, once the fetch failed with the quota limit, the error propagated up the chain until the isolate was killed and restarted, and with the restart we started to fetch it again. We will redesign/refactor this and similar background tasks so we can make sure such failures will not be propagated in the future.
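
One way to picture that kind of redesign is sketched below. It is an assumption-laden illustration in Python rather than the Dart code linked above. A background refresher catches its own failures, keeps serving the last good value, and backs off before retrying. The class `BackgroundCache`, its `fetch` argument, and the delay parameters are all hypothetical.

```python
class BackgroundCache:
    """Holds the last good value of a periodically refreshed resource."""

    def __init__(self, fetch, base_delay=60.0, max_delay=3600.0):
        self._fetch = fetch            # zero-argument callable (hypothetical)
        self._base_delay = base_delay  # seconds between successful refreshes
        self._max_delay = max_delay    # upper bound for the backoff
        self._failures = 0             # consecutive failed refreshes
        self.value = None              # last successfully fetched value

    def refresh(self):
        """Try one refresh; return the delay in seconds before the next try.

        A failure never escapes this method, so it cannot take down the
        worker that runs it; the previous value stays in place.
        """
        try:
            self.value = self._fetch()
            self._failures = 0
            return self._base_delay
        except Exception:
            self._failures += 1
            # Double the wait after each consecutive failure, up to the cap.
            return min(self._base_delay * 2 ** self._failures, self._max_delay)
```

A scheduler would call `refresh()` and sleep for the returned delay, so a quota error slows the polling down instead of triggering an immediate refetch after a restart.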

themisir commented 3 years ago

Oh I get it. :D Since the previous fetch had failed and the cache was empty, the restarted isolate tried to fetch new data again, which failed and crashed the isolate, GCE restarted it, and then the loop continued... Interesting failure.

Thanks for letting us know!

Levi-Lesches commented 3 years ago

@timsneath + Google team Want to reiterate that -- thanks for the quick response and clear communication. Maintenance is one of those things that goes unnoticed until it's a bad thing, but it makes us appreciate that pub.dev is otherwise 100% reliable and easy-to-use. ❤️

jonasfj commented 3 years ago

Postmortem is referenced here: https://github.com/flutter/flutter/wiki/Postmortems

xi1570-krupeshanadkat commented 2 years ago

It seems it is down again (region South Asia - India)

Screenshot 2021-11-29 at 10 35 44 AM

Attached chrome devtools > Network tab screenshots for reference.

My network seems fine, rest of the stuff is opening correctly.

Screenshot 2021-11-29 at 10 37 50 AM
xi1570-krupeshanadkat commented 2 years ago

Looks like it is working now!

Screenshot 2021-11-29 at 11 04 55 AM
manglide commented 2 years ago

It's still not working. Can't access pub.dev

isoos commented 2 years ago

@manglide: please open a new issue next time, we don't monitor closed issues.

As pub.dev is working for me, it is possible that the problem is on your ISP's side. Please run this script and report back its output: https://github.com/dart-lang/pub-dev/blob/master/app/bin/tools/check_domain_access.dart

manglide commented 2 years ago

Hi @isoos, thanks for your response. The issue is from my local dnsmasq configuration on mac. I have resolved it now and can access pub.dev.

Thanks once again.

luis901101 commented 1 year ago

Hi, is there a problem with this again right now?

isoos commented 1 year ago

@luis901101 There is an outage right now, we are aware and trying to fix. Also: please don't comment on old threads.

iamchathu commented 8 months ago

Is there any public status page for pub.dev?

sigurdm commented 8 months ago

@iamchathu no, we don't have such a page