datamade / scrapers-us-municipal

Scrapers for US municipal governments.
MIT License

Procedure for handling "cannot resolve" Sentry errors #24

Open reginafcompton opened 6 years ago

reginafcompton commented 6 years ago

Recently, we increased the level of logging to Sentry to help DataMade quickly identify data problems before the client does.

What should we do with the "cannot resolve pseudo id to Bill" warnings?
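For context, here is a hedged sketch of where these warnings originate, assuming the usual pupa event-scraper pattern (the meeting details below are made up; only the bill identifier comes from this thread):

```python
from pupa.scrape import Event

# Illustrative sketch only: how an event scraper ends up referencing a bill
# it has not itself scraped. Meeting details are hypothetical.
event = Event(name='Regular Board Meeting',
              start_date='2018-07-20',
              location_name='One Gateway Plaza, Los Angeles, CA')
item = event.add_agenda_item('General Public Comment')
item.add_bill('2018-0316')  # stored as a pseudo id, roughly ~{"identifier": "2018-0316"}

# At import time, pupa tries to resolve that reference against the bills
# already in the database. If the bill was never scraped -- e.g., because it
# is still marked "Not viewable" in Legistar -- resolution fails, and pupa
# logs the "cannot resolve pseudo id to Bill" warning that lands in Sentry.
```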

Potential step-by-step:

  1. Check whether the bills are in Legistar.
  2. If they are, add them to this issue: https://github.com/opencivicdata/scrapers-us-municipal/issues/241, then ignore the Sentry error (since it has been recorded in a Github issue).
  3. If they are not, they might be private bills that will become public. So, keep an eye on it? Contact Metro? Ignore the Sentry error? I am not sure.....

I am also not sure if we still need this level of logging, given that we’ll be aggressively scraping all bills every Friday.

hancush commented 6 years ago

+1 on reducing errors – it makes it very hard to use both sentry and the semaphor channel

hancush commented 6 years ago

i would like to propose that we devise a way of creating a digest of things we cannot resolve, and logging it in one place, i.e., https://github.com/opencivicdata/scrapers-us-municipal/issues/241, rather than logging each and every one of these instances as a sentry error.
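One hedged sketch of what that digest could look like, using a stock `logging` filter to swallow the individual warnings and accumulate them for a single summary (the message text to match and the handler wiring are assumptions about how pupa and the Sentry handler are set up):

```python
import logging

class UnresolvedDigest(logging.Filter):
    """Collect 'cannot resolve' warnings instead of emitting each one."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def filter(self, record):
        message = record.getMessage()
        if 'cannot resolve pseudo id' in message:  # assumed message text
            self.messages.append(message)
            return False  # drop the record, so it never reaches Sentry
        return True

# Attach the filter to whichever handler forwards records to Sentry --
# a stand-in stream handler here, for demonstration:
digest = UnresolvedDigest()
handler = logging.StreamHandler()
handler.addFilter(digest)
logging.getLogger().addHandler(handler)

# ... run the scrape/import, then report everything at once (e.g., by
# appending it to the tracking GitHub issue) instead of alerting per bill:
if digest.messages:
    summary = '\n'.join(sorted(set(digest.messages)))
    logging.getLogger(__name__).warning('unresolved pseudo ids:\n%s', summary)
```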

reginafcompton commented 6 years ago

I like the idea of a digest. It correlates with what @fgregg proposes in point 2 here. That is, we'd capture these unresolved bill errors and then scrape Metro for just those bills. We'd have a log of what that special scrape does, as opposed to Sentry errors.

fgregg commented 6 years ago

I think that we can move to a digest, or even reduce the level of logging, once we have an understanding of all the reasons why unresolved bills (and other things) appear. I don't think we are there yet.

hancush commented 6 years ago

@fgregg I definitely agree that we need to get to the bottom of this problem, but I'm not sure that needs to happen at the expense of one of our primary channels of communication. 10+ often redundant error notifications on every scrape is a lot, especially when the scrapes are happening at an increased frequency on Fridays. Coupled with pretty crappy search functionality in Semaphor, it becomes way too easy to lose track of conversations. Is there a way we can reconcile the log level with our communication needs? What about a separate channel for pupa errors?

fgregg commented 6 years ago

@hancush

  1. I think it's a great idea to split channels between conversation and logging.
  2. I'm not very concerned with redundancy of alerts if we are being alerted about things that are problems.

If we know that something is not a problem, is ignoring those events in sentry sufficient? If not, why not?

reginafcompton commented 6 years ago

I too like the idea of separate channels, but I don't think it's just a matter of distinguishing between conversation and logging, since the pupa "cannot resolve" errors stand to obscure other meaningful Councilmatic errors (e.g., from Miami, import_data, or Solr).

I'd rather see a separate channel for the Pupa errors entirely, and then preserve the Councilmatic channel as it has been in the past.

I also think we can ignore Pupa errors once (1) we've made a note of the error in a relevant Github issue (see above), or (2) we can absolutely identify the error as not a problem.

hancush commented 6 years ago

It may be that we have just not stemmed the tide of this class of error yet, but I muted at least 15 "cannot resolve" errors on Friday, and it felt like at least that many more came in with the next scrape to take their place. These felt urgent to resolve, because I knew the errors would just recur 20 minutes later and further clog the channel. I would estimate I spent about an hour on this quasi-urgent task and related context switching. I'm sure @reginafcompton lost some time on it, as well.

In summary, I do not feel that muting alone addresses the problem, because it is time consuming and – so far – less effective than I would like at keeping the notifications at bay. Perhaps the number of errors will be reduced when we've spent the time to mute them all; but it seems like by that point, not being notified at all would be the same solution, except it wouldn't cost us the hours.

To your point about redundancy, I would strongly prefer that alerts not be redundant. It becomes too easy to ignore them, and potentially miss a meaningful one. Moreover, we don't learn anything from redundant alerts, apart from that the error is still happening, which we can already assume, because we know it's often not self-resolving, and we haven't made a change to fix it.

fgregg commented 6 years ago

For the flooding issue, it seems like we can address that by changing the frequency of reporting to semaphor:

[screenshot: sentry, 2018-07-23]

In my opinion @evz should not move the civicpro scrapers to a separate repo, since different people have responsibility for addressing those.

We already have a councilmatic channel, where councilmatic errors should be located.

fgregg commented 6 years ago

I updated the semaphor rule so that a "warning or error" level issue will only be reported once per 24 hours. Critical errors will still be reported up to every 5 minutes.

reginafcompton commented 6 years ago

Right @fgregg - I meant "obscure other meaningful SCRAPER errors", not Councilmatic errors.

I think that Semaphor update will make a difference.

We also need to undo the change to LOGGING from Friday (https://github.com/datamade/scrapers-us-municipal/pull/25). I can do that this morning.

I am not sure, however, if we have an agreed upon step-by-step for dealing with these Pupa warnings. Does what I summarized above make sense? I think if we really want to understand the nature of these errors, then we'll need to think more about my suggested (2b).

reginafcompton commented 6 years ago

I checked today's batch of "cannot resolve" errors against Legistar: none of them were present in the API.

I propose that we make a consolidated list of these bills (we can take a look at the scraper logs to get past errors) and send it to Metro. We need their help to determine if these bills: (1) are private and will remain private (in which case no action from us is needed); (2) are private and will become public; (3) are something else....

Then, we can make a plan for resolution.

I can pull together a list today and send it to Metro.
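For reference, the check against Legistar can be scripted against the public web API. A rough sketch, assuming the Legistar web API conventions and the metro client name (the identifiers below are examples from this thread):

```python
import requests

BASE = 'https://webapi.legistar.com/v1/metro'  # assumed client name

def in_legistar(identifier):
    """Return True if a matter with this file number is visible in the API."""
    response = requests.get(
        BASE + '/matters',
        params={'$filter': "MatterFile eq '{}'".format(identifier)},
    )
    response.raise_for_status()
    return bool(response.json())

unresolved = ['2018-0312', '2018-0315', '2018-0316']  # example identifiers
missing = [i for i in unresolved if not in_legistar(i)]
print('not visible in the API:', missing)
```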

fgregg commented 6 years ago

Do we understand why we only saw them alerted today?

reginafcompton commented 6 years ago

I am not sure I understand your question @fgregg - can you say more?

fgregg commented 6 years ago

Is this the first time we got this alert from sentry? If so, why? We scrape the events every night, so shouldn't we have seen these before?

hancush commented 6 years ago

that's actually an interesting question, @fgregg – looking at the frequency of occurrence charts in sentry (check em out!), it looks like these recur, but not every night. (it's possible the reason for this is totally obvious and i'm just not in the scraper headspace.) in any case, the ones from today aren't new.

fgregg commented 6 years ago

i have a suspicion that this is worth figuring out.

reginafcompton commented 6 years ago

"Why did we not see these alerts more often?"

Forest turned up the volume on Pupa logging on June 22; I turned down the scraper volume from July 20-24. Sentry thus had 29 days to alert us about unresolved bills. However, according to our Semaphor chat, we periodically and a little haphazardly ignored (several, but not all) alerts for a period of time (e.g., for a week, until Monday, etc.) on July 5, 6, 13, 19, 20. This would explain why these bills do not have consistent daily alerts, for example: https://sentry.io/datamade/scrapers-us-municipal/issues/587748286/events/

[screenshot: sentry, 2018-07-26 10:33 AM]

A couple of inconsistencies – I see that some bills do not have alerts until later in June...why is that? https://sentry.io/datamade/scrapers-us-municipal/issues/591071227/events/ https://sentry.io/datamade/scrapers-us-municipal/issues/591071135/events/

For many bills, we did not get alerts on July 3 or 4 - were they ignored? (@hancush do you recall?)

reginafcompton commented 6 years ago

Coming to terms with the Pupa errors

Shelly gave us terrific information about some of these unresolved bills. (I gave her a large sample to look into.) Given this information and what we learned in this issue, we can distinguish four types of bills that raise the "cannot resolve" error:

  1. 2015-**** bills referenced in agendas that Metro created in April and May 2015 – these are "practice" entries and do not have finalized agendas in Legistar.
  2. General Public Comments reports from May 2018 (2018-0316, 2018-0315, 2018-0312). Agendas referenced these bills while Metro was using the commenting system in beta.
  3. Newly created bills that remain private until the agenda is ready.
  4. Other bills that the scraper misses for sundry reasons to be determined. We know about two of these.

Actionable steps

I am most concerned about classes (3) and (4), since these have caused issues in the past. So far, we've confronted this problem by aggressively scraping all bills on Fridays. However, this strategy increases the bill import time (from a maximum of 30 minutes to 45 minutes), since it takes about 22 minutes for the scraper to grab all bills. There are alternative, more efficient strategies.

In the short term, I prefer a windowed scrape of bills from the last year, since it's an easy adjustment.
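One way to implement that windowed scrape (a minimal sketch against the Legistar web API; the endpoint and field names are assumptions, and the real scraper would go through its own client):

```python
import datetime
import requests

BASE = 'https://webapi.legistar.com/v1/metro'  # assumed client name

def recently_modified_matters(days=365):
    """Fetch only matters touched within the window, instead of every bill."""
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=days)
    response = requests.get(
        BASE + '/matters',
        params={'$filter': "MatterLastModifiedUtc gt datetime'{}'".format(
            cutoff.strftime('%Y-%m-%dT%H:%M:%S'))},
    )
    response.raise_for_status()
    return response.json()
```

Note that a window keyed on MatterLastModifiedUtc can still miss bills that become public via agenda publication, per the findings in the next comment.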

Ideally, I would like our scrapers to have access to private bills. Why? Then Pupa errors will carry greater meaning, whereas now, we just get a flood of errors on certain Fridays and think, "oh well, these must be private bills that will soon become public....la-te-dah."

reginafcompton commented 6 years ago

Metro tested switching bills from private to public using a few techniques. I outlined the results of those tests here.

Specifically, I learned two meaningful pieces of information:

(1) Publishing an agenda does not change the timestamp of the "Not viewable" bills, to which the agenda refers. The bills become public, but their MatterLastModifiedUtc remains unchanged. This confirms what we already suspected.

(2) Manually unchecking the "Not viewable" box for a bill does change the MatterLastModifiedUtc timestamp.


Next steps

With this knowledge, we have a few options, though one seems better than the others.

I think our best option is to write some code that scrapes bills related to newly published agendas, something like:

  1. find all the events with a newly published agenda (i.e., using the EventAgendaLastPublishedUTC field);
  2. iterate over each event's event items;
  3. scrape the bills referenced in those items.

This logic could reside in the LAMetro bills scraper, though we could make some changes further upstream (assuming that this problem affects NYC and Chicago?).
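A hedged sketch of that flow against the Legistar web API (the client name, endpoints, and field names are assumptions based on Legistar API conventions; in practice this would live in the scraper's existing API client):

```python
import datetime
import requests

BASE = 'https://webapi.legistar.com/v1/metro'  # assumed client name

def bills_from_new_agendas(since):
    """Yield ids of matters referenced by agendas published after `since`."""
    # 1. find events whose agendas were published after the last scrape
    events = requests.get(
        BASE + '/events',
        params={'$filter': "EventAgendaLastPublishedUTC gt datetime'{}'".format(
            since.strftime('%Y-%m-%dT%H:%M:%S'))},
    ).json()

    for event in events:
        # 2. iterate over each event's event items
        items = requests.get(
            '{}/events/{}/eventitems'.format(BASE, event['EventId'])).json()
        for item in items:
            # 3. collect the bills referenced in the items, for a targeted scrape
            if item.get('EventItemMatterId'):
                yield item['EventItemMatterId']

# e.g., re-scrape anything referenced by agendas published in the last day:
for matter_id in bills_from_new_agendas(
        datetime.datetime.utcnow() - datetime.timedelta(days=1)):
    print(matter_id)
```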

fgregg commented 6 years ago

So getting access to private bills is off the table?

hancush commented 6 years ago

could we check the agendalastmodified date for bills, like we do for events, in the python-legistar scraper?

edit: oh, haha, bills don't have agendas..... NEVER MIND ME.

reginafcompton commented 6 years ago

@fgregg - Omar is looking into it. Let's wait for his reply before acting on anything.

reginafcompton commented 6 years ago

From Metro:

'Unfortunately, we don’t know of a way to give the scraper access to the “Not Viewable on Insite" reports. Omar has asked Granicus about this in the past, and received back either “we’ll look into it” or no response at all.'