Return structured data for JS source file and source map Scraping attempts

lforst commented 9 months ago

For our Source Map Debugger (Blue Thunder Edition™) initiative we require information from Symbolicator about the JS source scraping attempts that have been conducted.

Solution brainstorm

Returned payload (could be nested somewhere):

interface ReturnedPayload {
  symbolication_scraping_attempts: {
    source_file_url: string;
    source_file_scraping:
      | { result: 'not-attempted'; reason_code: number }
      | { result: 'error'; error_code: number; scraping_http_status: number }
      | { result: 'success'};
    source_map_url: string;
    source_map_scraping:
      | { result: 'not-attempted'; reason_code: number}
      | { result: 'error'; error_code: number; scraping_http_status: number }
      | { result: 'success' };
  }[];
}

// Possible error_code Mappings:
// 1: "Unknown"
// 2: "Not found"
// 3: "File too big"
// ... TBD

// Possible reason_code Mappings:
// 1: "Unknown"
// 2: "Already have source through other process"
// ... TBD

(TypeScript notation because it's the only language I know 😶‍🌫️)

loewenheim commented 9 months ago

I would guess that we also want a "forbidden" or "disabled" result variant for when scraping is disabled.

ETA: Actually I believe there are really two reasons for "not attempted": either we already have it or scraping is disabled. Consequently we should split up "not attempted" into two results and remove the reason.

loewenheim commented 9 months ago

Also: does your data intentionally only include the URL of the source file, but not that of the sourcemap?

lforst commented 9 months ago

I would guess that we also want a "forbidden" or "disabled" result variant for when scraping is disabled.

I think "forbidden" can fall under error and "disabled" under not-attempted. (With corresponding codes of course.)

Also: does your data intentionally only include the URL of the source file, but not that of the sourcemap?

Ah, thanks. Good catch! Adjusted.

loewenheim commented 9 months ago

I think "forbidden" can fall under error and "disabled" under not-attempted. (With corresponding codes of course.)

Sorry, I intended "forbidden" and "disabled" to be the same thing, I just hadn't decided what the better name was yet (it's probably "disabled").

loewenheim commented 9 months ago

I opened https://github.com/getsentry/symbolicator/pull/1311 as a strawman implementation so that we have something to discuss.

The CompletedJsSymbolicationResponse now contains a list of "scraping attempts". The definition of these scraping attempts is

pub struct JsScrapingAttempt {
    pub url: String,
    pub result: JsScrapingResult,
}

pub enum JsScrapingResult {
    NotAttempted,
    Success,
    Failure {
        reason: JsScrapingFailureReason,
        details: String,
    },
}

pub enum JsScrapingFailureReason {
    NotFound,
    Disabled,
    InvalidHost,
    PermissionDenied,
    Timeout,
    DownloadError,
    Other,
}

So each scraping attempt has a url and a result. The result can be NotAttempted, Success, or Failure. Success and NotAttempted contain no further information (NotAttempted just means that we got the file another way), but Failure has an additional Reason and possibly details.

Here is an example from out test fixtures (written in YAML, but you'd be getting it as JSON):

scraping_attempts:
  - url: "http://example.com/index.html"
    result:
      Failure:
        reason: Disabled
  - url: "http://example.com/test.min.js"
    result: NotAttempted
  - url: "http://example.com/test.min.js.map"
    result: NotAttempted

This differs from your design above in a few respects.

We don't bundle the scraping attempts for a source file and its sourcemap together. This could be done, but it would complicate the code a fair bit. You can still associate them because every frame in the response contains the sourcemap URL.
The details of the failure cases are different. We return a simple "reason" and possibly "details" instead of an error code and the http status. The error codes could certainly still be done, but getting the http status would need significant changes to our downloader.

Are these differences acceptable to you? Is there anything you would like to see changed?

lforst commented 9 months ago

@loewenheim Thank you! This is great <3

I have a few thoughts.

You can still associate them because every frame in the response contains the sourcemap URL.

Is that the case in the event payload we store in eventstore? Would we have to also adjust Sentry to put the sourcemap URL into the stack frame object?

We return a simple "reason" and possibly "details" instead of an error code and the http status.

HTTP status isn't too important I would say. I have reservations about the returned payload schema though. IMO it would be cleaner to always return an object in the result field. To illustrate, here is what I suggest we change the example you gave to:

scraping_attempts:
  - url: "http://example.com/index.html"
    result:
        status: Failure
        reason: Disabled
  - url: "http://example.com/test.min.js"
    result:
        status: NotAttempted
  - url: "http://example.com/test.min.js.map"
    result:
        status: NotAttempted

That way we don't have to check for the type of result and can basically assume result.status is a string and result.reason is an optional string. This is obviously up for debate.

Additionally, I would probably return the strings in a bit more pythonian manner, e.g. not_attempted instead of NotAttempted and so on. This is a bit whatever.

In general I think we are on the right track! My main concern lies in whether we can actually properly associate source map scraping attempts to the correct stack frame.

loewenheim commented 9 months ago

@loewenheim Thank you! This is great <3

I have a few thoughts.

You can still associate them because every frame in the response contains the sourcemap URL.

Is that the case in the event payload we store in eventstore? Would we have to also adjust Sentry to put the sourcemap URL into the stack frame object?

The information already exists in the event JSON you can view on Sentry, if that's what you mean.

We return a simple "reason" and possibly "details" instead of an error code and the http status.

HTTP status isn't too important I would say. I have reservations about the returned payload schema though. IMO it would be cleaner to always return an object in the result field. To illustrate, here is what I suggest we change the example you gave to:
scraping_attempts:
  - url: "http://example.com/index.html"
    result:
        status: Failure
        reason: Disabled
  - url: "http://example.com/test.min.js"
    result:
        status: NotAttempted
  - url: "http://example.com/test.min.js.map"
    result:
        status: NotAttempted
That way we don't have to check for the type of result and can basically assume result.status is a string and result.reason is an optional string. This is obviously up for debate.

That sounds like a good idea and would be easy to change.

Additionally, I would probably return the strings in a bit more pythonian manner, e.g. not_attempted instead of NotAttempted and so on. This is a bit whatever.

You're right, I'll change it to snake_case.

In general I think we are on the right track! My main concern lies in whether we can actually properly associate source map scraping attempts to the correct stack frame.

I'll send you an example event that I believe demonstrates that we can.

loewenheim commented 9 months ago

What do you think of this?

scraping_attempts:
  - url: "http://example.com/index.html"
    status: failure
    reason: disabled
  - url: "http://example.com/embedded.js"
    status: not_attempted
  - url: "http://example.com/embedded.js.map"
    status: not_attempted

status is always present, reason and details may be present for failures. We can also nest it inside a result the way you suggested, if you'd prefer that. Fortunately this serialization logic is very flexible.

We could also do the scraping_attempts as a map from URLs to results instead of a list. What do you think?

lforst commented 9 months ago

I'll send you an example event that I believe demonstrates that we can.

Based on our discussion that data.sourcemap is available inside stackframe objects, having source files and sourcemaps separate sounds good to me!

status is always present, reason and details may be present for failures. We can also nest it inside a result the way you suggested, if you'd prefer that. Fortunately this serialization logic is very flexible.

No this seems great! Let's do it exactly like that :)

We could also do the scraping_attempts as a map from URLs to results instead of a list. What do you think?

I think a list is fine. It keeps us more flexible down the line. Of course, there is now the cost of having to iterate but the size is bounded anyhow so 🤷 No strong opinions. If we don't have a clear favorite I would opt for a list to be consistent with other mechanisms we have (debug_meta).

loewenheim commented 9 months ago

I think a list is fine. It keeps us more flexible down the line. Of course, there is now the cost of having to iterate but the size is bounded anyhow so 🤷 No strong opinions. If we don't have a clear favorite I would opt for a list to be consistent with other mechanisms we have (debug_meta).

Makes sense, I'll leave it as a list then.

getsentry / symbolicator

Return structured data for JS source file and source map Scraping attempts #1300

Solution brainstorm