Element84 / earth-search

Earth Search information and issue tracking
https://earth-search.aws.element84.com/v1

Ambiguous error message when the response is too large for a lambda #17

Closed: idantene closed this issue 8 months ago

idantene commented 12 months ago

Hey!

Following #583, I've migrated to using Earth Search v1. Shortly after migrating, I noticed our runs frequently fail with pystac_client.exceptions.APIError: {"message": "Internal server error"}. Following pystac-client's guide, I have since added a Retry mechanism with a very generous backoff factor of 10, and I still observe many 502 responses.
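
For context, the retry setup is something along these lines (a minimal sketch assuming pystac-client's documented StacApiIO and urllib3 Retry hooks, not our exact code; the numbers are illustrative):

from urllib3 import Retry
from pystac_client import Client
from pystac_client.stac_api_io import StacApiIO

# retry transient gateway errors with a generous backoff
retry = Retry(total=5, backoff_factor=10, status_forcelist=[502, 503, 504])
stac_io = StacApiIO(max_retries=retry)
client = Client.open("https://earth-search.aws.element84.com/v1", stac_io=stac_io)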

I might've missed something in the migration, but the Python stack currently consists of pystac-client, odc-stac, and dask, all up to date. This didn't happen with v0, so I'm wondering whether there are additional configurations to add/use, or any rate limits one should observe.

gadomski commented 12 months ago

To help us investigate, can you provide more information about the errors and when they're happening? Including:

Thanks!

idantene commented 12 months ago

Sure, and thanks for the guidelines!

gadomski commented 12 months ago

> It's a bit hard for me to pinpoint if it happens exactly on the same items, as we parallelize our main bounding box into smaller chunks for faster processing. So I can't just say at this point.

This could be part of the issue. If possible, could you reduce/remove the parallelization and see if the problems go away? It feels like you're getting throttled, but I haven't been able to confirm that from my side yet.

idantene commented 12 months ago

That was my first intuition too, which is why I massively reduced the parallelization. It still crashed, albeit more slowly.

I'd be surprised to find a throttling issue:

I'll try to pinpoint potential tiles. My other guess is that this could be a temporal issue: we're accessing quite recent datasets (e.g. the 27th-30th of August) as our temporal cutoff date. Perhaps these are not ready yet and are causing some errors?

gadomski commented 12 months ago

Okay, that's possible, though a bad gateway would be surprising for missing data. Thanks for looking into this, I'll keep this open and keep checking things out on our end.

idantene commented 12 months ago

Thanks @gadomski!

One last thing - I'm now using the same code for the Sentinel-2 L2A data in a different task, and it works fine with heavy parallelization (about 100 requests sent at the same time) - no crashes whatsoever.
The areas partially overlap, but the temporal ranges are very different (this one only covers a ~2.5-month interval, and only as recent as June, not August). It could be my own bias by now, but I have a strong feeling that the cutoff date is the issue here.

I'll give it another shot with a slightly less recent cutoff and report back.

idantene commented 12 months ago

Just reporting back that a mid-August cutoff, at least, did not change the behaviour. I'm now trying to pinpoint whether this is caused by a specific bounding box - that might take a while to find.

idantene commented 12 months ago

My apologies for the fast update @gadomski, and thank you for your fast feedback loop :)

I have identified a sample bounding box that reproduces this issue locally with the stac-client CLI:

stac-client search https://earth-search.aws.element84.com/v1 \
-c sentinel-2-l2a \
--bbox 23.993241082221626 59.88129918704023 24.10515372987819 59.93750368732248 \
--datetime 2018-01-01/2023-08-31

Yields:

{"message": "Internal server error"}
Traceback (most recent call last):
  File ".../pystac_client/cli.py", line 334, in cli
    return search(client, **args)
  File ".../pystac_client/cli.py", line 48, in search
    feature_collection = result.item_collection_as_dict()
  File ".../pystac_client/item_search.py", line 782, in item_collection_as_dict
    for page in self.pages_as_dicts():
  File ".../pystac_client/item_search.py", line 732, in pages_as_dicts
    for page in self._stac_io.get_pages(
  File ".../pystac_client/stac_api_io.py", line 304, in get_pages
    page = self.read_json(link, parameters=parameters)
  File ".../pystac/stac_io.py", line 205, in read_json
    txt = self.read_text(source, *args, **kwargs)
  File ".../pystac_client/stac_api_io.py", line 162, in read_text
    return self.request(
  File ".../pystac_client/stac_api_io.py", line 217, in request
    raise APIError.from_response(resp)
pystac_client.exceptions.APIError: {"message": "Internal server error"}

For reference, v0 works fine (of course, it only goes up to May 6th, as we recently observed):

stac-client search https://earth-search.aws.element84.com/v0 \
    -c sentinel-s2-l2a-cogs \
    --add-conforms-to ITEM_SEARCH \
    --bbox 23.993241082221626 59.88129918704023 24.10515372987819 59.93750368732248 \
    --datetime 2018-01-01/2023-08-31 | jq '.features | length'

> 2086

Finally, I believe it must also be something about recent updates to the catalog. The same query with an end-of-July cutoff works fine:

stac-client search https://earth-search.aws.element84.com/v1 \
-c sentinel-2-l2a \
--bbox 23.993241082221626 59.88129918704023 24.10515372987819 59.93750368732248 \
--datetime 2018-01-01/2023-07-31 | jq '.features | length'

> 2448

In fact, I've tried it in 5-day increments and it works consistently (for this bounding box) up until the 27th of August, after which it produces the internal server error. That gives me the feeling that the catalog is currently being updated for this specific location, perhaps? If that's the case, it would be great to learn e.g. how long it takes to catalog a new acquisition, and whether the API/client could return a more verbose status than an internal server error.
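
For reference, the increment test was roughly along these lines (a hypothetical reconstruction rather than my exact script; the end dates are only illustrative):

from pystac_client import Client
from pystac_client.exceptions import APIError

BBOX = [23.993241082221626, 59.88129918704023, 24.10515372987819, 59.93750368732248]
client = Client.open("https://earth-search.aws.element84.com/v1")
for end in ("2023-08-12", "2023-08-17", "2023-08-22", "2023-08-27", "2023-08-31"):
    search = client.search(
        collections=["sentinel-2-l2a"], bbox=BBOX, datetime=f"2018-01-01/{end}"
    )
    try:
        # paging through all items is what triggers the failure for the later cutoffs
        print(end, len(search.item_collection()))
    except APIError as error:
        print(end, "failed:", error)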

gadomski commented 12 months ago

Thanks for the great report, and for digging in! While I can reproduce your error using the 2018-to-present search, I don't quite agree with your diagnosis. For example, the following request succeeds:

$ stac-client search https://earth-search.aws.element84.com/v1 \                                                                                                                                                                                                                 
    -c sentinel-2-l2a \
    --bbox 23.993241082221626 59.88129918704023 24.10515372987819 59.93750368732248 \
    --datetime 2019-01-01/2023-08-31 | jq '.features | length'
2178

This request also succeeds:

$ stac-client search https://earth-search.aws.element84.com/v1 \
    -c sentinel-2-l2a \
    --bbox 23.993241082221626 59.88129918704023 24.10515372987819 59.93750368732248 \
    --datetime 2018-01-01/2019-01-01 | jq '.features | length'
300

So, my guess is that pystac-client is getting throttled while paging through all of the items. Curiously, this succeeds (note the --limit):

$ stac-client search https://earth-search.aws.element84.com/v1 \
    -c sentinel-2-l2a \
    --bbox 23.993241082221626 59.88129918704023 24.10515372987819 59.93750368732248 \
    --datetime 2018-01-01/2023-08-31 \
    --limit 10 | jq '.features | length'
2478

So my guess right now is that things are being throttled at the backend (~elastic~opensearch) -- the --limit reduces the number of items fetched per request, easing the load on the backend at the cost of more requests made through the API. That's just a guess, though.

FWIW the default page size is 100:

$ python -c "import json; from pystac_client import Client; print(json.dumps(next(Client.open('https://earth-search.aws.element84.com/v1').search(collections=['sentinel-2-l2a']).pages_as_dicts())))" | jq ".features | length"
100

And setting --limit to 1000 causes a fast failure.

idantene commented 12 months ago

Ah! Great find! So you think the throttling is on a per-item basis (rather than per machine/IP)? That would make sense, though I have to wonder why nearby bounding boxes (of the same size) pass without failing?

I'll give it a go on my end too.

gadomski commented 12 months ago

Yeah, I'm curious about that too -- maybe the failing bounding box intersects more tiles, so is grabbing more items from the backend than other bboxes 🤷🏼? I'm asking some more questions on my end as well to try to understand the throttling behavior better.
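
If you want to check that hypothesis on your side, something like this should do it (just a sketch; I'm assuming the grid:code property that Earth Search's sentinel-2-l2a items carry):

from pystac_client import Client

BBOX = [23.993241082221626, 59.88129918704023, 24.10515372987819, 59.93750368732248]
client = Client.open("https://earth-search.aws.element84.com/v1")
search = client.search(
    collections=["sentinel-2-l2a"], bbox=BBOX, datetime="2023-01-01/2023-06-30", limit=50
)
# count the distinct MGRS grid squares contributing items to this bbox
tiles = {item.properties.get("grid:code") for item in search.items()}
print(len(tiles), sorted(t for t in tiles if t))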

gadomski commented 12 months ago

Okay, at least part of the issue is a 6 MB cap on responses from a Lambda, as described here: https://github.com/stac-utils/stac-server/issues/116. So that's why my --limit 1000 example fails. It still doesn't quite explain why the 2018-to-present query fails but succeeds when broken into chunks.

But, to be safe(r), you could reduce the limit from its default of 100 -- maybe to 50?
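
In pystac-client terms that would look something like this (a sketch of the search from this thread with a smaller page size; nothing here beyond the standard search parameters):

from pystac_client import Client

client = Client.open("https://earth-search.aws.element84.com/v1")
search = client.search(
    collections=["sentinel-2-l2a"],
    bbox=[23.993241082221626, 59.88129918704023, 24.10515372987819, 59.93750368732248],
    datetime="2018-01-01/2023-08-31",
    limit=50,  # page size per request: smaller responses, but more requests
)
print(len(search.item_collection()))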

idantene commented 12 months ago

I see, that makes sense then! I guess most responses are somehow right at the upper limit anyway.

I will even try reducing it down to 20 for future compatibility (I'd rather it take a bit longer to iterate over the pages, since the runtime is dominated by other computation anyway).

I'll report back, as usual 😉

gadomski commented 12 months ago

I can confirm that it's a too-large response issue:

from requests import Session

with Session() as session:
    # a limit of 99 keeps the response just under the ~6 MB Lambda cap; 100 exceeds it
    for limit in (99, 100):
        body = {
            "datetime": "2018-01-01T00:00:00Z/2023-08-31T23:59:59Z",
            "collections": ["sentinel-2-l2a"],
            "next": (
                "2019-04-03T10:06:01.465000Z," "S2B_34VFM_20190403_0_L2A,sentinel-2-l2a"
            ),
            "bbox": [
                23.993241082221626,
                59.88129918704023,
                24.10515372987819,
                59.93750368732248,
            ],
            "limit": limit,
        }
        response = session.post(
            "https://earth-search.aws.element84.com/v1/search", json=body
        )
        if response.ok:
            print(
                f"limit={limit}, status_code={response.status_code}, "
                f"response_size={len(response.content) / 1e6} MB"
            )
        else:
            print(
                f"limit={limit}, status_code={response.status_code}, "
                f"error={response.reason}"
            )

Output:

limit=99, status_code=200, response_size=5.963706 MB
limit=100, status_code=502, error=Bad Gateway

~I'll open an internal ticket to reduce the default page size so users don't accidentally hit this problem when doing simple requests.~ Thanks for the report and the debugging!

EDIT: Turns out it's not (completely) a server issue; it's pystac-client that is setting its own default limit. I'm going to remove that behavior over there, and we'll still look into providing a better error message when the response is too big for the Lambda.

idantene commented 12 months ago

That makes sense, and I can also confirm that these errors no longer show when I set the limit to 20.

Great stuff, thanks for debugging on your end!

idantene commented 12 months ago

@gadomski I see in your PR that the suggestion is to remove the default limit altogether. Does the backend actually determine the ideal page size on its own, depending on the incoming request parameters? Is there a way to verify this works (and the returned page size) before the PR is merged?

I'm trying to think ahead and decide what's safer on our end: setting a low page size (e.g. 20 as above) or setting it to None now so that the server decides.

gadomski commented 12 months ago

> Does the backend actually determine the ideal page size on its own, depending on the incoming request parameters?

Not in my experience, but presumably the backend is configured with a "reasonable" default page size for the datasets it's hosting. The client has no knowledge of the backend architecture, so (in my opinion) it shouldn't try to guess what the correct setting is.

> Is there a way to verify this works (and the returned page size) before the PR is merged?

We have unit tests in pystac-client that hit both Earth Search and Planetary Computer (two of the largest STAC servers) and those seemed happy enough. I did ask for a review from someone on Planetary Computer, as it is a pretty opinionated change that I want to be careful with.
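
If you want to spot-check it yourself, something like this shows the page size that actually comes back for a search with no explicit limit (a sketch reusing the pages_as_dicts call from earlier in this thread; before the PR it will reflect pystac-client's injected default, afterwards the server's):

from pystac_client import Client

client = Client.open("https://earth-search.aws.element84.com/v1")
search = client.search(collections=["sentinel-2-l2a"])  # no explicit limit
first_page = next(search.pages_as_dicts())
print(len(first_page["features"]))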

> I'm trying to think ahead and decide what's safer on our end: setting a low page size (e.g. 20 as above) or setting it to None now so that the server decides.

🤷🏼 I don't know if I have a strong opinion on this one -- there's a balance between "we're making too many requests" and "the responses are too big." Our exercise in this ticket seems to indicate that, at least for sentinel-2 on Earth Search, a 20-80 item page should be fine.

idantene commented 12 months ago

Fair enough - thanks for the detailed response. I'll leave it at 20 for the time being then.

Big thanks @gadomski for your help!

gadomski commented 11 months ago

FYI the new release of pystac-client removes the default limit: https://github.com/stac-utils/pystac-client/releases/tag/v0.7.4.