403 when locally hosted cc-index-server tries to connect to s3://commoncrawl/

commoncrawl / cc-index-server

Common Crawl Index Server

http://index.commoncrawl.org/


403 when locally hosted cc-index-server tries to connect to s3://commoncrawl/ #11

Open davetbo-amzn opened 1 year ago

davetbo-amzn commented 1 year ago

Whenever I do a search on the local cc-index-server I get errors. When I look at the debug logs, it looks like the final authorization is only using the access key ID and the secret, but not the session token.

Is this only designed to work with long-term IAM user creds, or does it also support short-term creds? If I were to edit the file that builds that Authorization header, where would I find it? I searched the code globally for Authorization, access_key, and access, excluding the cluster.idx files, and found nothing that matched.

I'd be happy to contribute the fix for supporting short-term creds if you help me find where the fix goes in your code.
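For anyone debugging the same symptom, a minimal sketch of a check for a missing session token follows. It assumes the credentials come from the standard AWS environment variables, and it relies on the documented convention that temporary access key IDs start with ASIA while long-term IAM user keys start with AKIA. The function name describe_credentials is hypothetical and is not part of cc-index-server or pywb.

```python
import os


def describe_credentials(env=None):
    """Classify AWS credentials found in an environment mapping.

    Hypothetical helper for illustration; not part of cc-index-server.
    """
    env = os.environ if env is None else env
    key_id = env.get("AWS_ACCESS_KEY_ID", "")
    secret = env.get("AWS_SECRET_ACCESS_KEY", "")
    token = env.get("AWS_SESSION_TOKEN", "")

    if not key_id or not secret:
        return "no credentials"
    if key_id.startswith("ASIA"):
        # Temporary (STS) credentials are only valid together with the token.
        return "short-term" if token else "short-term, session token missing"
    return "long-term"


if __name__ == "__main__":
    print(describe_credentials())
```

A result of "short-term, session token missing" would be consistent with the behaviour described above, where the signed request carries only the access key ID and the secret.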

sebastian-nagel commented 1 year ago

Could you try the branch pywb2?

It's based on PyWB2, or rather on a modified fork (commoncrawl/pywb, branch common-crawl-cdx-index) which tries to stay 100% API compatible with older versions of CC's index server based on PyWB 1.x.
This version includes webrecorder/pywb#723, which I hope will also cover short-term creds.

Apologies, I hoped to finalize the version based on PyWB2 but haven't had the time yet. There's also some work to do:

need to rebase on a more recent PyWB2 version
the lazy instantiation of the S3 client in the S3 loader might need some more improvements: if the creation of the S3 client or the get_object call fails, the client is created again and again. That is not nice and may cause trouble, because any error-handling logic implemented in the client (e.g., an exponential back-off on "503 Slow Down" responses) cannot work when the state held in the client is lost. In addition, in the specific case of s3://commoncrawl/, the fall-back of instantiating a client with unauthenticated access is useless anyway.
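The concern in the second item can be illustrated with a minimal sketch: build the client once, keep it, and do not rebuild it after a failed request, so that any retry state inside the client survives. The class LazyClient and its factory argument are hypothetical names for illustration; this is not the pywb S3 loader.

```python
import threading


class LazyClient:
    """Create a client on first use and reuse it afterwards."""

    def __init__(self, factory):
        self._factory = factory  # callable that builds the real client
        self._client = None
        self._lock = threading.Lock()

    def get(self):
        # Double-checked locking: build the client at most once.
        if self._client is None:
            with self._lock:
                if self._client is None:
                    self._client = self._factory()
        return self._client


if __name__ == "__main__":
    created = []

    def factory():
        created.append(object())
        return created[-1]

    holder = LazyClient(factory)
    assert holder.get() is holder.get()
    print(len(created))
```

In this sketch a failed request never discards the stored client. If the factory itself raises, nothing is stored and the next call to get() tries the creation again.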

davidtbo commented 1 year ago

I discovered that it works with long-term user creds but not with short-term ones. When I got it working, I realized it still didn't pull back the actual content from the WARC files for me; it just gave me the same index info I already had working through Athena.

So then I had to go back to figure out extracting the gzip data from the S3 warc files myself. I discovered that my blocker there was that I was doing:

start_byte = int(row[22])
end_byte = start_byte + int(row[23])  # either this or the previous line should have had a -1 in it to shift to zero as the first byte.

And then I finally found an example somewhere where someone added the - 1 to that equation. After that, I could successfully extract and decompress with gzip.

Since I've solved my issue and I don't currently have time to stop and troubleshoot further, I'll have to stop with the feedback that the root cause is that the current code doesn't support short-term creds.

sebastian-nagel commented 1 year ago

> with the feedback that the root cause is that the current code doesn't support short-term creds.

Thanks for the feedback. I'll have a look, but it may take some time.

> should have had a -1

The byte range is inclusive, so it runs from offset to (offset + length - 1).
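The off-by-one can be made concrete with a small sketch. It assumes the index supplies a record's offset and length in bytes, builds the inclusive HTTP Range header value, and checks on in-memory data that the resulting slice is exactly one gzip member. The helper names byte_range and range_header and the sample payloads are made up for illustration.

```python
import gzip


def byte_range(offset, length):
    """Return (first, last) byte positions, both inclusive."""
    return offset, offset + length - 1


def range_header(offset, length):
    """Format the inclusive range as an HTTP Range header value."""
    first, last = byte_range(offset, length)
    return f"bytes={first}-{last}"


if __name__ == "__main__":
    # Two gzip members back to back, like records in a WARC file.
    rec1 = gzip.compress(b"first record")
    rec2 = gzip.compress(b"second record")
    blob = rec1 + rec2

    offset, length = len(rec1), len(rec2)
    first, last = byte_range(offset, length)
    # Python slices exclude the end, so add 1 to the inclusive last byte.
    member = blob[first:last + 1]
    print(range_header(offset, length))
    print(gzip.decompress(member))
```

The same header value can be passed as the Range of an S3 or HTTP GET request for the WARC file.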

For bulk look-ups the columnar index is more efficient, see here. A user even wrote a tutorial on how to automate fetching the WARC records using AWS Lambda.

davidtbo commented 1 year ago

In that second link, it looks like we can get event-based triggers from the commoncrawl bucket. Is that the case? If so, that's super helpful. I was assuming we couldn't because it was in their account.

However, reading it again, it might just be subscribing to the Athena results landing in my bucket when I do the query for the web pages I want from the index.

sebastian-nagel commented 1 year ago

Yes, that's also my understanding: the appearance of a query result file (not on s3://commoncrawl/) triggers the download of all referenced WARC records stored on s3://commoncrawl/.
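A rough sketch of that flow, assuming a Lambda-style handler receives a standard S3 event notification for the Athena result file, and that the query selected the columnar index columns warc_filename, warc_record_offset and warc_record_length. The function names are hypothetical, and the actual download from s3://commoncrawl/ is left out.

```python
import csv
import io
from urllib.parse import unquote_plus


def result_objects(event):
    """Yield (bucket, key) for each object named in an S3 event."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def warc_ranges(csv_text):
    """Yield (warc_filename, Range header value) per result row."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        offset = int(row["warc_record_offset"])
        length = int(row["warc_record_length"])
        # Inclusive range: offset to offset + length - 1.
        yield row["warc_filename"], f"bytes={offset}-{offset + length - 1}"
```

The event only ever refers to the bucket that receives the query results; the WARC files themselves are then read from s3://commoncrawl/ with the computed ranges.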