Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0

S3 UFS issues much larger reads to the UFS than needed #14246

Open idanz opened 2 years ago

idanz commented 2 years ago

Alluxio Version: 2.6.2

Describe the bug When reading Parquet files from a cold cache with an S3 UFS, we see huge bandwidth coming out of the object store. I believe this is because the S3 request is built with a starting offset but no end range, so it reads as much as it can until the HTTP request is aborted. It would be better to start the HTTP request only after the read call happens and to apply a reasonable "end range" based on the read request, or at least to start with a configurable "readahead" like S3A does.
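To illustrate the kind of bounded request I have in mind, here is a rough sketch against the AWS SDK for Java v1 (the bucket, key, offset, and readahead values are placeholders, not Alluxio code):

    // Sketch only; classes from com.amazonaws.services.s3, com.amazonaws.services.s3.model, java.io
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    long offset = 1_000_000L;    // where the caller actually wants to read
    int readahead = 64 * 1024;   // a configurable upper bound, like S3A's readahead
    GetObjectRequest req = new GetObjectRequest("my-bucket", "path/to/file.parquet")
        .withRange(offset, offset + readahead - 1);  // bounded: Range: bytes=offset-(offset+readahead-1)
    try (S3Object object = s3.getObject(req);
         InputStream in = object.getObjectContent()) {
      // read only what was asked for; the HTTP response ends at the range end
    }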

To Reproduce

  1. Set up an S3 UFS with a large file
  2. Read only 100 bytes from the file via Alluxio
  3. Observe that the traffic from S3 is much greater than 100 bytes

Expected behavior S3 reads should be no larger than required.

Urgency This is a blocker for using the S3 UFS, but we can try other things.

yuzhu commented 2 years ago

thanks @idanz

yuzhu commented 2 years ago

@ggezer PTAL

yuzhu commented 2 years ago

So some read amplification is expected, because Alluxio asynchronously caches at the block level, but it seems like we are also reading more than necessary even when just reading the original stream. @beinan could you take a look please?
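For scale (assuming the default 64MB block size): a cold 100-byte read that triggers async caching of its block pulls in the full 64MB, i.e. roughly 64 × 1024 × 1024 / 100 ≈ 670,000× amplification for that single read.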

beinan commented 2 years ago

Thank you @idanz ! I just want to better understand your use case: are you trying to read only the footer of each Parquet file? I'm wondering whether we could cache the Parquet footer separately.

As @yuzhu said, Alluxio caches data at the block level (64MB or 128MB), so reducing the block size might save you some bandwidth; you could try the config below.

alluxio.user.block.size.bytes.default=16MB

Note that a small block size (<=8MB) might cause some issues on the compute engine (e.g. Presto) side, but I encourage you to try a 1MB block size to see what happens.
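For example, in conf/alluxio-site.properties (the right size is workload-dependent, so treat this as a starting point):

    alluxio.user.block.size.bytes.default=1MB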

Thank you again for reporting this issue to us!

idanz commented 2 years ago

I don't think that is the issue, since we witnessed it while caching was completely turned off. Our understanding is that in certain cases Impala will read all the "pages" in the Parquet files, which may be many and small when the table has many columns (~500 for us). So in the case of Alluxio over S3, this could mean thousands of open-ended S3 reads over the same file.

beinan commented 2 years ago

I see. It looks like there are multiple "positioned reads" in parallel and they are somehow "endless". We should stop the reading as early as we can, or we could read the content page by page (say 1MB per page) or row group by row group. I'm on it now, will get back to you soon.

beinan commented 2 years ago

Hello @idanz & @yuzhu,

I found that there is read amplification, but for each small read request the data Alluxio reads from S3 won't go beyond the chunk size (1MB).

I tried to read a small piece (100 bytes) repeatedly from S3 via Alluxio; we can see the read method in S3AInputStream trying to read 380,328 bytes from S3.

[screenshot: S3AInputStream.read asked to read 380,328 bytes]

Meanwhile, we can see the throughput of reads from the remote store:

[screenshot: remote read throughput]

We can see that bytesToRead actually comes from UnderFileSystemBlockReader, as below:

int bytesToRead =
    (int) Math.min(buf.writableBytes(), mBlockMeta.getBlockSize() - mInStreamPos);
int bytesRead = buf.writeBytes(mUnderFileSystemInputStream, bytesToRead);

So bytesToRead is the minimum of the buffer's writable size and the bytes remaining in the block.

And the buffer size is defined in BlockReaderHandler:

chunkSize = (int) Math.min(mRequest.getEnd() - mContext.getPosToQueue(), mChunkSize);
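To make the numbers concrete, here is a small illustration with assumed values (this is not the actual Alluxio code path):

    // Assume a 64MB block, the 1MB chunk cap, and a 100-byte application read at the start of the block.
    long blockSize = 64L * 1024 * 1024;    // mBlockMeta.getBlockSize()
    long chunkCap = 1L * 1024 * 1024;      // mChunkSize
    long end = blockSize;                  // mRequest.getEnd() when the request covers the whole block
    long posToQueue = 0;                   // mContext.getPosToQueue()
    long inStreamPos = 0;                  // mInStreamPos

    int chunkSize = (int) Math.min(end - posToQueue, chunkCap);            // 1MB buffer
    int bytesToRead = (int) Math.min(chunkSize, blockSize - inStreamPos);  // 1MB pulled from the UFS,
                                                                           // even for a 100-byte read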

The mChunkSize is 1MB, so as David and I guessed earlier, for small reads Alluxio won't pull more than the chunk size (1MB) from S3 at a time. @idanz, is this good enough for you, or do you think we still need an even smaller number?

beinan commented 2 years ago

If we look at the gRPC request for each read, it looks like it always requests the entire file?

[screenshot: gRPC read request spanning the whole file]

OK, so I guess each individual read won't go beyond the chunk size, but it might keep looping until it has read the entire file. I will take a look at how we construct the request on the client side.

yuzhu commented 2 years ago

@beinan for your reference, https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/GetObjectRequest.html#setRange-long-long- , I think we are issuing the request with setRange(start) only; see https://github.com/Alluxio/alluxio/blob/6a1876002e276c1a86d22b49f212e385f574f086/underfs/s3a/src/main/java/alluxio/underfs/s3a/S3AInputStream.java#L144
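In other words, a sketch of the difference (getObjectRequest, position, and length are placeholders):

    // Open-ended range, which appears to be what the linked line does today:
    getObjectRequest.setRange(position);                         // Range: bytes=<position>-
    // Bounded range, which caps how much S3 will stream back:
    getObjectRequest.setRange(position, position + length - 1);  // Range: bytes=<position>-<position+length-1>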

beinan commented 2 years ago

@yuzhu thank you for pointing that out! Yes, as you said, we do not set an "end" in the range when opening the stream, but the stream appears to be shared among all the reads on this file (object). I'm not sure what would happen if we set an "end" value here. (On the other hand, as I posted earlier, the length value passed to the read method in this class is too big, so if we used that value to set the "end", I don't think it would help much.)

https://github.com/Alluxio/alluxio/blob/6a1876002e276c1a86d22b49f212e385f574f086/underfs/s3a/src/main/java/alluxio/underfs/s3a/S3AInputStream.java#L113

Also, as the doc for getObjectContent (https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/S3Object.html#getObjectContent--) says:

The method is a simple getter and does not actually create a stream. If you retrieve an S3Object, you should close this input stream as soon as possible, because the object contents aren't buffered in memory and stream directly from Amazon S3. Further, failure to close this stream can cause the request pool to become blocked.
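A rough sketch of the "close as soon as possible" pattern the javadoc is pointing at (s3Client and request are placeholders, not the current Alluxio code):

    try (S3Object object = s3Client.getObject(request)) {  // S3Object is Closeable
      S3ObjectInputStream content = object.getObjectContent();
      // ... read only the bytes that were actually requested ...
      // close() (via try-with-resources) is fine when the range is bounded and mostly consumed;
      // content.abort() is the escape hatch when a lot of unread data would otherwise remain
      // behind an open-ended range.
    }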

I'm reading their doc now to see if we could find a best practice.

yuzhu commented 2 years ago

I am wondering if https://github.com/Alluxio/alluxio/issues/14215 is related. There we somehow exhausted the S3 client pool, which caused reads to eventually fail. Maybe we are not closing streams quickly enough? @LuQQiu @beinan

beinan commented 2 years ago

Reading through the S3 doc (and checking the code in prestodb), I think I have some basic ideas: we need to set the range properly. I will post a PR soon.

beinan commented 2 years ago

I set the range in this PR: https://github.com/Alluxio/alluxio/pull/14292. This PR speeds up small reads (100 bytes) from 1.4 ops/sec to 6.5 ops/sec, but we're still seeing read amplification (a 1MB read for a 100-byte request).

[screenshot: read amplification, 1MB vs 100 bytes]

The assumption here is that we always do chunk-sized reads; otherwise we might create the S3 HTTP stream too frequently.

beinan commented 2 years ago

I think I fixed the read amplification.

[screenshot: UFS read traffic after the fix]

# Warmup Iteration   2: 2.883 ops/s
# Warmup Iteration   3: 6.079 ops/s
# Warmup Iteration   4: 5.923 ops/s
# Warmup Iteration   5: 6.037 ops/s
Iteration   1: 5.989 ops/s
Iteration   2: 5.889 ops/s

It's reading 200 bytes per op; we can see 32,000 bytes per minute from the UFS (200 bytes/op × 6 ops/s × 60 s).

@idanz do you mind cherry-picking https://github.com/Alluxio/alluxio/pull/14292 and giving it a shot? Thanks!

@yuzhu I also made a change to close the stream once the position reaches the end of the range; hopefully it fixes the thread pool leak issue you mentioned in https://github.com/Alluxio/alluxio/issues/14215

Note: we need ReadPType.NO_CACHE to avoid reading the entire file; see the JMH code below:

    // Requires org.openjdk.jmh.annotations.*, alluxio.client.file.FileInStream,
    // alluxio.grpc.OpenFilePOptions, alluxio.grpc.ReadPType, java.io.IOException.
    // BenchState (not shown) holds the Alluxio FileSystem, the test file path, and a Random.
    @Benchmark
    @Measurement(iterations = 5)
    @BenchmarkMode(Mode.Throughput)
    public void readBench(BenchState state) {
        FileInStream in = null;
        try {
            // NO_CACHE so the read does not trigger async caching of the whole block.
            in = state.fs.openFile(state.path,
                OpenFilePOptions.newBuilder().setReadType(ReadPType.NO_CACHE).build());
            byte[] buf = new byte[200];
            // Positioned read of 200 bytes at a random offset within the first ~15KB.
            // Alternative tried: in.seek(state.rand.nextInt(15000)); in.read(buf, 0, 200);
            in.positionedRead(state.rand.nextInt(15000), buf, 0, 200);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (in != null) {
                try {
                    in.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

idanz commented 2 years ago

Thanks for the fix! We are currently exploring other options, but it sounds like you nailed it!

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.