Open idanz opened 2 years ago
thanks @idanz
@ggezer PTAL
So some read amplification is expected, because Alluxio async caches on a block level, but it seems like we are also reading unnecessarily while just reading the original stream. @beinan could you take a look please?
Thank you @idanz! I just want to better understand your use case: are you just trying to read the footer of each Parquet file? I'm wondering if we could cache the Parquet footer separately.
As @yuzhu said, Alluxio caches data at the block level (64MB or 128MB), so reducing the block size might save you some bandwidth. You could try the config below:
alluxio.user.block.size.bytes.default=16MB
Note that a small block size (<=8MB) might cause some issues on the compute engine (e.g. Presto) side, but I encourage you to try a 1MB block size to see what happens.
Thank you again for reporting this issue to us!
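For reference, a minimal sketch of where this property would go, assuming the standard conf/alluxio-site.properties file (the property name and value are taken from the suggestion above; the comments are illustrative):

```properties
# conf/alluxio-site.properties (sketch)
# Smaller blocks reduce read amplification for tiny reads, but very small
# blocks (<=8MB) may cause issues on the compute-engine side.
alluxio.user.block.size.bytes.default=16MB
```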
I don't think that is the issue, since we witnessed it while caching was completely turned off. Our understanding is that in certain cases Impala will read all the "pages" in the Parquet files, which can be numerous and small when the table has many columns (~500 for us). So in the case of Alluxio over S3, this might mean thousands of unbounded S3 reads over the same file.
I see. It looks like there are multiple "positioned reads" in parallel and they are somehow "endless". We should stop reading as early as we can, or read the content page by page (say 1MB per page) or row-group by row-group. I'm on it now and will get back to you all soon.
Hello @idanz & @yuzhu,
I found there is read amplification, but for each small read request the data Alluxio reads from S3 won't go beyond the chunk size (1MB).
I tried reading a small piece (100 bytes) repeatedly from S3 via Alluxio; we can see the read method in S3AInputStream trying to read 380,328 bytes from S3.
Meanwhile, we can see the throughput of reading from the remote store.
We can find that the number of bytesToRead actually comes from the UnderFileSystemBlockReader, as below:
int bytesToRead =
(int) Math.min(buf.writableBytes(), mBlockMeta.getBlockSize() - mInStreamPos);
int bytesRead = buf.writeBytes(mUnderFileSystemInputStream, bytesToRead);
We can see that bytesToRead is the minimum of the buffer's writable size and the remaining block size.
The buffer size is defined in the BlockReaderHandler:
chunkSize = (int) Math.min(mRequest.getEnd() - mContext.getPosToQueue(), mChunkSize);
The mChunkSize is 1MB, so as David and I guessed earlier, for small-piece reads Alluxio won't read more than the chunk size (1MB) per request. @idanz, is this good enough for you, or do you think we still need an even smaller number?
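To make the capping logic above concrete, here is a minimal pure-Java sketch of the two min() computations (the class, constants, and method names are illustrative stand-ins, not the actual Alluxio source):

```java
// Illustrative sketch of how a small client read gets capped by both the
// remaining block size and the 1 MB transfer chunk size.
public class ChunkCap {
    static final long CHUNK_SIZE = 1024 * 1024; // 1 MB, assumed default mChunkSize

    // Mirrors: chunkSize = min(request.getEnd() - posToQueue, mChunkSize)
    static long chunkSize(long requestEnd, long posToQueue) {
        return Math.min(requestEnd - posToQueue, CHUNK_SIZE);
    }

    // Mirrors: bytesToRead = min(buf.writableBytes(), blockSize - inStreamPos)
    static long bytesToRead(long writableBytes, long blockSize, long inStreamPos) {
        return Math.min(writableBytes, blockSize - inStreamPos);
    }

    public static void main(String[] args) {
        // A small positioned read near the start of a 64 MB block still
        // triggers up to a full 1 MB chunk from the UFS.
        System.out.println(chunkSize(64L * 1024 * 1024, 0));            // 1048576
        System.out.println(bytesToRead(1048576, 64L * 1024 * 1024, 0)); // 1048576
    }
}
```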
If we look at the gRPC request of each read, it looks like it always requests the entire file?
OK, so I guess each individual read won't go beyond the chunk size, but it might keep looping to read the entire file. I will take a look at how we construct the request on the client side.
@beinan for your reference: https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/GetObjectRequest.html#setRange-long-long- . I think we are issuing the request with setRange(start) only; see https://github.com/Alluxio/alluxio/blob/6a1876002e276c1a86d22b49f212e385f574f086/underfs/s3a/src/main/java/alluxio/underfs/s3a/S3AInputStream.java#L144
@yuzhu thank you for pointing that out! Yes, as you said, we do not set an "end" in the range when opening the stream, but the stream looks like it is shared among all the reads on this file (object). I'm not sure what would happen if we set an "end" value here. (On the other hand, as I posted earlier, the length value passed to the read method in this class is too big, so if we used that value to set the "end", it wouldn't help much, I think.)
Also, as the doc for getObjectContent says (https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/S3Object.html#getObjectContent--):
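To illustrate what a bounded "end" could look like, here is a hypothetical sketch of the range arithmetic: cap the request at one chunk or the remaining object length, whichever is smaller. The class, constant, and method names are made up for illustration; this is not the actual client code:

```java
// Sketch: compute an inclusive end offset for a bounded Range request
// (Range: bytes=start-end) instead of an open-ended setRange(start).
public class RangeBound {
    static final long CHUNK_SIZE = 1024 * 1024; // assumed 1 MB read chunk

    // Returns the inclusive end offset for a request starting at `start`
    // against an object of `objectLength` bytes.
    static long rangeEnd(long start, long objectLength) {
        return Math.min(start + CHUNK_SIZE, objectLength) - 1;
    }

    public static void main(String[] args) {
        // Reading 100 bytes at offset 0 of a 500 MB object: a bounded request
        // fetches at most 1 MB instead of streaming to end-of-object.
        System.out.println(rangeEnd(0, 500L * 1024 * 1024)); // 1048575

        // Near the end of the object, the range is clipped to the last byte.
        System.out.println(rangeEnd(500L * 1024 * 1024 - 10, 500L * 1024 * 1024));
    }
}
```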
The method is a simple getter and does not actually create a stream. If you retrieve an S3Object, you should close this input stream as soon as possible, because the object contents aren't buffered in memory and stream directly from Amazon S3. Further, failure to close this stream can cause the request pool to become blocked.
I'm reading their doc now to see if we could find a best practice.
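As a general-purpose sketch of that "close as soon as possible" practice, using a plain InputStream as a stand-in for S3ObjectInputStream (the helper name is made up): read exactly what you need, then let try-with-resources close the stream so the underlying connection is released even on failure.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch: read a fixed number of bytes, then close the stream promptly
// so the (simulated) HTTP connection goes back to the pool.
public class PromptClose {
    static byte[] readExactly(InputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        int off = 0;
        while (off < n) {
            int r = in.read(buf, off, n - off);
            if (r < 0) throw new IOException("unexpected EOF");
            off += r;
        }
        return buf;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello alluxio".getBytes();
        try (InputStream in = new ByteArrayInputStream(data)) {
            byte[] first5 = readExactly(in, 5);
            System.out.println(new String(first5)); // hello
        } // stream closed here, releasing the connection
    }
}
```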
I am wondering if https://github.com/Alluxio/alluxio/issues/14215 is related. There we somehow exhausted the S3 client pool, which caused reads to eventually fail. Maybe we are not closing streams quickly enough? @LuQQiu @beinan
After reading through the S3 docs (and checking the code in prestodb), I think I have some basic ideas: we need to set the range properly. I will post a PR soon.
Set the range in this PR: https://github.com/Alluxio/alluxio/pull/14292. This PR speeds up small-piece reads (100 bytes) from 1.4 ops/sec to 6.5 ops/sec, but we're still seeing read amplification (1MB vs 100 bytes).
The assumption here is that we're always doing chunked reads; otherwise we might create the S3 HTTP stream too frequently.
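A toy illustration of that tradeoff: read a range in chunk-size pieces over a single stream instead of opening a new stream per tiny read. This is pure Java for illustration only; the chunk size is shrunk to 4 bytes and the names are made up:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch: one stream, many chunk-sized read() round trips, rather than
// one (expensive) stream open per tiny read.
public class ChunkedRead {
    static final int CHUNK = 4; // tiny chunk for illustration

    // Reads up to `total` bytes in CHUNK-sized pieces from a single stream,
    // returning the number of read() round trips performed.
    static int readInChunks(InputStream in, int total) throws IOException {
        byte[] buf = new byte[CHUNK];
        int trips = 0;
        int got = 0;
        while (got < total) {
            int r = in.read(buf, 0, Math.min(CHUNK, total - got));
            if (r < 0) break; // end of stream
            got += r;
            trips++;
        }
        return trips;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10];
        try (InputStream in = new ByteArrayInputStream(data)) {
            // 10 bytes in 4-byte chunks: 4 + 4 + 2 = 3 round trips, 1 stream open.
            System.out.println(readInChunks(in, 10)); // 3
        }
    }
}
```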
I think I fixed the read amplification.
# Warmup Iteration 2: 2.883 ops/s
# Warmup Iteration 3: 6.079 ops/s
# Warmup Iteration 4: 5.923 ops/s
# Warmup Iteration 5: 6.037 ops/s
Iteration 1: 5.989 ops/s
Iteration 2: 5.889 ops/s
It's reading 200 bytes per op, and we can see 32,000 bytes per minute from the UFS (200 bytes/op, ~6 ops/s, 60 s).
@idanz do you mind cherry-picking https://github.com/Alluxio/alluxio/pull/14292 and giving it a shot? Thanks!
@yuzhu I also made a change to close the stream once the position reaches the end of the range; hopefully it fixes the thread pool leak you mentioned in https://github.com/Alluxio/alluxio/issues/14215
Note: we need ReadPType.NO_CACHE to avoid reading the entire file; see the JMH code below:
@Benchmark
@Measurement(iterations = 5)
@BenchmarkMode(Mode.Throughput)
public void readBench(BenchState state) {
  FileInStream in = null;
  try {
    in = state.fs.openFile(state.path,
        OpenFilePOptions.newBuilder().setReadType(ReadPType.NO_CACHE).build());
    byte[] buf = new byte[200];
    // in.seek(state.rand.nextInt(15000));
    // in.read(buf, 0, 200);
    in.positionedRead(state.rand.nextInt(15000), buf, 0, 200);
  } catch (Exception e) {
    e.printStackTrace();
  } finally {
    if (in != null) { // guard against NPE when openFile fails
      try {
        in.close();
      } catch (IOException e) {
        e.printStackTrace();
      }
    }
  }
}
Thanks for the fix; we are currently exploring other options, but either way it sounds like you nailed it!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.
Alluxio Version: 2.6.2
Describe the bug: When reading Parquet files from a cold cache with an S3 UFS, we see huge bandwidth coming out of the store. I believe this is because the S3 request is built with a starting offset but no end range, so it reads as much as it can until the HTTP request is aborted. It would be better to start the HTTP request only after the read call happens and apply some reasonable "end range" based on the read request, or at least start with some configurable "readahead" like S3A does.
To Reproduce
Expected behavior: S3 reads should be as small as required.
Urgency: Blocker for using the S3 UFS, but we can try other things.