Open loicalleyne opened 2 months ago
The delete is supposed to happen here. Can you share the steps you used to reproduce this with Trino?
@eric-maynard I ran the queries listed in the To Reproduce section of the OP in Trino. Do you need Trino configuration information in addition to this?
Sure, or if you can reproduce this any other way (via calls to the CLI, or another engine) that works too! Just hoping to create a small minimally reproducing example to debug. A unit test would be ideal.
Polaris does not delete any historical metadata.json files, only the current one will be deleted.
The Iceberg PR you posted fixed it in the Iceberg class CatalogUtil, but Polaris doesn't use that class for a few reasons, such as async deletion. To fix it, we can follow the approach in the PR to collect the full set of files to delete in Polaris.
Specifically, we need the following two methods to get all metadata.json files and stats files:
private static Set<String> metadataLocations(TableMetadata tableMetadata) {
  Set<String> metadataLocations =
      tableMetadata.previousFiles().stream()
          .map(TableMetadata.MetadataLogEntry::file)
          .collect(Collectors.toSet());
  metadataLocations.add(tableMetadata.metadataFileLocation());
  return metadataLocations;
}

private static Set<String> statsLocations(TableMetadata tableMetadata) {
  return tableMetadata.statisticsFiles().stream()
      .map(StatisticsFile::path)
      .collect(Collectors.toSet());
}
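For illustration, here is a minimal sketch of how the two sets above might be merged into one list of extra files to delete on drop. To keep it runnable without the Iceberg dependency, `TableMetaSketch` is a hypothetical stand-in for the three pieces of `TableMetadata` used above, and `metadataAndStatsFiles` is an assumed helper name, not something that exists in Polaris or Iceberg.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DropTableFileSketch {

    /**
     * Hypothetical stand-in (not Iceberg API) for the parts of TableMetadata used above:
     * the current metadata.json location, the previous metadata.json files, and the stats files.
     */
    public record TableMetaSketch(
            String metadataFileLocation,
            List<String> previousMetadataFiles,
            List<String> statisticsFiles) {}

    /** Unions current metadata, historical metadata, and stats file locations, dropping duplicates. */
    public static Set<String> metadataAndStatsFiles(TableMetaSketch meta) {
        // LinkedHashSet keeps insertion order, which makes logs and tests easier to read.
        Set<String> files = new LinkedHashSet<>(meta.previousMetadataFiles());
        files.add(meta.metadataFileLocation());
        files.addAll(meta.statisticsFiles());
        return files;
    }

    public static void main(String[] args) {
        // Made-up paths, purely for demonstration.
        TableMetaSketch meta = new TableMetaSketch(
                "tbl/metadata/v3.metadata.json",
                List.of("tbl/metadata/v1.metadata.json", "tbl/metadata/v2.metadata.json"),
                List.of("tbl/metadata/stats-1.puffin"));

        // Each location would then be handed to whatever deletion task Polaris schedules.
        metadataAndStatsFiles(meta).forEach(System.out::println);
    }
}
```

In Polaris itself the same union would presumably be built from the real `TableMetadata` and passed to the existing async cleanup, rather than deleted inline.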
I've been building an Iceberg playground using my fork of insta-infra to bring up containerized Polaris, postgres, minio and more. I'll push a repro setup to a repo tomorrow and post the link here.
I've just remembered that Polaris doesn't work with Minio due to lack of STS support. Do you know of any other service that can provide local S3 with STS? Or are you ok to use your own S3/GCS credentials for testing?
Do you know of any other service that can provide local S3 with STS?
Not that I know of. Maybe @snazy and @dimas-b know of more options. BTW, can you open another issue for this question? It's a bit off-topic here.
I've just remembered that Polaris doesn't work with Minio due to lack of STS support. Do you know of any other service that can provide local S3 with STS? Or are you ok to use your own S3/GCS credentials for testing?
It appears Localstack has STS support. https://docs.localstack.cloud/user-guide/aws/sts/
Hi @flyrain, I am new to the Polaris community and this task seems like a good one to start with. May I work on this issue?
Feel free to take it, @danielhumanmod.
Hi @flyrain (@sfc-gh-ygu) @eric-maynard, thank you for all the valuable context in the discussion. I have created a draft PR for this issue. Before it's ready for review, I have listed some points that need to be discussed in the "To be discussed" section of the PR. I would greatly appreciate it if you could provide some insight on these questions!
Thanks @danielhumanmod for working on it. We only need to add support for the historical metadata.json files and stats files. The others are already taken care of.
Is this a possible security vulnerability?
Describe the bug
When dropping a table, the data folder is deleted but the metadata folder remains with the metadata.json files it contains.
To Reproduce
Using postgres as the metadata store, and GCS for storage.
The queries below (executed in Trino) create a schema/namespace in the catalog, create a table from some sample data in BigQuery (the data can be from anywhere really), copy the data to a second table, then drop the second table.
Actual Behavior
Files in {table name}/metadata are not deleted.
Expected Behavior
All files in the dropped table's path are deleted.
Additional context
postgres as backing metadata store
GCS storage
System information
Trino 449
Polaris git: 6fcf5ccaebd7ca13a0cb96c96adca699a24080a0