apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0
6.35k stars 2.2k forks source link

InMemoryCatalog's FiloIO in memory map isn't persistent in RestCatalog #9604

Closed geruh closed 8 months ago

geruh commented 8 months ago

Query engine

None

Question

We are seeing some unexpected behavior when testing the RESTCatalog. The RESTCatalog is intended to use an InMemoryCatalog for testing purposes. It appears that the RESTCatalog and InMemoryCatalog are using separate instances of InMemoryFileIO, even though they point to the same warehouse location which makes sense. However, our expectation was that the RESTCatalog would share the same InMemoryCatalog instance (and InMemoryFileIO instance) across all tests. This would ensure the files associated with tables are accessible and consistent between the backed catalog and RESTCatalog.

It seems that when instantiating the RESTCatalog, a new InMemoryFileIO instance is also created separately. As a result, test data is not actually shared between the two catalogs. For instance, when committing data to a table, the InMemoryCatalog's FileIO stores the table metadata since the CatalogHandler delegates metadata creation. But when creating the table, the RESTCatalog's separate InMemoryFileIO instance is returned as part of RESTTableOperations.

This leaves the InMemoryCatalog storing table metadata, while the RESTCatalog's IO stores Snapshot and ManifestList data separately.

Is this the intended behavior? Or did we expect the RESTCatalog to share the same InMemoryCatalog and InMemoryFileIO instances during testing?

@nastra @jackye1995

nastra commented 8 months ago

I don't think this was intended behavior.

It seems that when instantiating the RESTCatalog, a new InMemoryFileIO instance is also created separately.

Could you point out where this happens exactly?

geruh commented 8 months ago

Based on my current understanding, the InMemoryCatalog serves as the supporting catalog for the RESTCatalog. When a table is created the RESTTableOperations, returns a ResolvingFileIO object that encapsulates a HadoopFileIO instance. In scenarios like the one demonstrated by the testAppendFiles method (within the REST catalog context), after committing changes, it appears that only InMemoryFileIO is used for referencing metadata files.

Screenshot 2024-02-05 at 8 47 05 PM

However, attempting to access these metadata files with the REST table's HadoopFileIO results in a NotFoundException. This issue does not occur when accessing the files through the backendCatalog, indicating that HadoopFileIO can access files that InMemoryFileIO cannot.

ResolvingFileIO

Screenshot 2024-02-05 at 8 52 05 PM

InMemoryFileIO

Screenshot 2024-02-05 at 8 53 31 PM

Additionally, when attempting to read a manifest list or snapshot created on the table using InMemoryFileIO, a NotFoundException is also encountered.

Screenshot 2024-02-05 at 8 56 28 PM

It's also worth noting that the test methods utilizing this functionality in the TestRESTCatalog class initiate a new instance of InMemoryFileIO for each test, following the conf passed in to the catalog.

https://github.com/apache/iceberg/blob/main/core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java#L2323

geruh commented 8 months ago

Synced with @nastra through Slack.

This is unexpected behavior, the RESTCatalog used to be backed by the JdbcCatalog where the ResolvingFileIO knew how to handle this Catalog. However, according to this pull-request the JdbcCatalog didn't have view support, therefore it was swapped out with the InMemoryCatalog.

We have at least two potential courses for moving forward, either we wait for the view support efforts in the JdbcCatalog in this pull-request and revert the changes. Or we fix this behavior in the RESTCatalogTests which is suggested by @nastra.

cc: @jackye1995 @amogh-jahagirdar

nastra commented 8 months ago

+1 to fixing the behavior in TestRESTCatalog when using InMemoryCatalog

jbonofre commented 8 months ago

It makes sense to improve the TestRESTCatalog with InMemoryCatalog backend.