Apache Accumulo
https://accumulo.apache.org
Apache License 2.0

Table creation gets progressively slower when creating a lot of tables #4684

Closed dlmarion closed 2 months ago

dlmarion commented 2 months ago

User reported this issue in the user mailing list. I wrote a small test (below) which shows that table creation gets progressively slower.

package org.apache.accumulo.test;

import java.time.Duration;

import org.apache.accumulo.core.client.Accumulo;
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.harness.SharedMiniClusterBase;
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;

public class CreateTableIT extends SharedMiniClusterBase {

  @Override
  protected Duration defaultTimeout() {
    return Duration.ofMinutes(5);
  }

  @BeforeAll
  public static void setup() throws Exception {
    SharedMiniClusterBase.startMiniCluster();
  }

  @AfterAll
  public static void teardown() {
    SharedMiniClusterBase.stopMiniCluster();
  }

  @Test
  public void testCreateLotsOfTables() throws Exception {
    try (AccumuloClient client = Accumulo.newClient().from(getClientProps()).build()) {

      String[] tableNames = getUniqueNames(1000);

      for (int i = 0; i < tableNames.length; i++) {
        // Create waits for the Fate operation to complete
        long start = System.currentTimeMillis();
        client.tableOperations().create(tableNames[i]);
        System.out.println("Table creation took: " + (System.currentTimeMillis() - start) + "ms");
      }
    }
  }

}

When I jstacked the Manager I noticed the following thread stack:

"Repo Runner-Worker-1" #100 daemon prio=5 os_prio=0 cpu=2402.44ms elapsed=78.24s tid=0x00007fa2b8002000 nid=0x59e9c runnable  [0x00007fa3acd7f000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.IOUtil.write1(java.base@11.0.22/Native Method)
    at sun.nio.ch.EPollSelectorImpl.wakeup(java.base@11.0.22/EPollSelectorImpl.java:254)
    - locked <0x00000000f178c2f0> (a java.lang.Object)
    at org.apache.zookeeper.ClientCnxnSocketNIO.wakeupCnxn(ClientCnxnSocketNIO.java:324)
    - locked <0x00000000f139b620> (a org.apache.zookeeper.ClientCnxnSocketNIO)
    at org.apache.zookeeper.ClientCnxnSocketNIO.packetAdded(ClientCnxnSocketNIO.java:315)
    at org.apache.zookeeper.ClientCnxn.queuePacket(ClientCnxn.java:1680)
    at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1565)
    at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1555)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1970)
    at org.apache.accumulo.core.fate.zookeeper.ZooCache$2.run(ZooCache.java:387)
    at org.apache.accumulo.core.fate.zookeeper.ZooCache$2.run(ZooCache.java:1)
    at org.apache.accumulo.core.fate.zookeeper.ZooCache$ZooRunnable.retry(ZooCache.java:247)
    at org.apache.accumulo.core.fate.zookeeper.ZooCache.get(ZooCache.java:406)
    at org.apache.accumulo.core.fate.zookeeper.ZooCache.get(ZooCache.java:338)
    at org.apache.accumulo.core.util.tables.TableMap.<init>(TableMap.java:72)
    at org.apache.accumulo.core.util.tables.TableZooHelper.lambda$0(TableZooHelper.java:116)
    at org.apache.accumulo.core.util.tables.TableZooHelper$$Lambda$438/0x00000008403e0840.call(Unknown Source)
    at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4955)
    at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3589)
    at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2328)
    at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2187)
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2081)
    at com.google.common.cache.LocalCache.get(LocalCache.java:4036)
    at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4950)
    at org.apache.accumulo.core.util.tables.TableZooHelper.getCachedTableMap(TableZooHelper.java:116)
    at org.apache.accumulo.core.util.tables.TableZooHelper.getTableMap(TableZooHelper.java:106)
    at org.apache.accumulo.core.util.tables.TableZooHelper._getTableIdDetectNamespaceNotFound(TableZooHelper.java:74)
    at org.apache.accumulo.core.util.tables.TableZooHelper.getTableId(TableZooHelper.java:63)
    at org.apache.accumulo.core.clientImpl.ClientContext.getTableId(ClientContext.java:634)
    at org.apache.accumulo.core.clientImpl.ClientContext.createBatchWriter(ClientContext.java:756)
    at org.apache.accumulo.core.clientImpl.ClientContext.createBatchWriter(ClientContext.java:762)
    at org.apache.accumulo.server.metadata.TabletsMutatorImpl.getWriter(TabletsMutatorImpl.java:58)
    at org.apache.accumulo.server.metadata.TabletsMutatorImpl.mutateTablet(TabletsMutatorImpl.java:73)
    at org.apache.accumulo.server.metadata.ServerAmpleImpl.mutateTablet(ServerAmpleImpl.java:88)
    at org.apache.accumulo.server.util.MetadataTableUtil.addTablet(MetadataTableUtil.java:177)
    at org.apache.accumulo.manager.tableOps.create.PopulateMetadata.call(PopulateMetadata.java:68)
    at org.apache.accumulo.manager.tableOps.ManagerRepo.call(ManagerRepo.java:1)
    at org.apache.accumulo.manager.tableOps.TraceRepo.call(TraceRepo.java:60)
    at org.apache.accumulo.core.fate.Fate$TransactionRunner.run(Fate.java:100)
    at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.22/ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.22/ThreadPoolExecutor.java:628)
    at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52)
    at java.lang.Thread.run(java.base@11.0.22/Thread.java:829)

It looks like ServerAmpleImpl is trying to update the metadata table for the newly created table, but the code has to go back to ZooKeeper to repopulate the ZooCache tableIdMap because the prior table creation invalidated it.
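A minimal model (not Accumulo code) of the pattern visible in the stack trace: if each table creation invalidates the cached tableIdMap, the next metadata update re-reads one entry per existing table from ZooKeeper, so total ZK reads grow quadratically with the number of tables created.

```java
public class CacheReloadModel {

  // Total ZooKeeper reads needed to create n tables when every creation
  // forces a full reload of the table map on the next lookup.
  public static long totalReads(int n) {
    long reads = 0;
    for (int tables = 1; tables <= n; tables++) {
      reads += tables; // reload reads one entry per existing table
    }
    return reads;
  }

  public static void main(String[] args) {
    // 1 + 2 + ... + 1000 = 500500 reads for 1000 creates
    System.out.println("ZK reads for 1000 creates: " + totalReads(1000));
  }
}
```

This matches the observed behavior: each individual create is fast at first, then slows as the reload cost grows with the table count.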

dlmarion commented 2 months ago

Reduced the calls to ZooCache in #4685, but was not able to remove all of them. Repopulating ZooCache with lots of tables is expensive, and doing it in quick succession progressively slows down table creation. We might need a different code path, something like TableOperations.createTables, for creating many tables versus creating a single table.
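A sketch of what such an API might look like (hypothetical: no createTables method exists in Accumulo's TableOperations today). Batching would let one fate operation create all the tables, so the expensive ZooCache repopulation happens once per batch instead of once per table:

```java
import java.util.List;

// Hypothetical interface sketch, not part of Accumulo.
public interface BatchTableOperations {

  /** Creates every named table, failing if any name already exists. */
  void createTables(List<String> tableNames) throws Exception;
}
```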

cshannon commented 2 months ago

I am not super familiar with the caching code, but it looks like the entire cache is invalidated on changes.

Would it be possible to use an observer and just have ZooKeeper send notifications when the node for the table name/id changes? The cache could then be updated on table creations and deletions.

keith-turner commented 2 months ago

Would it be possible to use an observer and just have ZooKeeper send notifications when the node for the table name/id changes? The cache could then be updated on table creations and deletions.

Discussed this with @cshannon offline. Suspect we would need custom ZK code to keep the cache up to date. The goal is to have a Map<TableName, TableId>; however, in ZK we have Map<TableId, TableName>. With enough custom observers we may be able to keep the Map<TableName, TableId> up to date based on changes in ZK.

An alternative approach may be to store the Map<TableName, TableId> in ZooKeeper, conceptually as an index. This would make the Accumulo client code super simple, as it could use ZooCache for the Map<TableName, TableId> stored in ZooKeeper. The only problem is that this index in ZK needs to be kept consistent. Consistency may not be a problem, though, because all table operations that would need to update this Map<TableName, TableId> index in ZK (create, import, clone, rename, and delete) already run as fate operations, so the updates to the index could easily be done via fate. If the code added to fate is simple, then a lot of complex client code could be removed, and it may result in a net simplification of the code base; not sure though. This would probably not be a change for 2.1.x initially.
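One possible node layout for such an index (hypothetical: the `table-name-index` path below does not exist in Accumulo; only the id-to-name direction is stored today):

```
/accumulo/<instance-id>/tables/<tableId>/name     # existing: id -> name
/accumulo/<instance-id>/table-name-index/<name>   # hypothetical: name -> id (node data = tableId)
```

Clients could then resolve a table name with a single ZooCache read of the index node instead of rebuilding the whole reversed map.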

keith-turner commented 2 months ago

Looking into how the fate operations currently work, they call Utils.getTableNameLock().lock() to ensure only a single fate operation does anything with table names. If we ever allow fate operations to run in multiple processes, this would not work very well, as the lock assumes only one process runs fate operations at a time. If the Map<TableName, TableId> existed in ZooKeeper, we could possibly remove the need for Utils.getTableNameLock().lock() and instead use atomic operations on the Map<TableName, TableId> in ZooKeeper to ensure that two tables are not created with the same name concurrently. So there may be an additional benefit to having the Map<TableName, TableId> in ZooKeeper.
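A sketch of the atomic-claim idea (assumption, not Accumulo code): ZooKeeper's create() fails if the node already exists, which is modeled here with ConcurrentHashMap.putIfAbsent. Two concurrent creates of the same table name cannot both succeed, removing the need for a process-wide table-name lock.

```java
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for a name -> id index in ZooKeeper. A real
// implementation would use ZooKeeper.create() on the index node, which
// atomically fails with NodeExists if another process claimed the name.
public class TableNameIndex {

  private final ConcurrentHashMap<String, String> nameToId = new ConcurrentHashMap<>();

  /** Returns true if this caller claimed the name, false if it was already taken. */
  public boolean claim(String tableName, String tableId) {
    return nameToId.putIfAbsent(tableName, tableId) == null;
  }

  public static void main(String[] args) {
    TableNameIndex idx = new TableNameIndex();
    System.out.println(idx.claim("t1", "1")); // first create succeeds: true
    System.out.println(idx.claim("t1", "2")); // duplicate name rejected: false
  }
}
```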

dlmarion commented 2 months ago

If we ever allow fate operations to run on multiple processes, then this would not work very well as this lock assumes only one process runs fate operations at a time.

If the table path in ZooKeeper included the namespace, then you could just lock the namespace when creating, cloning, renaming, or importing a table to ensure table names are unique. This would also reduce the search space in ZooKeeper.

dlmarion commented 2 months ago

#4685 is merged and I think it is a sufficient improvement for the 2.1.3 patch release. I'm going to close this unless there are any objections.

keith-turner commented 2 months ago

https://github.com/apache/accumulo/pull/4685 is merged and I think it is a sufficient improvement for the 2.1.3 patch release. I'm going to close this unless there are any objections.

I opened an issue about creating an index in zookeeper for mapping table names to table ids. That seems to have multiple benefits and was something discussed related to this issue, but no need to keep this issue open for that.

elahrvivaz commented 2 months ago

thanks for the quick turn-around!