apache / incubator-uniffle

Uniffle is a high performance, general purpose Remote Shuffle Service.
https://uniffle.apache.org/
Apache License 2.0

[Bug] A more elegant way to delete files is needed #1769

Open rickyma opened 3 months ago

rickyma commented 3 months ago

Describe the bug

We need a more elegant way to delete files: when purging an application's resources, the server currently always attempts to delete from the local disk first and then from HDFS, regardless of whether the data was ever written to HDFS.

Affects Version(s)

master

Uniffle Server Log Output

[2024-06-07 21:33:45.064] [checkResource-0] [WARN] ShuffleTaskManager.preAllocatedBufferCheck - Remove expired preAllocatedBuffer[id=8311808] that required by app: application_1703049085550_12962744_1717766212505
[2024-06-07 21:33:45.064] [expiredAppCleaner-0] [INFO] ShuffleTaskManager.checkResourceStatus - Detect expired appId[application_1703049085550_12962744_1717766212505] according to rss.server.app.expired.withoutHeartbeat
[2024-06-07 21:33:45.065] [clearResourceThread] [INFO] ShuffleTaskManager.removeResources - Start remove resource for appId[application_1703049085550_12962744_1717766212505]
[2024-06-07 21:33:45.268] [clearResourceThread] [INFO] HybridStorageManager.removeResources - Start to remove resource of AppPurgeEvent{appId='application_1703049085550_12962744_1717766212505', user='aaa', shuffleIds=[0]}
[2024-06-07 21:33:45.269] [clearResourceThread] [INFO] LocalStorageManager.cleanupStorageSelectionCache - Cleaning the storage selection cache costs: 1(ms) for event: AppPurgeEvent{appId='application_1703049085550_12962744_1717766212505', user='aaa', shuffleIds=[0]}
[2024-06-07 21:33:45.269] [clearResourceThread] [INFO] LocalStorage.removeResources - Start to remove resource of application_1703049085550_12962744_1717766212505/0
[2024-06-07 21:33:45.269] [clearResourceThread] [INFO] LocalStorage.removeResources - Finish remove resource of application_1703049085550_12962744_1717766212505/0, disk size is 0 and 0 shuffle metadata
[2024-06-07 21:33:54.505] [clearResourceThread] [INFO] LocalFileDeleteHandler.delete - Delete shuffle data for appId[application_1703049085550_12962744_1717766212505] with /data1/rssdata/application_1703049085550_12962744_1717766212505 cost 9236 ms
[2024-06-07 21:33:54.505] [clearResourceThread] [INFO] HadoopShuffleDeleteHandler.delete - Try delete shuffle data in Hadoop FS for appId[application_1703049085550_12962744_1717766212505] of user[aaa] with hdfs://xxx/rss/online/application_1703049085550_12962744_1717766212505
[2024-06-07 21:33:54.600] [clearResourceThread] [WARN] HadoopShuffleDeleteHandler.delete - Can't delete shuffle data for appId[application_1703049085550_12962744_1717766212505] with 1 times
java.io.FileNotFoundException: File hdfs://xxx/rss/online/application_1703049085550_12962744_1717766212505 does not exist.
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:993)
        at org.apache.hadoop.hdfs.DistributedFileSystem.access$800(DistributedFileSystem.java:120)
        at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1053)
        at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1050)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1060)
        at org.apache.uniffle.storage.handler.impl.HadoopShuffleDeleteHandler.delete(HadoopShuffleDeleteHandler.java:101)
        at org.apache.uniffle.storage.handler.impl.HadoopShuffleDeleteHandler.delete(HadoopShuffleDeleteHandler.java:61)
        at org.apache.uniffle.server.storage.HadoopStorageManager.removeResources(HadoopStorageManager.java:125)
        at org.apache.uniffle.server.storage.HybridStorageManager.removeResources(HybridStorageManager.java:162)
        at org.apache.uniffle.server.ShuffleTaskManager.removeResources(ShuffleTaskManager.java:775)
        at org.apache.uniffle.server.ShuffleTaskManager.lambda$new$0(ShuffleTaskManager.java:183)
        at java.lang.Thread.run(Thread.java:750)
[2024-06-07 21:33:55.636] [clearResourceThread] [WARN] HadoopShuffleDeleteHandler.delete - Can't delete shuffle data for appId[application_1703049085550_12962744_1717766212505] with 2 times
java.io.FileNotFoundException: File hdfs://xxx/rss/online/application_1703049085550_12962744_1717766212505 does not exist.
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:993)
        at org.apache.hadoop.hdfs.DistributedFileSystem.access$800(DistributedFileSystem.java:120)
        at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1053)
        at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1050)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1060)
        at org.apache.uniffle.storage.handler.impl.HadoopShuffleDeleteHandler.delete(HadoopShuffleDeleteHandler.java:101)
        at org.apache.uniffle.storage.handler.impl.HadoopShuffleDeleteHandler.delete(HadoopShuffleDeleteHandler.java:61)
        at org.apache.uniffle.server.storage.HadoopStorageManager.removeResources(HadoopStorageManager.java:125)
        at org.apache.uniffle.server.storage.HybridStorageManager.removeResources(HybridStorageManager.java:162)
        at org.apache.uniffle.server.ShuffleTaskManager.removeResources(ShuffleTaskManager.java:775)
        at org.apache.uniffle.server.ShuffleTaskManager.lambda$new$0(ShuffleTaskManager.java:183)
        at java.lang.Thread.run(Thread.java:750)
[2024-06-07 21:33:56.672] [clearResourceThread] [WARN] HadoopShuffleDeleteHandler.delete - Can't delete shuffle data for appId[application_1703049085550_12962744_1717766212505] with 3 times
java.io.FileNotFoundException: File hdfs://xxx/rss/online/application_1703049085550_12962744_1717766212505 does not exist.
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:993)
        at org.apache.hadoop.hdfs.DistributedFileSystem.access$800(DistributedFileSystem.java:120)
        at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1053)
        at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1050)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1060)
        at org.apache.uniffle.storage.handler.impl.HadoopShuffleDeleteHandler.delete(HadoopShuffleDeleteHandler.java:101)
        at org.apache.uniffle.storage.handler.impl.HadoopShuffleDeleteHandler.delete(HadoopShuffleDeleteHandler.java:61)
        at org.apache.uniffle.server.storage.HadoopStorageManager.removeResources(HadoopStorageManager.java:125)
        at org.apache.uniffle.server.storage.HybridStorageManager.removeResources(HybridStorageManager.java:162)
        at org.apache.uniffle.server.ShuffleTaskManager.removeResources(ShuffleTaskManager.java:775)
        at org.apache.uniffle.server.ShuffleTaskManager.lambda$new$0(ShuffleTaskManager.java:183)
        at java.lang.Thread.run(Thread.java:750)

Uniffle Engine Log Output

No response

Uniffle Server Configurations

No response

Uniffle Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

EnricoMi commented 3 months ago

Can you elaborate on the issue a bit more please? What is the current behaviour and what is not elegant about it?

rickyma commented 3 months ago

When cleaning up expired resources, no matter whether the data is an HDFS file or a normal local disk file, we always do the following in HybridStorageManager.removeResources, which is why the exception above is thrown:

public void removeResources(PurgeEvent event) {
  LOG.info("Start to remove resource of {}", event);
  warmStorageManager.removeResources(event);
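  // cold storage = HDFS (HadoopStorageManager); this call runs even when the app has no data on HDFS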
  coldStorageManager.removeResources(event);
}
EnricoMi commented 3 months ago

You are saying we should not attempt to delete from a storage if the data is not stored there? This means we need to keep track of where the data resides.

rickyma commented 3 months ago

Yeah, it's better this way, so we can avoid a lot of meaningless warning logs.
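
A minimal sketch of what that could look like, assuming HybridStorageManager tracks which applications ever flushed data to cold storage. The appsWithColdData field and markColdStorageWrite method are hypothetical names for illustration, not existing Uniffle APIs, and it assumes PurgeEvent exposes the appId (fragment uses java.util.Set and java.util.concurrent.ConcurrentHashMap):

// Hypothetical sketch inside HybridStorageManager, not the actual implementation.
private final Set<String> appsWithColdData = ConcurrentHashMap.newKeySet();

// Hypothetical hook: called from the flush path whenever an event is written to cold storage.
public void markColdStorageWrite(String appId) {
  appsWithColdData.add(appId);
}

public void removeResources(PurgeEvent event) {
  LOG.info("Start to remove resource of {}", event);
  // Local (warm) data is always purged.
  warmStorageManager.removeResources(event);
  // Only touch HDFS when this app actually wrote data there, which avoids the
  // repeated FileNotFoundException warnings shown in the server log above.
  if (appsWithColdData.remove(event.getAppId())) {
    coldStorageManager.removeResources(event);
  }
}

A simpler alternative would be for the cold storage deletion path to check whether the app directory exists before listing and deleting it, and to log the "does not exist" case at debug level instead of retrying with warnings.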