apache / gravitino

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
https://gravitino.apache.org
Apache License 2.0

[Bug report] GFS failed to be fully compatible with the Hadoop rm command line. #5564

Closed: baitian77 closed this issue 1 week ago

baitian77 commented 1 week ago

Version

0.6.0

Describe what's wrong

  1. Failed to execute: hadoop fs -rm gvfs://fileset/catalog/schema/hdfs_fileset_1/jstack-21727
  2. The Hadoop rm command is translated into a mv operation into the trash.
  3. The Gravitino server then fails to load the trash path as a fileset (and we cannot create a fileset named .Trash); see the sketch after this list.
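
My reading of the stack trace (a rough sketch, not Gravitino source): Hadoop's default trash policy builds the trash destination from the user's home directory on the same FileSystem, and GVFS then tries to parse that destination as a fileset path:

import org.apache.hadoop.fs.Path;

public class TrashPathIllustration {
    public static void main(String[] args) {
        // Hypothetical home directory the gvfs FileSystem reports for user "root";
        // TrashPolicyDefault places deleted files under <home>/.Trash/Current.
        Path home = new Path("gvfs://fileset/user/root");
        Path trashCurrent = new Path(home, ".Trash/Current");
        // GVFS expects gvfs://fileset/<catalog>/<schema>/<fileset>/..., so the trash
        // destination resolves to catalog "user", schema "root", fileset ".Trash",
        // which matches the NoSuchCatalogException below.
        System.out.println(trashCurrent); // gvfs://fileset/user/root/.Trash/Current
    }
}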

Error message and/or stacktrace

-rm: Fatal internal error

java.lang.RuntimeException: Cannot load fileset: metalake_stg01.user.root..Trash from the server. exception: Failed to operate catalog(s) [user] operation [LOAD] under metalake [metalake_stg01], reason [Catalog metalake_stg01.user does not exist]

org.apache.gravitino.exceptions.NoSuchCatalogException: Catalog metalake_stg01.user does not exist
        at org.apache.gravitino.catalog.CatalogManager.loadCatalogInternal(CatalogManager.java:649)
        at org.apache.gravitino.shaded.com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:108)
        at org.apache.gravitino.shaded.com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
        at org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem.getFilesetContext(GravitinoVirtualFileSystem.java:386)
        at org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem.mkdirs(GravitinoVirtualFileSystem.java:545)
        at org.apache.hadoop.fs.TrashPolicyDefault.moveToTrash(TrashPolicyDefault.java:147)
        at org.apache.hadoop.fs.Trash.moveToTrash(Trash.java:110)
        at org.apache.hadoop.fs.Trash.moveToAppropriateTrash(Trash.java:96)
        at org.apache.hadoop.fs.shell.Delete$Rm.moveToTrash(Delete.java:153)
        at org.apache.hadoop.fs.shell.Delete$Rm.processPath(Delete.java:118)
        at org.apache.hadoop.fs.shell.Command.processPathInternal(Command.java:367)
        at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
        at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:304)
        at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:286)
        at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:270)
        at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:120)
        at org.apache.hadoop.fs.shell.Command.run(Command.java:177)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:328)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:391)

How to reproduce

  1. Create an HDFS fileset.
  2. Delete a sub-file under it with hadoop fs -rm, e.g. gvfs://fileset/{catalog}/{schema}/{fileset}/{sub_file} (a programmatic equivalent is sketched below).
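
A programmatic equivalent of step 2, in case it helps to reproduce outside the shell. This is a minimal sketch assuming the gvfs scheme is already configured on the client (e.g. via core-site.xml) and client-side trash is enabled:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class ReproduceTrashMove {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // A non-zero fs.trash.interval (in minutes) is what makes the shell's rm
        // go through the trash instead of calling delete() directly.
        conf.setLong("fs.trash.interval", 1440);
        // Same fileset sub-file as in the description above.
        Path p = new Path("gvfs://fileset/catalog/schema/hdfs_fileset_1/jstack-21727");
        try (FileSystem fs = p.getFileSystem(conf)) {
            // This is the call FsShell's rm makes (see the stack trace); the
            // NoSuchCatalogException surfaces here because the trash destination
            // is not a valid fileset path.
            Trash.moveToAppropriateTrash(fs, p, conf);
        }
    }
}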

Additional context

No response

xloya commented 1 week ago

@baitian77 Hi, I think your example and the error message do not exactly match. In your example the fileset virtual path is gvfs://fileset/catalog/schema/hdfs_fileset_1/jstack-21727, which would resolve the catalog name to catalog, not to the user that appears in the exception. Judging from the exception, it looks like the user fileset catalog has not been created yet.

xloya commented 1 week ago

Besides, moving deleted files to the trash is the default behavior of the Hadoop delete shell command. If you do not need the file to go to the trash, you can add the -skipTrash option, e.g. hadoop dfs -rm -skipTrash gvfs://fileset/{catalog}/{schema}/{fileset_name}/sub_dir.

baitian77 commented 1 week ago

@xloya To ensure data security, the -skipTrash option has been forcibly disabled in our environment, so deleted data always goes to the trash first to prevent accidental deletion. How can filesets be made compatible with the hadoop fs -rm command? Thanks.

xloya commented 1 week ago

Hi, if you cannot use the -skipTrash option, I don't think there is a way to delete directly unless you modify the Hadoop source code for the rm command. Another option is to use the Hadoop FileSystem API from Java / Scala code. I think this will delete the directory or file directly instead of moving it to the trash:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Test {
    public static void main(String[] args) throws IOException {
        // Replace the placeholders with your own catalog / schema / fileset names.
        Path filesetPath = new Path("gvfs://fileset/{your_catalog}/{your_schema}/{your_fileset}/sub_path");
        // Resolve the FileSystem registered for the gvfs:// scheme and delete the
        // path recursively; FileSystem#delete bypasses the trash.
        try (FileSystem fs = filesetPath.getFileSystem(new Configuration())) {
            fs.delete(filesetPath, true);
        }
    }
}
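
For the snippet above to resolve the gvfs:// scheme, the GVFS client jar has to be on the classpath and the Configuration needs the Gravitino client settings (either in core-site.xml or set programmatically). A minimal sketch; the property names are from the Gravitino GVFS docs as far as I remember for 0.6.x, and the server URI / metalake values are placeholders, so please verify against your deployment:

import org.apache.hadoop.conf.Configuration;

public class GvfsConf {
    public static Configuration gvfsConf() {
        Configuration conf = new Configuration();
        // Bind the gvfs:// scheme to the virtual filesystem implementation
        // (the class name that appears in the stack trace above).
        conf.set("fs.gvfs.impl",
            "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
        // Gravitino server endpoint and metalake to use; placeholder values.
        conf.set("fs.gravitino.server.uri", "http://localhost:8090");
        conf.set("fs.gravitino.client.metalake", "metalake_stg01");
        return conf;
    }
}
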
baitian77 commented 1 week ago

Sure, thank you for your response.