CODAIT / stocator

Stocator is a high-performing connector to object storage for Apache Spark, achieving performance by leveraging object storage semantics.
Apache License 2.0

Missing glob support when reading files #223

Open mtsargent opened 4 years ago

mtsargent commented 4 years ago

When reading multiple files at once with Spark, I would expect to be able to use wildcards and other general glob patterns (similar to the answer at https://stackoverflow.com/questions/24029873/how-to-read-multiple-text-files-into-a-single-rdd/24036343). The example is repeated here for convenience:

sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")

When using Stocator, attempting to read files in this way fails:

val junkcsv = spark.sqlContext.read.option("header", "true").load("cos://some-bucket.myCos/somefile.csv/part-0000[0-1]*")

Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: cos://some-bucket.myCos/somefile.csv/part-0000[0-1]*;

This failure happens even when there are files I would expect to match that pattern like:

cos://some-bucket.myCos/somefile.csv/part-00000.csv
cos://some-bucket.myCos/somefile.csv/part-00001.csv

The lack of glob support seems to be coming from the ObjectStoreFlatGlobFilter class: https://github.com/CODAIT/stocator/blob/c18f37b6dfc119e5ebfd2bf12c57de989e4a5ad5/src/main/java/com/ibm/stocator/fs/common/ObjectStoreFlatGlobFilter.java#L128-L134

The only type of matching attempted is a simple wildcard match, rather than an actual attempt at globbing.

The java.nio package may be able to support this type of matching. I have not yet built a custom version of Stocator, but the following matching code seems promising:

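// Needs java.nio.file.FileSystems, java.nio.file.Path, and java.nio.file.PathMatcher.
// Build a glob matcher from the pattern, collapsing double slashes first.
PathMatcher pm = FileSystems.getDefault().getPathMatcher("glob:" + pathPattern.replaceAll("//", "/"));
Path newPath = FileSystems.getDefault().getPath(pathStr);

// True when the candidate path matches the glob.
match = pm.matches(newPath);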

I am not familiar enough with the rest of the Stocator codebase to know whether adding this type of matching would break other parts of the code.
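To make the idea concrete, here is a minimal standalone sketch of glob filtering with java.nio. It is not Stocator code: the class GlobFilterSketch and the helper filterByGlob are hypothetical names, the object names are made up to mirror the example above, and it assumes a Unix-like default file system where "/" is the name separator. The cos:// scheme and bucket are left out so that only the object key is matched.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration; not part of Stocator.
final class GlobFilterSketch {

    // Returns the candidate names that match the glob pattern, in input order.
    static List<String> filterByGlob(String globPattern, List<String> candidates) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + globPattern);
        List<String> matched = new ArrayList<>();
        for (String name : candidates) {
            if (matcher.matches(Paths.get(name))) {
                matched.add(name);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Made-up object keys mirroring the part files from the example.
        List<String> names = Arrays.asList(
            "somefile.csv/part-00000.csv",
            "somefile.csv/part-00001.csv",
            "somefile.csv/part-00002.csv",
            "somefile.csv/_SUCCESS");
        System.out.println(filterByGlob("somefile.csv/part-0000[0-1]*", names));
    }
}
```

With these inputs, only the part-00000 and part-00001 names should be kept. Whether the same matcher can be dropped into ObjectStoreFlatGlobFilter unchanged is untested.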

gilv commented 4 years ago

@mtsargent you are not supposed to access part files directly. This is standard usage across the Hadoop ecosystem. Part files are internal files created by distributed tasks. You should never access them directly; instead, use ("cos://some-bucket.myCos/somefile.csv"), and then the globber is supported, of course.

mtsargent commented 4 years ago

Fair point about part files, but would you expect the Stocator globber to work with non-part files?

Suppose I try to use this to read in multiple files:

"cos://some-bucket.myCos/file-00[0-2]*"

Would you expect this to read in all of the following from my COS bucket?

file-000.txt
file-001.txt
file-002.txt

It should also ignore other files, for example:

file-003.txt
file-004.txt

I suppose I can just set up this scenario and test it out.

gilv commented 4 years ago

@mtsargent I expect exactly what you wrote. If this doesn't work, then it's a bug in Stocator and needs to be fixed, of course.

gilv commented 4 years ago

@mtsargent However, it's not clear how ranges in [x-y] should work; it matters whether they are numeric or literal. For example, with [aaxy-xyba], what would you expect to get? There might be thousands of objects, so how would they be identified? Or do you need only numeric ranges, so that [1-100] would mean 1, 2, ..., 99, 100?

mtsargent commented 4 years ago

I think each expression in brackets corresponds to only a single character. The syntax I am familiar with is described here: http://man7.org/linux/man-pages/man7/glob.7.html. [aaxy-xyba] would be the same as a single-character match out of [abxy], and [1-100] would be a single-character match, the same as writing [01] or [0-1].
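A quick way to check the single-character reading of [1-100] is sketched below. It uses the glob implementation in java.nio as a stand-in for glob(7); the two are not guaranteed to agree in every corner case. The class BracketSemanticsSketch and the helper globMatches are hypothetical names used only for this illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical helper class for illustration; not part of Stocator.
final class BracketSemanticsSketch {

    // True when the whole name matches the glob pattern.
    static boolean globMatches(String globPattern, String name) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + globPattern);
        return matcher.matches(Paths.get(name));
    }

    public static void main(String[] args) {
        String[] patterns = {"[1-100]", "[01]", "[0-1]"};
        String[] names = {"0", "1", "2", "10", "100"};
        // Print a small truth table: each bracket expression consumes one character.
        for (String pattern : patterns) {
            for (String name : names) {
                System.out.println(pattern + " vs " + name + " -> "
                    + globMatches(pattern, name));
            }
        }
    }
}
```

Under this reading, all three patterns should accept "0" and "1" and reject "2", "10", and "100", since a bracket expression never spans more than one character.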

At the very least, I can set up this test next time I am around my work computer. I can update this issue one way or the other (and can close the issue if matching works as expected).

gilv commented 4 years ago

@mtsargent Thanks. I think we support {} right now and [] is not supported, but I need to double-check. At least I don't see unit tests for [], only for {}: https://github.com/CODAIT/stocator/blob/master/src/test/java/com/ibm/stocator/fs/cos/systemtests/TestCOSGlobberBracketStocator.java

Will you be able to extend the code to also support []? It would be great if you could work on it.
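For reference, the file-00[0-2]* example from earlier in the thread can also be written with {} alternation. The sketch below compares the two forms using the glob syntax in java.nio only; it does not exercise Stocator's globber, so whether Stocator accepts the brace form for this case is an assumption that still needs testing. The class BraceGlobSketch and the helper globMatches are hypothetical names.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical helper class for illustration; not part of Stocator.
final class BraceGlobSketch {

    // True when the whole name matches the glob pattern.
    static boolean globMatches(String globPattern, String name) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + globPattern);
        return matcher.matches(Paths.get(name));
    }

    public static void main(String[] args) {
        String bracketForm = "file-00[0-2]*";
        String braceForm = "file-00{0,1,2}*";
        String[] names = {
            "file-000.txt", "file-001.txt", "file-002.txt",
            "file-003.txt", "file-004.txt"};
        // Both forms are expected to select the same names.
        for (String name : names) {
            System.out.println(name
                + " bracket=" + globMatches(bracketForm, name)
                + " brace=" + globMatches(braceForm, name));
        }
    }
}
```

In java.nio both patterns select file-000.txt through file-002.txt and skip the other two, so the brace form may serve as a workaround until [] is supported.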

mtsargent commented 4 years ago

This may be something I can try to take on. It likely wouldn't be for a few weeks at the earliest.