Missing glob support when reading files

mtsargent commented 4 years ago

When reading multiple files at once with Spark, I would expect to use wildcards/other general glob patterns (similar to the answer https://stackoverflow.com/questions/24029873/how-to-read-multiple-text-files-into-a-single-rdd/24036343). Example repeated here for simplicity:

sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")

When using Stocator, attempting to read files in this way fails:

val junkcsv = spark.sqlContext.read.option("header", "true").load("cos://some-bucket.myCos/somefile.csv/part-0000[0-1]*")

Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: cos://some-bucket.myCos/somefile.csv/part-0000[0-1]*;

This failure happens even when there are files I would expect to match that pattern like:

cos://some-bucket.myCos/somefile.csv/part-00000.csv
cos://some-bucket.myCos/somefile.csv/part-00001.csv

The lack of glob support seems to be coming from the ObjectStoreFlatGlobFilter class: https://github.com/CODAIT/stocator/blob/c18f37b6dfc119e5ebfd2bf12c57de989e4a5ad5/src/main/java/com/ibm/stocator/fs/common/ObjectStoreFlatGlobFilter.java#L128-L134

The only type of matching attempted is a simple wildcard match, rather than an actual attempt at globbing.

The java.nio package may be able to support this type of matching. I have not yet built a custom version of Stocator, but the following matching code seems promising:

// imports: java.nio.file.FileSystems, java.nio.file.Path, java.nio.file.PathMatcher
PathMatcher pm = FileSystems.getDefault().getPathMatcher("glob:" + pathPattern.replaceAll("//", "/"));
Path newPath = FileSystems.getDefault().getPath(pathStr);

match = pm.matches(newPath);

I am not familiar enough with the rest of the Stocator codebase to know if adding in this type of matching breaks other parts of the code drastically.
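
To make the idea above easier to try in isolation, here is a minimal standalone sketch of the same java.nio matching. It is not Stocator code: the class name NioGlobSketch and the helper matchesGlob are hypothetical, and it assumes a Unix-like default filesystem, where a path string such as "cos://..." is accepted and repeated slashes are collapsed.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;

// Hypothetical helper, not part of Stocator: wraps the java.nio glob matcher
// from the snippet above so it can be exercised on plain strings.
final class NioGlobSketch {

    // Returns true when pathStr matches the glob in pathPattern.
    // Assumption: Unix-like default filesystem (a ':' in the path is legal there).
    static boolean matchesGlob(String pathPattern, String pathStr) {
        // Collapse "//" (e.g. after the "cos:" scheme) so the pattern lines up
        // with the path, whose repeated slashes are collapsed by getPath.
        PathMatcher pm = FileSystems.getDefault()
                .getPathMatcher("glob:" + pathPattern.replaceAll("//", "/"));
        Path candidate = FileSystems.getDefault().getPath(pathStr);
        return pm.matches(candidate);
    }

    public static void main(String[] args) {
        String pattern = "cos://some-bucket.myCos/somefile.csv/part-0000[0-1]*";
        String[] candidates = {
            "cos://some-bucket.myCos/somefile.csv/part-00000.csv",
            "cos://some-bucket.myCos/somefile.csv/part-00001.csv",
            "cos://some-bucket.myCos/somefile.csv/part-00002.csv"
        };
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + matchesGlob(pattern, candidate));
        }
    }
}
```

Note that in java.nio globs a single * does not cross a "/" boundary, which may or may not be the behavior wanted for object names that contain slashes.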

gilv commented 4 years ago

@mtsargent you are not supposed to access the parts of a file. This is general Hadoop ecosystem usage. Parts are internal files that were created by distributed tasks. You should never access parts directly; rather, you need to use ("cos://some-bucket.myCos/somefile.csv"), and then the globber is supported, of course.

mtsargent commented 4 years ago

Fair point about part files, but would you anticipate the stocator globber to work with non-part files?

Suppose I try to use this to read in multiple files:

"cos://some-bucket.myCos/file-00[0-2]*"

Would you expect this to read in all of the following from my COS bucket?

file-000.txt
file-001.txt
file-002.txt

And would you expect it to ignore other files, for example:

file-003.txt
file-004.txt

I suppose I can just set up this scenario and test it out.

gilv commented 4 years ago

@mtsargent I expect exactly what you wrote. If this doesn't work, then it's a bug in Stocator and needs to be fixed, of course.

gilv commented 4 years ago

@mtsargent however, it's not clear how ranges in [x-y] should work. It's important to know whether they are numeric or literal. For example, with [aaxy-xyba], what would you expect to get? There might be thousands of objects, so how would we identify them? Or do you need only numeric ranges, so that [1-100] would be 1, 2, ..., 99, 100?

mtsargent commented 4 years ago

I think each expression in brackets only corresponds to a single character. The syntax I am familiar with is described here: http://man7.org/linux/man-pages/man7/glob.7.html. [aaxy-xyba] would be the same as a single character match out of [abxy], and [1-100] would be a single character match the same as writing [01] or [0-1].
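
As a rough illustration of the single-character semantics described above, the following sketch uses the java.nio glob matcher (not Stocator's own globber, whose behavior may differ). The class name BracketSemanticsSketch and the helper filter are hypothetical. It only exercises the numeric examples; a reversed range such as the y-x inside [aaxy-xyba] may be rejected as invalid by some implementations, so it is left out here.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: shows that a bracket expression matches exactly
// one character, using java.nio glob syntax as the reference behavior.
final class BracketSemanticsSketch {

    // Returns the names (in input order) that match the given glob.
    static List<String> filter(String glob, String... names) {
        PathMatcher pm = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> matched = new ArrayList<>();
        for (String name : names) {
            if (pm.matches(Paths.get(name))) {
                matched.add(name);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // [0-2] stands for one character out of 0, 1, 2.
        System.out.println(filter("file-00[0-2]*",
                "file-000.txt", "file-001.txt", "file-002.txt",
                "file-003.txt", "file-004.txt"));

        // [1-100] is the range 1-1 plus the characters 0 and 0,
        // so it is one character out of 0 and 1, not the numbers 1 to 100.
        System.out.println(filter("[1-100]", "0", "1", "5", "100"));
    }
}
```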

At the very least, I can set up this test next time I am around my work computer. I can update this issue one way or the other (and can close the issue if matching works as expected).

gilv commented 4 years ago

@mtsargent thanks. I think we support {} right now and [] is not supported, but I need to double-check. At least I don't see unit tests for [], only for {}: https://github.com/CODAIT/stocator/blob/master/src/test/java/com/ibm/stocator/fs/cos/systemtests/TestCOSGlobberBracketStocator.java

Would you be able to extend the code to also support []? It would be great if you could work on it.
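
Until [] is supported, one possible stopgap, given that {} appears to be supported, is to rewrite simple digit ranges into brace alternations before handing the path to the reader. The sketch below is only an assumption-laden illustration: the class BracketToBraceSketch and the method expandDigitRanges are hypothetical, it handles only the form [d-d] with single digits, and it has not been tested against Stocator's globber.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical workaround, not part of Stocator: rewrites [0-2] as {0,1,2}.
final class BracketToBraceSketch {

    // Matches only the simplest bracket form: one digit, a dash, one digit.
    private static final Pattern DIGIT_RANGE = Pattern.compile("\\[([0-9])-([0-9])\\]");

    static String expandDigitRanges(String glob) {
        Matcher m = DIGIT_RANGE.matcher(glob);
        StringBuilder out = new StringBuilder();
        int copiedUpTo = 0;
        while (m.find()) {
            char lo = m.group(1).charAt(0);
            char hi = m.group(2).charAt(0);
            if (lo > hi) {
                continue; // reversed range: leave the original text untouched
            }
            out.append(glob, copiedUpTo, m.start());
            out.append('{');
            for (int c = lo; c <= hi; c++) {
                if (c > lo) {
                    out.append(',');
                }
                out.append((char) c);
            }
            out.append('}');
            copiedUpTo = m.end();
        }
        // Copy whatever follows the last rewritten range.
        out.append(glob, copiedUpTo, glob.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expandDigitRanges("cos://some-bucket.myCos/file-00[0-2]*"));
    }
}
```

Whether the resulting pattern (a brace group followed by *) is handled correctly by the existing {} support would still need to be checked.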

mtsargent commented 4 years ago

This may be something I can try to take on. It likely wouldn't be for a few weeks at the earliest.

Missing glob support when reading files #223