Open mtsargent opened 4 years ago
@mtsargent you are not supposed to access part files directly. This is general Hadoop ecosystem usage: part files are internal files created by distributed tasks. You should never access them directly; rather, you need to use ("cos://some-bucket.myCos/somefile.csv"), and then the globber is supported, of course.
Fair point about part files, but would you anticipate the stocator globber to work with non-part files?
Suppose I try to use this to read in multiple files:
"cos://some-bucket.myCos/file-00[0-2]*"
Would you expect this to read in all of the following from my COS bucket?
file-000.txt file-001.txt file-002.txt
While also ignoring other files. Example:
file-003.txt file-004.txt
I suppose I can just set up this scenario and test it out.
@mtsargent I expect exactly what you wrote. If this doesn't work, then it's a bug in Stocator and needs to be fixed, of course.
@mtsargent however, it's not clear how ranges in [x-y] should work; it's important to know whether they are numeric or literal. For example, with [aaxy-xyba], what would you expect to get? There might be thousands of objects, so how would we identify them? Or do you need only numeric ranges, so that [1-100] means 1, 2, ..., 99, 100?
I think each expression in brackets only corresponds to a single character. The syntax I am familiar with is described here: http://man7.org/linux/man-pages/man7/glob.7.html. [aaxy-xyba] would be the same as a single character match out of [abxy], and [1-100] would be a single character match the same as writing [01] or [0-1].
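To illustrate the single-character reading of bracket expressions, here is a minimal Java sketch that uses the JDK's java.nio.file glob matcher as a stand-in for glob(7). This is not Stocator code: the class name BracketDemo and the helper globMatches are hypothetical, and Java's glob syntax is close to, but not identical to, glob(7).

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class BracketDemo {
    // Hypothetical helper: true if the single path element `name` matches `pattern`.
    static boolean globMatches(String pattern, String name) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        return matcher.matches(Paths.get(name));
    }

    public static void main(String[] args) {
        // [1-100] is the range 1-1 followed by two literal 0s,
        // so it matches exactly one character: '0' or '1'.
        System.out.println(globMatches("[1-100]", "0"));   // true
        System.out.println(globMatches("[1-100]", "1"));   // true
        System.out.println(globMatches("[1-100]", "5"));   // false
        System.out.println(globMatches("[1-100]", "100")); // false: three characters
    }
}
```

Under this reading, a numeric range such as 1 through 100 cannot be written as a single bracket expression; it would need several single-character brackets or a {} alternation.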
At the very least, I can set up this test next time I am around my work computer. I can update this issue one way or the other (and can close the issue if matching works as expected).
@mtsargent thanks. I think we support {} right now and [] is not supported, but I need to double-check. At least I don't see unit tests for [], only for {}: https://github.com/CODAIT/stocator/blob/master/src/test/java/com/ibm/stocator/fs/cos/systemtests/TestCOSGlobberBracketStocator.java
Would you be able to extend the code to also support []? It would be great if you could work on it.
This may be something I can try to take on. It likely wouldn't be for a few weeks at the earliest.
When reading multiple files at once with Spark, I would expect to use wildcards/other general glob patterns (similar to the answer https://stackoverflow.com/questions/24029873/how-to-read-multiple-text-files-into-a-single-rdd/24036343). Example repeated here for simplicity:
When using Stocator, attempting to read files in this way fails:
val junkcsv = spark.sqlContext.read.option("header", "true").load("cos://some-bucket.myCos/somefile.csv/part-0000[0-1]*")
This failure happens even when there are files I would expect to match that pattern like:
The lack of glob support seems to be coming from the ObjectStoreFlatGlobFilter class: https://github.com/CODAIT/stocator/blob/c18f37b6dfc119e5ebfd2bf12c57de989e4a5ad5/src/main/java/com/ibm/stocator/fs/common/ObjectStoreFlatGlobFilter.java#L128-L134
The only type of matching attempted is a simple wildcard match, rather than an actual attempt at globbing.
The java.nio package may be able to support this type of matching. I have not yet built a custom version of Stocator, but the following matching code seems promising:
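A minimal sketch of this kind of matching, assuming the glob is applied only to the last element of each object name. The class name GlobMatchSketch and the method filter are hypothetical and are not part of Stocator; only the java.nio.file calls are real JDK API.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GlobMatchSketch {
    // Hypothetical helper: keeps the object names whose last path element matches the glob.
    static List<String> filter(String glob, List<String> objectNames) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> matched = new ArrayList<>();
        for (String name : objectNames) {
            // Match against the text after the final '/', if there is one.
            String last = name.substring(name.lastIndexOf('/') + 1);
            if (matcher.matches(Paths.get(last))) {
                matched.add(name);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
            "file-000.txt", "file-001.txt", "file-002.txt",
            "file-003.txt", "file-004.txt");
        // Prints [file-000.txt, file-001.txt, file-002.txt]
        System.out.println(filter("file-00[0-2]*", names));
    }
}
```

One caveat with this approach: PathMatcher works on Path objects from the default file system, so object names containing characters that are not valid in local paths may be rejected before any matching happens. Translating the glob to a regular expression would avoid that, at the cost of more code.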
I am not familiar enough with the rest of the Stocator codebase to know if adding in this type of matching breaks other parts of the code drastically.