blaze / odo

Data Migration for the Blaze Project
http://odo.readthedocs.org/
BSD 3-Clause "New" or "Revised" License

Unable to load sharded data from files without filetype extensions #573

Open mhlr opened 7 years ago

mhlr commented 7 years ago

I am trying to load data from a directory of JSON Lines formatted files which lack the .json extension.

I have tried:

data('/path/to/dir/')
data('/path/to/dir/*')
data(JSONLines('/path/to/dir/'))
data(JSONLines('/path/to/dir/*'))
data(Directory(JSONLines)('/path/to/dir/'))
data(Directory(JSONLines)('/path/to/dir/*'))

all of which throw either Unable to parse uri to data resource or No such file or directory.

I am able to parse a single file with:

data(JSONLines('/path/to/dir/file1'))

Is this a bug / unimplemented functionality or am I doing something wrong?

llllllllll commented 7 years ago

When using just data, blaze delegates to odo.resource which uses a sequence of regular expressions to resolve the uri to a type. If there is no extension, you will need to manually construct the box type (for example JSONLines) so odo and blaze know what the uri is.
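
To illustrate the point above, here is a minimal sketch of extension-based resolution. It is not odo's actual code: the `PATTERNS` table, the `resolve` function, and the stand-in `JSONLines` and `CSV` classes are hypothetical, and the regular expressions are only examples. It shows why a path with no extension falls through to the error, while constructing the type by hand skips the lookup.

```python
import re


# Hypothetical stand-ins for "box" types; these are not odo's classes.
class JSONLines:
    def __init__(self, path):
        self.path = path


class CSV:
    def __init__(self, path):
        self.path = path


# Illustrative patterns only: each regex maps a uri to a type.
PATTERNS = [
    (re.compile(r".+\.jsonlines$"), JSONLines),
    (re.compile(r".+\.csv$"), CSV),
]


def resolve(uri):
    """Return an instance of the first type whose pattern matches the uri."""
    for pattern, box in PATTERNS:
        if pattern.match(uri):
            return box(uri)
    # Nothing matched, so the type cannot be inferred from the name alone.
    raise NotImplementedError("Unable to parse uri to data resource: " + uri)
```

With this sketch, `resolve('/path/to/dir/file1')` raises, whereas `JSONLines('/path/to/dir/file1')` works because no inference is needed.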

My intuition is that data(Directory(JSONLines)('/path/to/dir/')) is the correct call. Does that produce No such file or directory? If so, can you confirm that the path actually exists? Also, maybe try removing the trailing slash. If removing the trailing slash fixes the problem, that is certainly a bug.

mhlr commented 7 years ago

@llllllllll

I have the files

/home/dm/wikipedia/AA/wiki_00
...
/home/dm/wikipedia/AA/wiki_99

When I run

d = data(Directory(JSON)('/home/dm/wikipedia/AA/'))

I get

NotImplementedError                       Traceback (most recent call last)
<ipython-input-1-a1471c617821> in <module>()
----> 1 import codecs, os;__pyfile = codecs.open('''/tmp/py7956cPk''', encoding='''utf-8''');__code = __pyfile.read().encode('''utf-8''');__pyfile.close();os.remove('''/tmp/py7956cPk''');exec(compile(__code, '''/home/dm/Scripts/vndf.py''', 'exec'));

/home/dm/Scripts/vndf.py in <module>()
     14 d = data(Directory(JSONLines)('/home/dm/wikipedia/AA/wiki_01'))
     15 
---> 16 #d = data(Directory(JSONLines)('/home/dm/wikipedia/AA/'))
     17 
     18 #d = data(Directory(JSONLines)('/home/dm/wikipedia/AA/.*'))

/home/dm/anaconda3/lib/python3.6/site-packages/blaze/interactive.py in data(data_source, dshape, name, fields, schema, **kwargs)
    151         dshape = datashape.dshape(dshape)
    152     if not dshape:
--> 153         dshape = discover(data_source)
    154         types = None
    155         if isinstance(dshape.measure, Tuple) and fields:

/home/dm/anaconda3/lib/python3.6/site-packages/multipledispatch/dispatcher.py in __call__(self, *args, **kwargs)
    162             self._cache[types] = func
    163         try:
--> 164             return func(*args, **kwargs)
    165 
    166         except MDNotImplementedError:

/home/dm/anaconda3/lib/python3.6/site-packages/odo/directory.py in discover_Directory(c, **kwargs)
     48 @discover.register(_Directory)
     49 def discover_Directory(c, **kwargs):
---> 50     return var * discover(first(c)).subshape[0]
     51 
     52 

/home/dm/anaconda3/lib/python3.6/site-packages/toolz/itertoolz.py in first(seq)
    366     'A'
    367     """
--> 368     return next(iter(seq))
    369 
    370 

/home/dm/anaconda3/lib/python3.6/site-packages/odo/directory.py in <genexpr>(.0)
     32     def __iter__(self):
     33         return (resource(os.path.join(self.path, fn), **self.kwargs)
---> 34                     for fn in sorted(os.listdir(self.path)))
     35 
     36 

/home/dm/anaconda3/lib/python3.6/site-packages/odo/regex.py in __call__(self, s, *args, **kwargs)
     89 
     90     def __call__(self, s, *args, **kwargs):
---> 91         return self.dispatch(s)(s, *args, **kwargs)
     92 
     93     @property

/home/dm/anaconda3/lib/python3.6/site-packages/odo/resource.py in resource_all(uri, *args, **kwargs)
     98     discover
     99     """
--> 100     raise NotImplementedError("Unable to parse uri to data resource: " + uri)
    101 
    102 

NotImplementedError: Unable to parse uri to data resource: /home/dm/wikipedia/AA/wiki_00

Note that the error message contains the name of a specific file, so blaze is seeing the directory and the files therein. It is just getting confused somehow.

mhlr commented 7 years ago

@llllllllll

Leaving off the final '/' makes no difference.

llllllllll commented 7 years ago

Ah, it looks like Directory isn't respecting that you have told it the type of the resource already:

/home/dm/anaconda3/lib/python3.6/site-packages/odo/directory.py in <genexpr>(.0)
     32     def __iter__(self):
     33         return (resource(os.path.join(self.path, fn), **self.kwargs)
---> 34                     for fn in sorted(os.listdir(self.path)))
     35 
     36 

This is the incorrect frame. The basic idea is that we need to treat "bound" Directory subclasses differently in __iter__. It is pretty late here but I should be able to fix this tomorrow.
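
The idea of treating a "bound" directory differently in `__iter__` could look roughly like the sketch below. This is an assumption about the shape of a fix, not the actual patch: the `container` attribute, the `bind` helper, and the stand-in `resource` and `JSONLines` are hypothetical and do not come from odo.

```python
import os
import tempfile


class JSONLines:
    """Hypothetical stand-in for a file type that is known up front."""

    def __init__(self, path, **kwargs):
        self.path = path


def resource(uri, **kwargs):
    """Stand-in for name-based lookup; fails when there is no extension."""
    if uri.endswith(".json"):
        return JSONLines(uri, **kwargs)
    raise NotImplementedError("Unable to parse uri to data resource: " + uri)


class Directory:
    container = None  # None means unbound: infer each file's type by name

    def __init__(self, path, **kwargs):
        self.path = path
        self.kwargs = kwargs

    def __iter__(self):
        # A bound subclass already knows the type, so it skips resource().
        make = self.container if self.container is not None else resource
        return (make(os.path.join(self.path, fn), **self.kwargs)
                for fn in sorted(os.listdir(self.path)))


def bind(container):
    """Return a Directory subclass whose files are all of type `container`."""
    name = "Directory(%s)" % container.__name__
    return type(name, (Directory,), {"container": container})


def demo():
    """Iterate a bound directory of extension-less files."""
    with tempfile.TemporaryDirectory() as d:
        for fn in ("wiki_00", "wiki_01"):
            with open(os.path.join(d, fn), "w") as f:
                f.write('{"id": 1}\n')
        return [type(r).__name__ + ":" + os.path.basename(r.path)
                for r in bind(JSONLines)(d)]
```

In this sketch the unbound `Directory` still goes through `resource` and fails on extension-less names, while `bind(JSONLines)` constructs each entry directly.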

mhlr commented 7 years ago

@llllllllll Thanks

mhlr commented 7 years ago

What would be the way to supply type information when using a file pattern rather than the whole directory, eg.:

data('/home/dm/wikipedia/AA/wiki_0*')

llllllllll commented 7 years ago

The call I showed before is the correct way to do it; it is just broken. I'm not sure there is a simple workaround other than adding an extension. This should be a small fix though.

I am about to go to sleep but I'll fix this tomorrow.
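
Until a fix lands, one way to sidestep type resolution for a file pattern is to read the files with the standard library. The sketch below assumes every matching file holds one JSON object per line and that the data fits in memory; `read_jsonlines` is a hypothetical helper, not part of odo or blaze.

```python
import glob
import json


def read_jsonlines(pattern):
    """Collect the records from every file matching a glob pattern."""
    records = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    records.append(json.loads(line))
    return records
```

A call such as `read_jsonlines('/home/dm/wikipedia/AA/wiki_0*')` returns a plain list of dicts, so nothing has to be inferred from the file names.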

mhlr commented 7 years ago

@llllllllll Cool, thanks! I think this is not JSON-specific, though. I have tried things like Directory(TextFile) and also Directory(Directory(JSONLines)) pointed at the parent directory, and both exhibit the same problem. That makes me think that it is primarily a Directory problem. In the second case the error was about the inner directory; it did not reach through to the files before failing. I wonder whether a similar problem also affects some of the other modifiers like S3, SSH and HDFS.