Open mhlr opened 7 years ago
When using just data, blaze delegates to odo.resource, which uses a sequence of regular expressions to resolve the uri to a type. If there is no extension, you will need to manually construct the box type (for example JSONLines) so odo and blaze know what the uri is.
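To illustrate why a missing extension matters, here is a minimal sketch of regex-based dispatch. It is a toy stand-in, not odo's actual implementation: RegexDispatcher, resource_json and resource_csv are hypothetical names, and only the error message is copied from the traceback in this thread.

```python
import re


class RegexDispatcher:
    """Toy dispatcher: picks a handler by matching the uri against regexes."""

    def __init__(self):
        self._handlers = []  # list of (compiled pattern, handler function)

    def register(self, pattern):
        def decorator(func):
            self._handlers.append((re.compile(pattern), func))
            return func
        return decorator

    def __call__(self, uri):
        # First registered pattern that matches wins.
        for pattern, func in self._handlers:
            if pattern.match(uri):
                return func(uri)
        # Nothing matched, so the type of the uri cannot be inferred.
        raise NotImplementedError(
            "Unable to parse uri to data resource: " + uri)


resource = RegexDispatcher()


@resource.register(r".+\.json$")
def resource_json(uri):
    return ("JSON", uri)


@resource.register(r".+\.csv$")
def resource_csv(uri):
    return ("CSV", uri)
```

With this toy, resource('/tmp/a.json') resolves to the JSON handler, while a name without an extension such as wiki_00 falls through every pattern and raises NotImplementedError.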
My intuition is that data(Directory(JSONLines)('/path/to/dir/')) is the correct call. Does that produce No such file or directory? If so, can you confirm that the path actually exists? Also, maybe try removing the trailing slash. If removing the trailing slash fixes the problem, that is certainly a bug.
@llllllllll
I have the files
/home/dm/wikipedia/AA/wiki_00
...
/home/dm/wikipedia/AA/wiki_99
When I run
d = data(Directory(JSON)('/home/dm/wikipedia/AA/'))
I get
NotImplementedError Traceback (most recent call last)
<ipython-input-1-a1471c617821> in <module>()
----> 1 import codecs, os;__pyfile = codecs.open('''/tmp/py7956cPk''', encoding='''utf-8''');__code = __pyfile.read().encode('''utf-8''');__pyfile.close();os.remove('''/tmp/py7956cPk''');exec(compile(__code, '''/home/dm/Scripts/vndf.py''', 'exec'));
/home/dm/Scripts/vndf.py in <module>()
14 d = data(Directory(JSONLines)('/home/dm/wikipedia/AA/wiki_01'))
15
---> 16 #d = data(Directory(JSONLines)('/home/dm/wikipedia/AA/'))
17
18 #d = data(Directory(JSONLines)('/home/dm/wikipedia/AA/.*'))
/home/dm/anaconda3/lib/python3.6/site-packages/blaze/interactive.py in data(data_source, dshape, name, fields, schema, **kwargs)
151 dshape = datashape.dshape(dshape)
152 if not dshape:
--> 153 dshape = discover(data_source)
154 types = None
155 if isinstance(dshape.measure, Tuple) and fields:
/home/dm/anaconda3/lib/python3.6/site-packages/multipledispatch/dispatcher.py in __call__(self, *args, **kwargs)
162 self._cache[types] = func
163 try:
--> 164 return func(*args, **kwargs)
165
166 except MDNotImplementedError:
/home/dm/anaconda3/lib/python3.6/site-packages/odo/directory.py in discover_Directory(c, **kwargs)
48 @discover.register(_Directory)
49 def discover_Directory(c, **kwargs):
---> 50 return var * discover(first(c)).subshape[0]
51
52
/home/dm/anaconda3/lib/python3.6/site-packages/toolz/itertoolz.py in first(seq)
366 'A'
367 """
--> 368 return next(iter(seq))
369
370
/home/dm/anaconda3/lib/python3.6/site-packages/odo/directory.py in <genexpr>(.0)
32 def __iter__(self):
33 return (resource(os.path.join(self.path, fn), **self.kwargs)
---> 34 for fn in sorted(os.listdir(self.path)))
35
36
/home/dm/anaconda3/lib/python3.6/site-packages/odo/regex.py in __call__(self, s, *args, **kwargs)
89
90 def __call__(self, s, *args, **kwargs):
---> 91 return self.dispatch(s)(s, *args, **kwargs)
92
93 @property
/home/dm/anaconda3/lib/python3.6/site-packages/odo/resource.py in resource_all(uri, *args, **kwargs)
98 discover
99 """
--> 100 raise NotImplementedError("Unable to parse uri to data resource: " + uri)
101
102
NotImplementedError: Unable to parse uri to data resource: /home/dm/wikipedia/AA/wiki_00
Note that the error message contains the name of a specific file, so blaze is seeing the directory and the files therein. It is just getting confused somehow.
@llllllllll
Leaving off the final '/' makes no difference.
Ah, it looks like Directory isn't respecting that you have told it the type of the resource already:
/home/dm/anaconda3/lib/python3.6/site-packages/odo/directory.py in <genexpr>(.0)
32 def __iter__(self):
33 return (resource(os.path.join(self.path, fn), **self.kwargs)
---> 34 for fn in sorted(os.listdir(self.path)))
35
36
this is the incorrect frame.
The basic idea is that we need to treat "bound" Directory subclasses differently in __iter__. It is pretty late here but I should be able to fix this tomorrow.
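A rough sketch of that idea, using toy classes rather than odo's real ones: when the directory type has been bound to a file type, __iter__ constructs that type directly instead of guessing from the file name. BaseDirectory, directory_of, guess_resource and the JSONLines stub below are all hypothetical stand-ins, and this is not the actual patch.

```python
import os


class JSONLines:
    """Stand-in for a file-type wrapper; it only stores the path."""

    def __init__(self, path, **kwargs):
        self.path = path


def guess_resource(path, **kwargs):
    """Stand-in for extension-based lookup; fails without a known extension."""
    if path.endswith(".json"):
        return JSONLines(path, **kwargs)
    raise NotImplementedError("Unable to parse uri to data resource: " + path)


class BaseDirectory:
    container = None  # bound subclasses set this to a file type

    def __init__(self, path, **kwargs):
        self.path = path
        self.kwargs = kwargs

    def __iter__(self):
        # Use the bound type if one was given; otherwise guess from the name.
        make = self.container if self.container is not None else guess_resource
        return (make(os.path.join(self.path, fn), **self.kwargs)
                for fn in sorted(os.listdir(self.path)))


def directory_of(cls):
    """Create a directory type bound to the given file type."""
    return type("Directory(%s)" % cls.__name__, (BaseDirectory,),
                {"container": cls})
```

In this sketch, iterating directory_of(JSONLines)(path) yields JSONLines objects even for files without an extension, while iterating an unbound BaseDirectory(path) over the same files still raises NotImplementedError.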
@llllllllll Thanks
What would be the way to supply type information when using a file pattern rather than the whole directory, e.g.:
data('/home/dm/wikipedia/AA/wiki_0*')
The call I showed before is the correct way to do it, it is just broken. I'm not sure there is a simple workaround other than adding an extension. This should be a small fix though.
I am about to go to sleep but I'll fix this tomorrow.
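Until a fix lands, one stopgap that avoids type inference entirely is to read the files with the standard library. This is only a sketch under assumptions: it assumes each file holds one JSON object per line, iter_jsonlines is a hypothetical helper rather than part of blaze or odo, and it returns plain Python objects, not a blaze expression.

```python
import glob
import json


def iter_jsonlines(pattern):
    """Yield one parsed object per non-empty line of every file matching pattern."""
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)
```

For example, list(iter_jsonlines('/home/dm/wikipedia/AA/wiki_0*')) would collect the records from every file matching the pattern, in sorted file order.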
@llllllllll Cool, thanks! I think this is not JSON specific though.
I have tried things like Directory(TextFile) and also Directory(Directory(JsonLines)) pointed at the parent directory, and both exhibit the same problem. That makes me think that it is primarily a Directory problem.
In the second case the error was about the inner directory; it did not reach through to the file before failing.
I wonder whether a similar problem also affects some of the other modifiers like S3, SSH and HDFS.
I am trying to load data from a directory of jsonlines-formatted files which lack the .json extension.
I have tried:
all of which throw either Unable to parse uri to data resource or No such file or directory.
I am able to parse a single file with:
Is this a bug / unimplemented functionality or am I doing something wrong?