Python db.load raise exception when loading from urlopen()

edigiacomo commented 9 years ago

Start a web server from the top-level source directory:

$ python -m SimpleHTTPServer 8000

The code below sometimes raises KeyError: 'could not detect the encoding of ' or OSError: reading a 5505024-bytes record from : Illegal seek.

# test.py
import dballe
from urllib2 import urlopen
from glob import glob

db = dballe.DB.connect_from_file("/tmp/buttami.db")
db.reset()
for f in glob("extra/bufr/*.bufr"):
    r = urlopen("http://localhost:8000/{}".format(f))
    db.load(r)

Inspecting with gdb, it seems that the encoding is sometimes not detected: the line int c = getc(stream); returns 255 (encoding not detected) or 0 (an AOF file is created):

$ gdb python
(gdb) l dballe/file.cc:98
93      if (c == EOF)
94          return create(BUFR, st.release(), close_on_exit, name);
95  
96      if (ungetc(c, stream) == EOF)
97          error_system::throwf("cannot put the first byte of %s back into the input stream", name.c_str());
98  
99      switch (c)
100     {
101         case 'B': return create(BUFR, st.release(), close_on_exit, name);
102         case 'C': return create(CREX, st.release(), close_on_exit, name);
(gdb) b dballe/file.cc:98
(gdb) r test.py
Starting program: /usr/bin/python test.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Breakpoint 1, dballe::File::create (stream=0x7fbc60, close_on_exit=close_on_exit@entry=true, name="") at file.cc:99
99      switch (c)
(gdb) p c
$1 = 0

spanezz commented 9 years ago

Auto detection of encoding is done only on the first byte of the input, but there are examples in extra/bufr that have other leading data at the beginning of the file, which confuses autodetection.
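As a rough illustration of why leading data breaks detection, here is a minimal Python sketch of a first-byte check modelled on the switch shown in the gdb listing. The function name detect_encoding is hypothetical and is not part of dballe; the mapping of a 0 byte to AOF and of an empty stream to BUFR is taken from the report above, and the rest is an assumption.

```python
import io


def detect_encoding(stream):
    """Guess the encoding from the first byte only (illustrative, not dballe code)."""
    first = stream.read(1)
    if first == b"":
        # Empty input: the listing falls back to BUFR on EOF.
        return "BUFR"
    if first == b"B":
        return "BUFR"
    if first == b"C":
        return "CREX"
    if first == b"\x00":
        # Per the report, a first byte of 0 selects AOF.
        return "AOF"
    # Any other first byte cannot be classified.
    raise KeyError("could not detect the encoding")


if __name__ == "__main__":
    print(detect_encoding(io.BytesIO(b"BUFR payload")))
    try:
        # Leading bytes before the BUFR marker defeat a first-byte check.
        detect_encoding(io.BytesIO(b"\r\n\r\nBUFR payload"))
    except KeyError as exc:
        print("detection failed:", exc)
```

A file that starts with anything other than the expected marker byte fails here even though a valid BUFR message follows, which is the situation described for some of the files in extra/bufr.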

I have committed the option to specify the encoding from python explicitly:

In [3]: db.load?
Type:        builtin_function_or_method
String form: <built-in method load of dballe.DB object at 0x7f7aaa6d2290>
Docstring:
load(fp, encoding=None)

Load a file object in the database. An encoding can optionally be
provided as a string ("BUFR", "CREX", "AOF"). If encoding is None then
load will try to autodetect based on the first byte of the file.

So now you can do this and it should work:

    db.load(r, "BUFR")
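Applied to the loop from the original report, that could look like the sketch below. The helper load_all and its opener parameter are hypothetical names introduced only for illustration; the only dballe call used is load(fp, encoding) as documented in the docstring above. It is written for Python 3 (urllib.request instead of urllib2), which is an assumption about the environment.

```python
from urllib.request import urlopen


def load_all(db, urls, encoding="BUFR", opener=urlopen):
    """Load each URL into db, always passing the encoding explicitly."""
    loaded = 0
    for url in urls:
        response = opener(url)
        try:
            # Skip autodetection by naming the encoding, as in db.load(r, "BUFR").
            db.load(response, encoding)
        finally:
            response.close()
        loaded += 1
    return loaded
```

With a connected dballe.DB as db, something like load_all(db, ["http://localhost:8000/" + f for f in glob("extra/bufr/*.bufr")]) would mirror test.py.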

edigiacomo commented 9 years ago

Thank you! However, when encoding is None, the BUFR encoding is set, so the docstring should be updated.

spanezz commented 9 years ago

Should be fixed now. Rather than documenting that BUFR is set when encoding is None, I fixed the behaviour so that autodetection is attempted when encoding is None.

ARPA-SIMC / dballe

Python db.load raise exception when loading from urlopen() #13