JuliaDatabases / ODBC.jl

An ODBC interface for the Julia programming language
https://odbc.juliadatabases.org/stable
Other
106 stars 63 forks source link

invalid UTF-8 sequence #25

Closed randyzwitch closed 10 years ago

randyzwitch commented 10 years ago

Sorry, today is cause @karbarcca heartburn day...

I'm getting an error about UTF-8 errors; I suspect there is malformed data in this field, but it's causing Julia to choke:

#Get lots of user-agent strings
ua_df = query("Select distinct ua from leads limit 1000000;");

invalid UTF-8 sequence
while loading In[2], in expression starting on line 2
 in convert at utf8.jl:155
 in ODBCClean at /Users/randyzwitch/.julia/ODBC/src/backend.jl:119
 in ODBCFetchDataFrame at /Users/randyzwitch/.julia/ODBC/src/backend.jl:135
 in query at /Users/randyzwitch/.julia/ODBC/src/userfacing.jl:57
 in query at /Users/randyzwitch/.julia/ODBC/src/userfacing.jl:43

querymeta("Select distinct ua from leads limit 1000000;")

Resultset metadata for executed query
------------------------------------
Query: Select distinct ua from leads limit 1000000;Columns: 1
Rows: 1000000
1x5 DataFrame:
        Column Names                    Types Sizes Digits Nullable
[1,]            "ua" ("SQL_WLONGVARCHAR",-10)   512      0        1
quinnj commented 10 years ago

Are you tracking ODBC master? (i.e. what's the printout of Pkg.status()?) The UTF16 package was recently merged into Base, so I've been trying to tweak to make sure ODBC keeps up with that. I may have broken something along the way though.

randyzwitch commented 10 years ago

Updated, still have the error.

julia> Pkg.status()
15 required packages:
 - AWS                           0.0.0
 - Clustering                    0.2.3
 - DataFrames                    0.4.1
 - GLM                           0.2.2
 - Gadfly                        0.1.26
 - Homebrew                      0.0.4
 - IJulia                        0.1.0
 - LazySequences                 0.1.0
 - ODBC                          0.3.1
 - PyCall                        0.2.1
 - RDatasets                     0.0.2
 - Stats                         0.2.8
 - TextAnalysis                  0.0.1
 - Vega                          0.0.0
 - YAML                          0.1.6
28 additional packages:
 - BinDeps                       0.2.12
 - Blocks                        0.0.0
 - Calendar                      0.4.0
 - Codecs                        0.1.0
 - Color                         0.2.6
 - Compose                       0.1.20
 - DataArrays                    0.0.1
 - Datetime                      0.1.2
 - Distance                      0.2.4
 - Distributions                 0.2.11
 - GZip                          0.2.5
 - GeoIP                         0.0.0+             master (dirty)
 - HTTPClient                    0.0.0
 - ICU                           0.2.3
 - Iterators                     0.1.1
 - JSON                          0.3.1
 - Languages                     0.0.1
 - LibCURL                       0.0.0
 - LibExpat                      0.0.0
 - Loess                         0.0.1
 - Nettle                        0.1.2
 - NumericExtensions             0.2.18
 - REPLCompletions               0.0.0
 - SortingAlgorithms             0.0.1
 - UAParser                      0.0.0-             master (unregistered)
 - URIParser                     0.0.0
 - UTF16                         0.3.0
 - ZMQ                           0.1.7

julia> Pkg.rm("ODBC")
INFO: Removing ODBC v0.3.1
INFO: Removing UTF16 v0.3.0
INFO: REQUIRE updated

julia> Pkg.clone("https://github.com/karbarcca/ODBC.jl.git")
INFO: Cloning ODBC from https://github.com/karbarcca/ODBC.jl.git
INFO: Computing changes...
INFO: No packages to install, update or remove

julia> Pkg.update()
INFO: Updating METADATA...
INFO: Updating ODBC...
INFO: Updating UAParser...
INFO: Computing changes...
INFO: No packages to install, update or remove

julia> Pkg.status()
14 required packages:
 - AWS                           0.0.0
 - Clustering                    0.2.3
 - DataFrames                    0.4.1
 - GLM                           0.2.2
 - Gadfly                        0.1.26
 - Homebrew                      0.0.4
 - IJulia                        0.1.0
 - LazySequences                 0.1.0
 - PyCall                        0.2.1
 - RDatasets                     0.0.2
 - Stats                         0.2.8
 - TextAnalysis                  0.0.1
 - Vega                          0.0.0
 - YAML                          0.1.6
28 additional packages:
 - BinDeps                       0.2.12
 - Blocks                        0.0.0
 - Calendar                      0.4.0
 - Codecs                        0.1.0
 - Color                         0.2.6
 - Compose                       0.1.20
 - DataArrays                    0.0.1
 - Datetime                      0.1.2
 - Distance                      0.2.4
 - Distributions                 0.2.11
 - GZip                          0.2.5
 - GeoIP                         0.0.0+             master (dirty)
 - HTTPClient                    0.0.0
 - ICU                           0.2.3
 - Iterators                     0.1.1
 - JSON                          0.3.1
 - Languages                     0.0.1
 - LibCURL                       0.0.0
 - LibExpat                      0.0.0
 - Loess                         0.0.1
 - Nettle                        0.1.2
 - NumericExtensions             0.2.18
 - ODBC                          0.3.1+             master
 - REPLCompletions               0.0.0
 - SortingAlgorithms             0.0.1
 - UAParser                      0.0.0-             master (unregistered)
 - URIParser                     0.0.0
 - ZMQ                           0.1.7
quinnj commented 10 years ago

Sorry to be slow on this Randy. Yeah, this is a tricky issue because without knowing exactly what the characters are that are causing the invalid sequence it's hard to tell if it's the ODBC problem or a data quality problem. Anyway, I pushed a fix that hopefully doesn't barf when returning results. In my limited testing, I get the \ufffd "replacement" character when trying to convert invalid characters to UTF-8. Let me know how it looks/works for you.

ghost commented 10 years ago

Sorry to be the bearer of bad news. It seems like this change has broken the results allocation fix in issue 26. I'll put some detail in that issue.

quinnj commented 10 years ago

Let me know if you see the fix @randyzwitch, I've pushed quite a few changes recently to speed up/fix the result fetching process, so hopefully things are better now.

randyzwitch commented 10 years ago

Thanks for taking a look at this Jacob, no worries on the timing! I'll play around with this and let you know what I find.

randyzwitch commented 10 years ago

I got an error:

julia> ua_df = query("Select distinct ua from leads limit 1000000;");
ERROR: no method unsafe_copy!(Ptr{UTF8String}, Ptr{Uint32}, Int64)
 in ODBCCopy! at /Users/randyzwitch/.julia/ODBC/src/backend.jl:129
 in ODBCFetchDataFrame at /Users/randyzwitch/.julia/ODBC/src/backend.jl:147
 in query at /Users/randyzwitch/.julia/ODBC/src/userfacing.jl:57
 in query at /Users/randyzwitch/.julia/ODBC/src/userfacing.jl:43

I'm using 0.3.3 master

 - ODBC                          0.3.3              master
quinnj commented 10 years ago

Doh! I had taken out the UTF16 and UTF32 support while I was working on stuff and forgot to put them back in. I just pushed them back in, so if you're on master, a Pkg.update() should bring in the changes. If everything looks good, I'll bump the Metadata version again.

randyzwitch commented 10 years ago

New error this time


julia> ua_df = query("Select ua from leads limit 1000000;");
ERROR: BoundsError()
 in checkbounds at abstractarray.jl:76
 in checkbounds at abstractarray.jl:96
 in getindex at array.jl:312
 in ODBCCopy! at /Users/randyzwitch/.julia/ODBC/src/backend.jl:147
 in ODBCFetchDataFrame at /Users/randyzwitch/.julia/ODBC/src/backend.jl:147
 in query at /Users/randyzwitch/.julia/ODBC/src/userfacing.jl:57
 in query at /Users/randyzwitch/.julia/ODBC/src/userfacing.jl:43
quinnj commented 10 years ago

Hmmm......this one is stumping me, so we'll probably have to dig deeper. By chance though, I just pushed some other fixes that changed the syntax a bit, so let me know if you still see this. If so, I'll dig up some more detailed logging so we can get to the bottom of this. Sorry this is a hassle, but I don't know of a way to deal with UTF32 strings on my machine, so unfortunately you have to be my guinea pig for trying different ideas.

randyzwitch commented 10 years ago

No problem at all, glad to help out any way I can.

Unfortunately, still the same error this morning after updating master.

julia> ua_df = query("select distinct ua from leads limit 1000000;");
ERROR: BoundsError()
 in checkbounds at abstractarray.jl:76
 in checkbounds at abstractarray.jl:96
 in getindex at array.jl:312
 in ODBCCopy! at /Users/randyzwitch/.julia/ODBC/src/backend.jl:147
 in ODBCFetchDataFrame at /Users/randyzwitch/.julia/ODBC/src/backend.jl:147
 in query at /Users/randyzwitch/.julia/ODBC/src/userfacing.jl:57
 in query at /Users/randyzwitch/.julia/ODBC/src/userfacing.jl:43
quinnj commented 10 years ago

Alright, I think I may have figured it out. I found out a way to return data as UTF16 from my database so I could actually do some testing. Here's rundown of what was going (if not for future reference):

-I've been doing a lot of work lately on speeding up the resultset fetching process -Part of this involved using information returned from the database with each fetch, including what I thought was the # of characters per string -Unfortunately, it isn't the # of characters, it's the number of bytes, so that's why we were seeing the BoundsErrors() for UTF16 and UTF32 (at least I hope)

Anyway, hopefully this does it, let me know how it looks for you.

randyzwitch commented 10 years ago

Winner!

And btw, the work you've done on making the resultset faster has paid off. I don't have a benchmark on your prior code, but I ran this query:

select distinct ua from leads limit 1000000;

Julia processed the answer in 12.33 seconds (no Redshift caching). SQL Workbench (open source querying tool based on JVM) used at least 2m57s after caching (the timer stops when displaying, so I don't know the real timing). Additionally, Julia uses WAY less RAM to process the records; 400 MB or so for Julia, at least 1.12GB for SQL Workbench.

Awesome job @karbarcca!

quinnj commented 10 years ago

Great news! Glad it's working.

Yeah, I haven't really done much in terms of performance work, because it's been mainly a work of just getting everything to work (cross-platform!). Since things settled down a little, I just needed a push to work on it and #27 was the push. As I've dug into it, it's been pretty fascinating how easy it is to get good performance. The key is Julia's compatibility with C. Being able to map types so easily between the two as well as calling the API functions directly results in a very seamless integration; I would doubt any other language/application that's not C will probably have a much harder time getting such good performance (and not use as much memory as well).

Anyway, thanks for your help with this. Cheers.

randyzwitch commented 10 years ago

That's scary that there are potentially more improvements that can be made (if you haven't really worked on performance tuning). I ran the same query bumping to 10MM results, it only took 300s and SQL Workbench still hasn't returned the prior query results from before...

quinnj commented 10 years ago

Yeah, I'm not sure how much more there really is in terms of performance, but I think we're getting reasonably close. My goal for the performance tuning was avoiding copying data at all costs, which is the biggest pitfall I think. It will be interesting to see how the array views development works out because that could give some more speed bumps as we'll avoid the copying of getindex as well. I also think there's something a little more efficient than what we're currently doing with strings, but my post in the dev forum hasn't gotten any response.