Closed: randyzwitch closed this issue 10 years ago.
Are you tracking ODBC master? (i.e., what's the printout of Pkg.status()?) The UTF16 package was recently merged into Base, so I've been tweaking things to make sure ODBC keeps up with that. I may have broken something along the way, though.
Updated; I still have the error.
julia> Pkg.status()
15 required packages:
- AWS 0.0.0
- Clustering 0.2.3
- DataFrames 0.4.1
- GLM 0.2.2
- Gadfly 0.1.26
- Homebrew 0.0.4
- IJulia 0.1.0
- LazySequences 0.1.0
- ODBC 0.3.1
- PyCall 0.2.1
- RDatasets 0.0.2
- Stats 0.2.8
- TextAnalysis 0.0.1
- Vega 0.0.0
- YAML 0.1.6
28 additional packages:
- BinDeps 0.2.12
- Blocks 0.0.0
- Calendar 0.4.0
- Codecs 0.1.0
- Color 0.2.6
- Compose 0.1.20
- DataArrays 0.0.1
- Datetime 0.1.2
- Distance 0.2.4
- Distributions 0.2.11
- GZip 0.2.5
- GeoIP 0.0.0+ master (dirty)
- HTTPClient 0.0.0
- ICU 0.2.3
- Iterators 0.1.1
- JSON 0.3.1
- Languages 0.0.1
- LibCURL 0.0.0
- LibExpat 0.0.0
- Loess 0.0.1
- Nettle 0.1.2
- NumericExtensions 0.2.18
- REPLCompletions 0.0.0
- SortingAlgorithms 0.0.1
- UAParser 0.0.0- master (unregistered)
- URIParser 0.0.0
- UTF16 0.3.0
- ZMQ 0.1.7
julia> Pkg.rm("ODBC")
INFO: Removing ODBC v0.3.1
INFO: Removing UTF16 v0.3.0
INFO: REQUIRE updated
julia> Pkg.clone("https://github.com/karbarcca/ODBC.jl.git")
INFO: Cloning ODBC from https://github.com/karbarcca/ODBC.jl.git
INFO: Computing changes...
INFO: No packages to install, update or remove
julia> Pkg.update()
INFO: Updating METADATA...
INFO: Updating ODBC...
INFO: Updating UAParser...
INFO: Computing changes...
INFO: No packages to install, update or remove
julia> Pkg.status()
14 required packages:
- AWS 0.0.0
- Clustering 0.2.3
- DataFrames 0.4.1
- GLM 0.2.2
- Gadfly 0.1.26
- Homebrew 0.0.4
- IJulia 0.1.0
- LazySequences 0.1.0
- PyCall 0.2.1
- RDatasets 0.0.2
- Stats 0.2.8
- TextAnalysis 0.0.1
- Vega 0.0.0
- YAML 0.1.6
28 additional packages:
- BinDeps 0.2.12
- Blocks 0.0.0
- Calendar 0.4.0
- Codecs 0.1.0
- Color 0.2.6
- Compose 0.1.20
- DataArrays 0.0.1
- Datetime 0.1.2
- Distance 0.2.4
- Distributions 0.2.11
- GZip 0.2.5
- GeoIP 0.0.0+ master (dirty)
- HTTPClient 0.0.0
- ICU 0.2.3
- Iterators 0.1.1
- JSON 0.3.1
- Languages 0.0.1
- LibCURL 0.0.0
- LibExpat 0.0.0
- Loess 0.0.1
- Nettle 0.1.2
- NumericExtensions 0.2.18
- ODBC 0.3.1+ master
- REPLCompletions 0.0.0
- SortingAlgorithms 0.0.1
- UAParser 0.0.0- master (unregistered)
- URIParser 0.0.0
- ZMQ 0.1.7
Sorry to be slow on this, Randy. Yeah, this is a tricky issue: without knowing exactly which characters are causing the invalid sequence, it's hard to tell whether it's an ODBC problem or a data-quality problem. Anyway, I pushed a fix that hopefully doesn't barf when returning results. In my limited testing, I get the \ufffd "replacement" character when trying to convert invalid characters to UTF-8. Let me know how it looks/works for you.
Sorry to be the bearer of bad news. It seems like this change has broken the results-allocation fix from issue #26. I'll put some detail in that issue.
Let me know if you see the fix @randyzwitch, I've pushed quite a few changes recently to speed up/fix the result fetching process, so hopefully things are better now.
Thanks for taking a look at this Jacob, no worries on the timing! I'll play around with this and let you know what I find.
I got an error:
julia> ua_df = query("Select distinct ua from leads limit 1000000;");
ERROR: no method unsafe_copy!(Ptr{UTF8String}, Ptr{Uint32}, Int64)
in ODBCCopy! at /Users/randyzwitch/.julia/ODBC/src/backend.jl:129
in ODBCFetchDataFrame at /Users/randyzwitch/.julia/ODBC/src/backend.jl:147
in query at /Users/randyzwitch/.julia/ODBC/src/userfacing.jl:57
in query at /Users/randyzwitch/.julia/ODBC/src/userfacing.jl:43
I'm using 0.3.3 master
- ODBC 0.3.3 master
Doh! I had taken out the UTF16 and UTF32 support while I was working on other things and forgot to put it back in. I just pushed it back in, so if you're on master, a Pkg.update() should bring in the changes. If everything looks good, I'll bump the METADATA version again.
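If it helps, the sequence for tracking master looks roughly like this (Julia 0.2/0.3-era Pkg commands, matching the ones used earlier in the thread; check your Pkg version if these differ):

Pkg.checkout("ODBC")   # track the master branch instead of the registered release
Pkg.update()           # pull the latest commits
Pkg.status()           # "ODBC 0.3.x+ master" should appear in the output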
New error this time:
julia> ua_df = query("Select ua from leads limit 1000000;");
ERROR: BoundsError()
in checkbounds at abstractarray.jl:76
in checkbounds at abstractarray.jl:96
in getindex at array.jl:312
in ODBCCopy! at /Users/randyzwitch/.julia/ODBC/src/backend.jl:147
in ODBCFetchDataFrame at /Users/randyzwitch/.julia/ODBC/src/backend.jl:147
in query at /Users/randyzwitch/.julia/ODBC/src/userfacing.jl:57
in query at /Users/randyzwitch/.julia/ODBC/src/userfacing.jl:43
Hmmm... this one is stumping me, so we'll probably have to dig deeper. By chance though, I just pushed some other fixes that changed the syntax a bit, so let me know if you still see this. If so, I'll dig up some more detailed logging so we can get to the bottom of it. Sorry this is a hassle, but I don't know of a way to deal with UTF32 strings on my machine, so unfortunately you'll have to be my guinea pig for trying different ideas.
No problem at all, glad to help out any way I can.
Unfortunately, still the same error this morning after updating master.
julia> ua_df = query("select distinct ua from leads limit 1000000;");
ERROR: BoundsError()
in checkbounds at abstractarray.jl:76
in checkbounds at abstractarray.jl:96
in getindex at array.jl:312
in ODBCCopy! at /Users/randyzwitch/.julia/ODBC/src/backend.jl:147
in ODBCFetchDataFrame at /Users/randyzwitch/.julia/ODBC/src/backend.jl:147
in query at /Users/randyzwitch/.julia/ODBC/src/userfacing.jl:57
in query at /Users/randyzwitch/.julia/ODBC/src/userfacing.jl:43
Alright, I think I may have figured it out. I found a way to return data as UTF16 from my database, so I could actually do some testing. Here's a rundown of what was going on (for future reference):
- I've been doing a lot of work lately on speeding up the resultset-fetching process.
- Part of this involved using information returned from the database with each fetch, including what I thought was the number of characters per string.
- Unfortunately, it isn't the number of characters, it's the number of bytes, which is why we were seeing the BoundsError()s for UTF16 and UTF32 (at least I hope); see the sketch below.
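To make the bytes-versus-characters distinction concrete, a small hypothetical sketch (current Julia syntax, not the actual ODBC.jl buffer code):

# A wide-character column reports its length in *bytes*, but each UTF-16 code unit
# is 2 bytes (UTF-32 units are 4), so treating the byte count as an element count
# over-indexes the destination buffer and triggers a BoundsError.
s = "héllo"                              # 5 characters, no surrogate pairs
nbytes = 2 * length(s)                   # 10 bytes when encoded as UTF-16
buf = Vector{UInt16}(undef, length(s))   # room for 5 code units
# Indexing buf[1:nbytes] would walk past the end of buf; dividing by the
# code-unit size gives the correct element count instead:
nunits = nbytes ÷ sizeof(UInt16)         # 5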
Anyway, hopefully this does it, let me know how it looks for you.
Winner!
And btw, the work you've done on making resultset fetching faster has paid off. I don't have a benchmark on your prior code, but I ran this query:
select distinct ua from leads limit 1000000;
Julia processed the result in 12.33 seconds (no Redshift caching). SQL Workbench (an open-source, JVM-based query tool) took at least 2m57s after caching (the timer stops once results start displaying, so I don't know the real timing). Additionally, Julia uses WAY less RAM to process the records: roughly 400 MB for Julia versus at least 1.12 GB for SQL Workbench.
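For anyone who wants to reproduce a rough timing like this, a minimal sketch (it assumes an ODBC connection is already open, as earlier in this thread; @time is standard Julia):

# @time prints elapsed time and bytes allocated alongside the result.
@time ua_df = query("select distinct ua from leads limit 1000000;");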
Awesome job @karbarcca!
Great news! Glad it's working.
Yeah, I haven't really done much in terms of performance work, because until now it's mainly been a matter of getting everything to work (cross-platform!). Since things have settled down a little, I just needed a push to work on it, and #27 was the push. As I've dug into it, it's been pretty fascinating how easy it is to get good performance. The key is Julia's compatibility with C: being able to map types so easily between the two, as well as calling the API functions directly, makes for a very seamless integration. I suspect any other language/application that isn't C will have a much harder time getting this kind of performance (or keeping memory use this low).
Anyway, thanks for your help with this. Cheers.
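As a hedged illustration of that C interop (the library name and handle below are placeholders, not the actual ODBC.jl internals; current Julia syntax), calling an ODBC driver-manager function directly looks roughly like this:

# SQLFetch in the ODBC C API takes an opaque statement handle and returns a
# 16-bit status code (SQL_SUCCESS == 0); ccall maps those C types directly.
const odbc_lib = "libodbc"   # e.g. "odbc32" on Windows, "libiodbc" on OS X
sql_fetch(stmt::Ptr{Cvoid}) = ccall((:SQLFetch, odbc_lib), Int16, (Ptr{Cvoid},), stmt)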
It's scary that there are potentially more improvements to be made (if you haven't really worked on performance tuning yet). I ran the same query bumped to 10MM results; it took only 300s, and SQL Workbench still hasn't returned the results of the earlier query...
Yeah, I'm not sure how much more there really is in terms of performance, but I think we're getting reasonably close. My goal for the performance tuning was to avoid copying data at all costs, which I think is the biggest pitfall. It will be interesting to see how the array-views development works out, because that could give another speed bump by avoiding the copy that getindex makes as well. I also think there's something a little more efficient than what we're currently doing with strings, but my post on the dev forum hasn't gotten any response.
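For context, the copy-avoidance point in current Julia terms (a small illustration, not ODBC.jl code):

a = rand(10^6)
b = a[1:1000]        # getindex on a range allocates and copies the slice
v = view(a, 1:1000)  # a SubArray: no copy, reads go straight to a's memory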
Sorry, today is "cause @karbarcca heartburn" day...
I'm getting UTF-8 errors; I suspect there is malformed data in this field, but it's causing Julia to choke: