drizk1 opened 5 months ago
Hey @kdpsingh, I know you're super busy, but real quick: I've been on a bug-finding mission, using increasingly complex queries to shake out bugs. I've been able to replicate nearly all of the answers on this site I found, and I now have a Postgres parser that I have been testing here, and it works (with the ability to switch back and forth). I was wondering if you might have an idea of a place with more complex queries that I can use to keep bug-hunting and testing.
thank you!
Here is an example of one of the more involved queries I've been able to replicate:
```julia
@chain start_query_meta(db, :athlete_events) begin
    @rename(tf = NOC)
    @mutate(Gold = if_else(Medal == "Gold", 1, 0),
            Silver = if_else(Medal == "Silver", 1, 0),
            Bronze = if_else(Medal == "Bronze", 1, 0))
    @mutate(total_medals = Gold + Silver + Bronze)
    @left_join(:noc_regions, NOC, tf)
    @group_by(region, Sport)
    @summarize(total_sum = sum(total_medals))
    @filter(region == "India")
    @slice_max(total_sum)
    # @show_query
    @collect
end
```
```sql
WITH cte_1 AS (
    SELECT ID, Name, Sex, Age, Height, Weight, Team, NOC AS tf, Games, Year, Season, City, Sport, Event, Medal
    FROM athlete_events),
cte_3 AS (
    SELECT ID, Name, Sex, Age, Height, Weight, Team, tf, Games, Year, Season, City, Sport, Event, Medal, CASE WHEN Medal = 'Gold' THEN 1 ELSE 0 END AS Gold, CASE WHEN Medal = 'Silver' THEN 1 ELSE 0 END AS Silver, CASE WHEN Medal = 'Bronze' THEN 1 ELSE 0 END AS Bronze
    FROM cte_1),
cte_5 AS (
    SELECT ID, Name, Sex, Age, Height, Weight, Team, tf, Games, Year, Season, City, Sport, Event, Medal, Gold, Silver, Bronze, Gold + Silver + Bronze AS total_medals
    FROM cte_3),
cte_6 AS (
    SELECT cte_5.*, noc_regions.*
    FROM cte_5
    LEFT JOIN noc_regions ON noc_regions.NOC = cte_5.tf),
cte_7 AS (
    SELECT region, Sport, SUM(total_medals) AS total_sum
    FROM cte_6
    GROUP BY region, Sport
    HAVING region = 'India'),
cte_8 AS (
    SELECT *, RANK() OVER (
        ORDER BY total_sum DESC) AS rank_col
    FROM cte_7),
cte_9 AS (
    SELECT *
    FROM cte_8
    WHERE rank_col <= 1)
SELECT *
FROM cte_9
```
```
1×4 DataFrame
 Row │ region  Sport   total_sum  rank_col
     │ String  String  Int64      Int64
─────┼─────────────────────────────────────
   1 │ India   Hockey        173         1
```
This looks amazing!!! Congrats on all the progress. Will give this a look soon and see where we should go next. My sense is that all the ingredients are here to packagify this.
Thank you! Sounds great. I'm open to any direction. This has just been fun to try building.
I'll work on getting docstrings / documentation sorted
Just an FYI so you hopefully don't duplicate work and can avoid the joys of docstrings: I went ahead and got the docstrings actually running here. (I had to get rid of all the show_query examples and use collect for most examples because of hidden space characters I could not find.)
Thank you for moving this to TidierOrg - was just about to suggest this.
I have an idea for how to use this alongside TidierData for now that doesn't require a big rewrite of TidierData.
Of course. I'm curious what that will look like, because avoiding a rewrite of either would be nice.
I tried a few things, to no avail, to get them to dispatch nicely together (i.e., creating a third macro that auto-detects and then dispatches to the correct version).
I know it's not ideal, but I'd be open to having it be `@mutate.db`, for example, so that TidierDB and TidierData could be easily used in the same chains.
Here's what I would suggest. When we use TidierDB with TidierData, we shouldn't do `using TidierDB`. Instead, we should do `import TidierDB as DB`.

Then all of the TidierDB macros will be available as `DB.@mutate` and so on, while the TidierData equivalent remains `@mutate`.
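To make the suggestion concrete, here is a minimal sketch of how a mixed session might look. It is illustrative only, not runnable as-is: it assumes an open database connection `db`, the proposed renamed entry point `db_table`, and example table/column names from the query earlier in the thread.

```julia
# Hypothetical mixed usage of TidierDB and TidierData via a qualified import.
using TidierData            # exports @chain, @mutate, @filter, ... for DataFrames
import TidierDB as DB       # TidierDB macros are now qualified: DB.@mutate, DB.@filter, ...

# Push the heavy filtering into the database, then collect to a DataFrame.
df = @chain DB.db_table(db, :athlete_events) begin
    DB.@filter(Medal == "Gold")
    DB.@collect
end

# From here on, plain (unqualified) TidierData macros operate on the DataFrame.
result = @chain df begin
    @group_by(Sport)
    @summarize(n = n())
end
```

Because nothing is re-exported under the same name, there is no macro-name collision, and both packages can appear in the same `@chain` block.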
Also, can we rename `start_query_meta()` to `db_table()`, since it's pointing to a table in a database?
Yes, absolutely. I've been wanting to rename that because it was too long.
I love that import method. And it would then allow TidierData to be called as normal too? I think that's great.
Yes, TidierData would work as normal, so you could use them together.
Awesome. I think one of the last things then is getting the docs to build. I have all the files in place, but I'm not sure how to enable it to build
I haven't gotten the chance to play with it yet. Did you have to install SQLite and/or DuckDB to get the code to work? Or was installing the relevant Julia packages for those sufficient?
Just thinking about what the issue may be.
If you can't get it working, I'll look at it tonight or tomorrow.
Simpler question: do the docs build locally?
I think they build locally. I changed the default to DuckDB instead of SQLite, and I export two functions for opening the database and connecting. (Using other backends means the user has to write, e.g., `using ClickHouse` themselves.)
I think this has something to do with the package setup (i.e., GitHub Actions; I didn't set it up for Documenter), because there are only two green checks when pushing new commits, not three like other packages.
Ok let me check on this - will get to it tonight or tomorrow at the latest.
Once the docs build, we can get this on the way to the registry.
Also I would go ahead and add the Arrow dependency to make it work with DuckDB.
I'm also considering allowing TidierData to work directly with Arrow but that will require a bit of work to ensure that we don't fully instantiate Arrow tables as DataFrames.
This way via DuckDB might be a nice alternative.
Sweet. I'll get that added a little later tonight and change the name to `db_table`.
I realized the documenter.yml workflow just needs to be added, and then it should work. So I should be able to get both this and TidierFiles working tonight.
There's one setting you'll have to change in Settings to make the documentation actually show up. If tests pass, happy to help.
So I have reached a bit of an impasse. I have been unable to get the docs to build and keep getting this error. I have tried various things, including commenting out all the code in the docs so that nothing runs, and no matter what I get this error. Is it something you have ever experienced?
```
ERROR: LoadError: AssertionError: length(exs) == 2 * length(results)
Stacktrace:
  [1] parseblock(code::String, doc::Documenter.Documents.Document, file::String; skip::Int64, keywords::Bool, raise::Bool, linenumbernode::LineNumberNode)
    @ Documenter.Utilities ~/.julia/packages/Documenter/bYYzK/src/Utilities/Utilities.jl:156
  [2] parseblock
    @ ~/.julia/packages/Documenter/bYYzK/src/Utilities/Utilities.jl:123 [inlined]
  [3] #parseblock#1
    @ ~/.julia/packages/Documenter/bYYzK/src/Documents.jl:72 [inlined]
  [4] runner(::Type{Documenter.Expanders.ExampleBlocks}, x::Markdown.Code, page::Documenter.Documents.Page, doc::Documenter.Documents.Document)
    @ Documenter.Expanders ~/.julia/packages/Documenter/bYYzK/src/Expanders.jl:625
  [5] dispatch(::Type{Documenter.Expanders.ExpanderPipeline}, ::Markdown.Code, ::Vararg{Any})
    @ Documenter.Utilities.Selectors ~/.julia/packages/Documenter/bYYzK/src/Utilities/Selectors.jl:170
  [6] expand(doc::Documenter.Documents.Document)
    @ Documenter.Expanders ~/.julia/packages/Documenter/bYYzK/src/Expanders.jl:42
  [7] runner(::Type{Documenter.Builder.ExpandTemplates}, doc::Documenter.Documents.Document)
    @ Documenter.Builder ~/.julia/packages/Documenter/bYYzK/src/Builder.jl:226
  [8] dispatch(::Type{Documenter.Builder.DocumentPipeline}, x::Documenter.Documents.Document)
    @ Documenter.Utilities.Selectors ~/.julia/packages/Documenter/bYYzK/src/Utilities/Selectors.jl:170
  [9] #2
    @ ~/.julia/packages/Documenter/bYYzK/src/Documenter.jl:273 [inlined]
 [10] cd(f::Documenter.var"#2#3"{Documenter.Documents.Document}, dir::String)
    @ Base.Filesystem ./file.jl:112
 [11] #makedocs#1
    @ ~/.julia/packages/Documenter/bYYzK/src/Documenter.jl:272 [inlined]
 [12] top-level scope
    @ ~/work/TidierDB.jl/TidierDB.jl/docs/make.jl:11
```
I haven't experienced this exact error but let me investigate today.
Sounds good. The docs seem to run locally on my end, so I'm curious what the issue is.
```
julia> include("make.jl")
┌ Warning: DocTestSetup already set for module TidierDB. Overwriting.
└ @ Documenter.DocMeta ~/.julia/packages/Documenter/bYYzK/src/DocMeta.jl:81
[ Info: SetupBuildDirectory: setting up build directory.
[ Info: Doctest: skipped.
[ Info: ExpandTemplates: expanding markdown templates.
[ Info: CrossReferences: building cross-references.
[ Info: CheckDocument: running document checks.
[ Info: Populate: populating indices.
[ Info: RenderDocument: rendering document.
[ Info: MarkdownWriter: rendering Markdown pages.
┌ Warning: Documenter could not auto-detect the building environment Skipping deployment.
└ @ Documenter ~/.julia/packages/Documenter/bYYzK/src/deployconfig.jl:75
LoadError: AssertionError: length(exs) == 2 * length(results)
```
I've had this error; it seems to happen when there is an example that does not output anything.
That's exactly what the issue was. Now fixed.
This is a roadmap for what is to come and for what others might want to contribute if they feel so inclined.
Macros

- `mutate`
- `select`
- `arrange`
- `distinct`
- `filter`
- `slice_min`, `slice_max`, `@slice_sample`
- `group_by`
- `count`
- `summarize`/`summarise`
- `left_join` (will try to sort out syntax differences from Tidier.jl), `right_join`, `inner_join`
- `rename`
- `window_order`, `window_frame`
- `relocate`
- `fill`
- `pivot_longer` and `pivot_wider`
- `collect`
- `show_query`

Functions

- `expr_to_sql` to use `MacroTools.postwalk`
- `start_query_meta` (allows a query to begin and stores metadata to enable tidy selection)
- `case_when`, `if_else`
- `starts_with`, `ends_with`, `contains`, `str_detect`
- `is_missing()`
- `replace_missing` and `missing_if`
- `as_integer`, `as_float`, `as_string`
- `MAX`, `MIN`, `SUM`, `AVG`, `COUNT` as `n()` in `summarize`
- `across` in `summarize`
- `across` in `mutate`

For the Postgres/DuckDB-specific parser

- `year`, `month`, `day`, `hour`, `min`, `second`, `floor_date`
- `difftime` (needed for Postgres and MySQL)
- `str_detect`, `str_replace`/`_all`, `str_remove`/`_all`
- `corr`, `cov`, `std`, `var`
- `collect`/`start_query_meta` for Postgres/DuckDB
- `copy_to` (added support for SQLite and DuckDB)
- `connect()`
- For DuckDB, directly read in paths/https without a local intermediary

Missing something? Drop a comment or open a new issue.
Backends currently supported:

- DuckDB
- SQLite
- MySQL (MariaDB supported when using MySQL.jl too)
- Postgres
- MSSQL
- ClickHouse
- Google Big Query
- AWS
- Oracle
- Snowflake
- Databricks

Up next:
- Oracle via Oracle.jl? vs. ODBC.jl? This discussion suggests it could work for both this and Redshift, but I need to test.
- Hive/Impala?
- Druid? The Druid.jl library hasn't changed much in years, but it should work? There is no current or intended maintenance per the author; it could be done through the REST API if there is interest.
- Redshift? It looks like libpq may enable connecting to Redshift, so this could be pretty straightforward vs. ODBC.jl; in either case it would mean just adding a new parser.
- Spark would be nice, but using Spark.jl requires folks to add things from the terminal and may lead to build issues for the general user, so I'm a bit apprehensive to add it as a dependency.