Roadmap to whats next - Githubissues

drizk1 commented 5 months ago

This is a map as guide for what is to come and what others might want to contribute if they feel so inclined.

Macros

[x] mutate
[x] select
[x] arrange
[x] distinct
[x] filter
- autodetects when to use WHERE vs HAVING, even when receiving mixed multiple conditions that are not all having/where
[x] slice_min , slice_max, @slice_sample
[x] group_by
[x] count
[x] summarize / summarise
[x] left_join - will try to sort out syntax differences from tidier.jl.
[x] right_join
[x] inner_join
[x] rename
[x] window_order
[x] window_frame
[ ] relocate
[ ] fill
[ ] pivot_longer and pivot_wider
- pivots are going to be challenging. Unclear how to access values when data is in a database to build case whens...
[x] collect
[x] show_query

Functions

[x] rewrite sql parsing expr_to_sql to use Macrotools.postwalk
- realized @capture can't recognize function names w _ so i made a workaround function for this bug, still a few holdover parses to convert
[x] start_query_meta - allows query to begin and stores metadata to enable tidy selection
[x] case_when, if_else
[x] starts_with, ends_with, contains, str_detect
[x] is_missing()
[x] replace_missing and missing_if
[x] as_integer, as_float, as_string
[x] SQL aggregate functions: MAX, MIN, SUM, AVG, COUNT as n() in summarize
[x] universalize metadata df so that it works for postgres, sqlite and others as they get added.
[x] across in summarize
[x] across in mutate

For Postgres/DuckDB specific parser

[x] dates functions (lubridate)
- done: year, month, day, hour, min, second, floor_date
- difftime needed for postgres and mysql
[x] stringr functions
- done: str_detect, str_replace/_all, str_remove/_all
[x] stats aggregate functions
- done: corr, cov, std, var
[x] collect/start_query_meta for postgres/duckdb
[x] Fix MySQL date parsing Features
[x] tidy selection (works with select, group_by, distinct, across inside of summarize. Mutate and summarize add new columns and update column tracking.)
[x] documentation
[x] interpolation - so i fixed it again to avoid using eval, but it just requires a hidden context and an add function. it works, but its a few extra keystrokes for the user.
- implemented in most macros. works for vectors of variables as strings, or individual variables as strings, as well as values and inside of across. Need to add the escaped of tidier data
[x] docstrings
[x] copy_to added support for SQLite and DuckDB
[x] other connect functions?
- this is now done as connect()

For duckdb directly read in paths/ https without local intermediary

[x] .csv
[x] .parquet
[x] .json
[x] .arrow

missing something? drop a comment/new issue

Backends currently supported:

DuckDB
SQLite
MySQL (MariaDB supported when using MySQL.jl too)
Postgres
MSSQL
Clickhouse
Google Big Query
AWS
Oracle
Snowflake
databricks Up next:
Oracle via Oracle.jl? Vs ODBC.jl this discussion suggests it could work for both this and redshift. But need to test
Hive/Impala?
~~Druid? Druid.jl library hasnt changed much in years..but it should work?~~ no current or intended maintenance per the author. Could be done thru rest api if interest is there
Redshift? it looks like libpq may enable redshift connecting so this could be pretty straightforward vs ODBC.jl which for either case would mean just adding a new parser
Spark would be nice .. but using Spark.jl requires folks to add things from terminal and may lead to build issues for the general user.. so im a bit apprehensive to add it as a dependency

drizk1 commented 4 months ago

Hey @kdpsingh I know you're super busy, but real quick, I've been going on a bug finding mission trying to use increasingly complex queries to find bugs. I've been able to replicate nearly all of the answers on this site I found, and I now have a postgres parser that I have been testing here and it works (with ability to switch back and forth) but I was wondering if you might have an idea of place with more complex queries that I can use to further bug search and test.

thank you!

here as an example of one of the more involved queries ive been able to replicate.

@chain start_query_meta(db, :athlete_events) begin
    @rename(tf = NOC)
    @mutateGold = if_else(Medal == "Gold", 1, 0),
               Silver = if_else(Medal == "Silver", 1, 0),
               Bronze = if_else(Medal == "Bronze", 1, 0))
    @mutate(total_medals = Gold + Silver + Bronze)
    @left_join(:noc_regions, NOC, tf)
    @group_by(region, Sport)
    @summarize(total_sum = sum(total_medals))
    @filter(region == "India" )
    @slice_max(total_sum)
   # @show_query
    @collect
end

WITH cte_1 AS (
SELECT ID, Name, Sex, Age, Height, Weight, Team, NOC AS tf, Games, Year, Season, City, Sport, Event, Medal
        FROM athlete_events),
cte_3 AS (
SELECT  ID, Name, Sex, Age, Height, Weight, Team, tf, Games, Year, Season, City, Sport, Event, Medal, CASE WHEN Medal = 'Gold' THEN 1 ELSE 0 END AS Gold, CASE WHEN Medal = 'Silver' THEN 1 ELSE 0 END AS Silver, CASE WHEN Medal = 'Bronze' THEN 1 ELSE 0 END AS Bronze
        FROM cte_1),
cte_5 AS (
SELECT  ID, Name, Sex, Age, Height, Weight, Team, tf, Games, Year, Season, City, Sport, Event, Medal, Gold, Silver, Bronze, Gold + Silver + Bronze AS total_medals
        FROM cte_3),
cte_6 AS (
SELECT  cte_5.*, noc_regions.*
        FROM cte_5
        LEFT JOIN noc_regions ON noc_regions.NOC = cte_5.tf),
cte_7 AS (
SELECT region, Sport, SUM(total_medals) AS total_sum
        FROM cte_6
        GROUP BY region, Sport 
        HAVING region = 'India'),
cte_8 AS (
SELECT *, RANK() OVER (
        ORDER BY total_sum DESC) AS rank_col
        FROM cte_7),
cte_9 AS (
SELECT  *
        FROM cte_8
        WHERE rank_col <= 1)  
SELECT *
        FROM cte_9

1×4 DataFrame
 Row │ region  Sport   total_sum  rank_col 
     │ String  String  Int64      Int64    
─────┼─────────────────────────────────────
   1 │ India   Hockey        173         1

kdpsingh commented 4 months ago

This looks amazing!!! Congrats on all the progress. Will give this a look soon and see where we should go next. My sense is that all the ingredients are here to packagify this.

drizk1 commented 4 months ago

Thank you! Sounds great. I'm open to any direction. This has just been fun to try building

I'll work on getting docstrings / documentation sorted

drizk1 commented 3 months ago

just an fyi so u hopefully don't duplicate work and can avoid the joy of docstrings, but I went ahead and got the docstrings actually running here ( i had to get rid of all the show_query examples and use collect for most examples because of hidden space characters i could not find)

kdpsingh commented 3 months ago

Thank you for moving this to TidierOrg - was just about to suggest this.

I have an idea for how to use this alongside TidierData for now that doesn't require a big rewrite of TidierData.

drizk1 commented 3 months ago

Of course. I'm curious what that will look like, because avoiding a rewrite of either would be nice.

I tried a few things to no avail for getting them to dispatch nicely together. (Ie creating a third macro that auto detects and then dispatches the correct version. )

I know it's not ideal but I'd be open to having it be @mutate.db for example so that way TidierDB and TidierData could be easily used in the same chains.

kdpsingh commented 3 months ago

Here's what I would suggest. When we use TidierDB with TidierData, we shouldn't do using TidierDB. Instead, we should do import TidierDB as DB.

Then, all of the TidierDB macros will be available as DB.@mutate and so on, and the TidierData equivalent would be @mutate.

kdpsingh commented 3 months ago

Also, can we rename start_query_meta() to db_table() since it's pointing to a table in a database?

drizk1 commented 3 months ago

Yes absolutely I've been wanting to rename that because it was too long.

I love that import method. And then it would allow TidierData to be called as normal too? I think that's great

kdpsingh commented 3 months ago

Yes, TidierData would work as normal so could use them together.

drizk1 commented 3 months ago

Awesome. I think one of the last things then is getting the docs to build. I have all the files in place, but I'm not sure how to enable it to build

kdpsingh commented 3 months ago

I haven't gotten the chance to play with it yet. Did you have to install SQLite and/or DuckDB to get the code to work? Or was installing the relevant Julia packages for those sufficient?

Just thinking about what the issue may be.

If you can't get it working, I'll look at it tonight or tomorrow.

kdpsingh commented 3 months ago

Simpler question: do the docs build locally?

drizk1 commented 3 months ago

I think they build locally. I changed the default to duckdb instead of SQLite and export 2 functions for opening the db and connecting. (Using other databases means the user has to write in using using clickhouse)

I think this has something to do with the package set up (ie git actions and I didn't set up for documenter. ) because there are only 2 green checks not 3 like other packages when pushing the new commits

kdpsingh commented 3 months ago

Ok let me check on this - will get to it tonight or tomorrow at the latest.

Once the docs build, we can get this on the way to the registry.

kdpsingh commented 3 months ago

Also I would go ahead and add the Arrow dependency to make it work with DuckDB.

kdpsingh commented 3 months ago

I'm also considering allowing TidierData to work directly with Arrow but that will require a bit of work to ensure that we don't fully instantiate Arrow tables as DataFrames.

This way via DuckDB might be a nice alternative.

drizk1 commented 3 months ago

Sweet. I'll get that added a little later tn and change the name to db_table

drizk1 commented 3 months ago

I realized The documenter.yml just needs to be added and then it should work. So I should be able to get both this and tidierfiles working tn

kdpsingh commented 3 months ago

There's one setting you'll have to change in Settings to make the documentation actually show up. If tests pass, happy to help.

drizk1 commented 3 months ago

So i have reached a bit of an impasse.

I have been unable to get the docs to build and continue getting this error error

I have tried various things including making it so no code runs because everything is a comment in the docs, and no matter what i get this error. I am wondering if its something you have ever experienced?

ERROR: LoadError: AssertionError: length(exs) == 2 * length(results)
Stacktrace:
  [1] parseblock(code::String, doc::Documenter.Documents.Document, file::String; skip::Int64, keywords::Bool, raise::Bool, linenumbernode::LineNumberNode)
    @ Documenter.Utilities ~/.julia/packages/Documenter/bYYzK/src/Utilities/Utilities.jl:156
  [2] parseblock
    @ ~/.julia/packages/Documenter/bYYzK/src/Utilities/Utilities.jl:123 [inlined]
  [3] #parseblock#1
    @ ~/.julia/packages/Documenter/bYYzK/src/Documents.jl:72 [inlined]
  [4] runner(::Type{Documenter.Expanders.ExampleBlocks}, x::Markdown.Code, page::Documenter.Documents.Page, doc::Documenter.Documents.Document)
    @ Documenter.Expanders ~/.julia/packages/Documenter/bYYzK/src/Expanders.jl:625
  [5] dispatch(::Type{Documenter.Expanders.ExpanderPipeline}, ::Markdown.Code, ::Vararg{Any})
    @ Documenter.Utilities.Selectors ~/.julia/packages/Documenter/bYYzK/src/Utilities/Selectors.jl:170
  [6] expand(doc::Documenter.Documents.Document)
    @ Documenter.Expanders ~/.julia/packages/Documenter/bYYzK/src/Expanders.jl:42
  [7] runner(::Type{Documenter.Builder.ExpandTemplates}, doc::Documenter.Documents.Document)
    @ Documenter.Builder ~/.julia/packages/Documenter/bYYzK/src/Builder.jl:226
  [8] dispatch(::Type{Documenter.Builder.DocumentPipeline}, x::Documenter.Documents.Document)
    @ Documenter.Utilities.Selectors ~/.julia/packages/Documenter/bYYzK/src/Utilities/Selectors.jl:170
  [9] #2
    @ ~/.julia/packages/Documenter/bYYzK/src/Documenter.jl:273 [inlined]
 [10] cd(f::Documenter.var"#2#3"{Documenter.Documents.Document}, dir::String)
    @ Base.Filesystem ./file.jl:112
 [11] #makedocs#1
    @ ~/.julia/packages/Documenter/bYYzK/src/Documenter.jl:272 [inlined]
 [12] top-level scope
    @ ~/work/TidierDB.jl/TidierDB.jl/docs/make.jl:11

kdpsingh commented 3 months ago

I haven't experienced this exact error but let me investigate today.

drizk1 commented 3 months ago

sounds good, the docs seem to run locally on my end so curious what the issue is.

julia> include("make.jl")
┌ Warning: DocTestSetup already set for module TidierDB. Overwriting.
└ @ Documenter.DocMeta ~/.julia/packages/Documenter/bYYzK/src/DocMeta.jl:81
[ Info: SetupBuildDirectory: setting up build directory.
[ Info: Doctest: skipped.
[ Info: ExpandTemplates: expanding markdown templates.
[ Info: CrossReferences: building cross-references.
[ Info: CheckDocument: running document checks.
[ Info: Populate: populating indices.
[ Info: RenderDocument: rendering document.
[ Info: MarkdownWriter: rendering Markdown pages.
┌ Warning: Documenter could not auto-detect the building environment Skipping deployment.
└ @ Documenter ~/.julia/packages/Documenter/bYYzK/src/deployconfig.jl:75

rdboyes commented 3 months ago

LoadError: AssertionError: length(exs) == 2 * length(results)

I've had this error - it seems to happen when there is an example that does not output anything

kdpsingh commented 3 months ago

That's exactly what the issue was. Now fixed.

TidierOrg / TidierDB.jl

Roadmap to whats next #1