Gmousse / dataframe-js

No Maintenance Intended
https://gmousse.gitbooks.io/dataframe-js/
MIT License
460 stars 38 forks source link

SQL of dataframe from csv treats all data as strings #76

Closed dwoliver closed 5 years ago

dwoliver commented 5 years ago

I am having a problem using SQL syntax to filter numerical values in a dataframe. The original dataframe is created from a csv file where the values are not represented as strings, but do appear in columns together with non-numeric values before being restructured and transposed to all numeric value columns. When I use the following SQL Tot_Travel_Time <= 11368 Results: 10234, 10689, 11368 The results indicate that 11368 is being interpreted as a string, since the range of values in the column also includes (7143,8510,8818,9295).
Likewise using Tot_Travel_Time >= 11368 Results: 7143,8510,8818,9295 Additionally, this SQL filter: dtfr.sql.request(SELECT * FROM tmp WHERE Tot_Travel_Time >= 8738 AND Tot_Travel_Time <= 11368); returns 7143,8510,8818,9295 but leaves out all values over 10,000 again indicating a string data type treatment. Another example of the problem is that these two Where clauses

Tot_Travel_Time IN (8818, 11368)
Tot_Travel_Time IN ('8818', '11368')

return the same result.
I cast the column to Number prior to registering the table df.cast('Tot_Travel_Time', Number) , but it still appears to be treated as a string. Is this a bug or I am doing something wrong?

Gmousse commented 5 years ago

Hi @dwoliver,

Thank you for your report. I will realize few tests to find the bug.

Also consider to use the js api which is far more stable and optimized than the sql module. Indeed, the sql module will be deprecated soon (by lack of time to maintain it and by lack of feedbacks / users).

Stay tuned.

dwoliver commented 5 years ago

Guillaume,

Thanks for your quick response. Could you send me an example of using the .js api for a more complex filter, with combined greater than and less than on two different columns from the dataframe? The syntax for this was not obvious to me.

Thanks,

David

On Mon, Mar 11, 2019 at 12:53 PM Guillaume Mousnier < notifications@github.com> wrote:

Hi @dwoliver https://github.com/dwoliver,

Thank you for your report. I will realize few tests to find the bug.

Also consider to use the js api which is far more stable and optimized than the sql module. Indeed, the sql module will be deprecated soon (by lack of time to maintain it and by lack of feedbacks / users).

Stay tuned.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Gmousse/dataframe-js/issues/76#issuecomment-471625628, or mute the thread https://github.com/notifications/unsubscribe-auth/AQP9Px1hP1ewwe57WAGKkR2gpb7wrdMnks5vVooigaJpZM4bm7Nt .

-- David Oliver Oliver Systems, LLC 366 6th Street, #2 Brooklyn, NY 11215 718 360-8285 oliversystems.net

Gmousse commented 5 years ago

You are welcome.

I have detected the bug. It's related with the query parser which doesn't type correctly numbers when they are used in the filters. e.g: SELECT * FROM tmp WHERE column1 >= 2 AND column1 <= 1138. I work on it.

Here few examples about how use the filters: df1.filter( row => row.get("Tot_Travel_Time") >= 8738 && row.get("Tot_Travel_Time") <= 11368 ) Equivalent for Tot_Travel_Time >= 8738 AND Tot_Travel_Time <= 11368

df1.filter( row => [8818, 11368].includes(row.get("Tot_Travel_Time"))) Equivalent for Tot_Travel_Time IN (8818, 11368)

Gmousse commented 5 years ago

@dwoliver Can you test your code with the following branch ? feature/fix/issue-76

npm install git+https://github.com/Gmousse/dataframe-js.git#feature/fix/issue-76

dwoliver commented 5 years ago

Guillaume,

I am not in my office, but I will be able to try this later today (i'm in New York), and I will report back.

Thanks,

David

On Mon, Mar 11, 2019 at 1:33 PM Guillaume Mousnier notifications@github.com wrote:

@dwoliver https://github.com/dwoliver Can you test your code with the following branch ? feature/fix/issue-76

npm install git+ https://github.com/Gmousse/dataframe-js.git#feature/fix/issue-76

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Gmousse/dataframe-js/issues/76#issuecomment-471642132, or mute the thread https://github.com/notifications/unsubscribe-auth/AQP9P27tld_2GIxTKzdsD-q5FWDplWbkks5vVpNagaJpZM4bm7Nt .

-- David Oliver Oliver Systems, LLC 366 6th Street, #2 Brooklyn, NY 11215 718 360-8285 oliversystems.net

Gmousse commented 5 years ago

Merged into develop. WIll be released in 1.4.0

dwoliver commented 5 years ago

Guillaume,

Thanks for your guidance on the use of the filter option. I am interested in using the function style option from your first example, but I need to be able to generate complex filters dynamically and I am not clear about how to generate the row.get() statements dynamically. This was one reason why I decided to use the SQL option, because the WHERE clause is a string and it was easy to generate complex conditions dynamically. Any guidance you could offer would be much appreciated.

David

On Mon, Mar 11, 2019 at 1:22 PM Guillaume Mousnier notifications@github.com wrote:

You are welcome.

I have detected the bug. It's related with the query parser which doesn't type correctly numbers when they are used in the filters. e.g: SELECT * FROM tmp WHERE column1 >= 2 AND column1 <= 1138. I work on it.

Here few examples about how use the filters: df1.filter( row => row.get("Tot_Travel_Time") >= 8738 && row.get("Tot_Travel_Time") <= 11368 ) Equivalent for Tot_Travel_Time >= 8738 AND Tot_Travel_Time <= 11368

df1.filter( row => [8818, 11368].includes(row.get("Tot_Travel_Time"))) Equivalent for Tot_Travel_Time IN (8818, 11368)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Gmousse/dataframe-js/issues/76#issuecomment-471637646, or mute the thread https://github.com/notifications/unsubscribe-auth/AQP9Pz8xokdRqU9PA3HMOmvSpw665aoQks5vVpDUgaJpZM4bm7Nt .

-- David Oliver Oliver Systems, LLC 366 6th Street, #2 Brooklyn, NY 11215 718 360-8285 oliversystems.net

Gmousse commented 5 years ago

@dwoliver Thank you too for the feedback and the bug reporting.

Don't hesitate to send me questions (by email, on twitter or even on issues).

I will try to publish more example about complex usages in order to answer your questions.

dwoliver commented 5 years ago

Guillaume,

Here is a quick question: Continuing with my example from previous emails: Case 1: I need to filter a dataframe column for a specific value df1.filter( row => row.get("Tot_Travel_Time") = 9458)

Case 2: I need to filter a dataframe column for a range of values df1.filter( row => row.get("Tot_Travel_Time") >= 8738 && row.get("Tot_Travel_Time") <= 11368 )

How can I set up code that will be able to pass each of those different filter statements( and many others) to the same df.filter() call. When using the SQL option I was able to construct the WHERE clause conditionally and pass this dynamically generated WHERE clause string for use in the same SELECT statement in my code. This may be more of a general javascript question but it is certainly one that other users would benefit from as it related to your filter feature.

Thanks,

David

On Sat, Mar 23, 2019 at 8:52 AM Guillaume Mousnier notifications@github.com wrote:

@dwoliver https://github.com/dwoliver Thank you too for the feedback and the bug reporting.

Don't hesitate to send me questions (by email, on twitter or even on issues).

I will try to publish more example about complex usages in order to answer your questions.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Gmousse/dataframe-js/issues/76#issuecomment-475867331, or mute the thread https://github.com/notifications/unsubscribe-auth/AQP9Pxpc1R4XPjXOssenOzrI3SUOQnvDks5vZiOcgaJpZM4bm7Nt .

-- David Oliver Oliver Systems, LLC 366 6th Street, #2 Brooklyn, NY 11215 718 360-8285 oliversystems.net

Gmousse commented 5 years ago

Hi @dwoliver,

I don't know if you are familiar with javascript but you can build it easily by using functions.

Few examples:

const PREDICATES = [
    row => row.get("Tot_Travel_Time") === 9458,
    row => row.get("Tot_Travel_Time") >= 8738,
    row => row.get("Tot_Travel_Time") <= 11368,
];

Then you have multiple choices:

// With a loop

df.filter(row => {
    for (const predicate of predicates) {
        if (!predicate(row)) {
             return false;
        }
    }
    return true;
})

Or:

// With reduce

df.filter(row => predicates.reduce((rowMatch, predicate) => rowMatch && predicate(row), true))

That was just few examples, you can do this by many many ways.

Guillaume

dwoliver commented 5 years ago

Guillaume, Thanks for your response. Your example does not address my situation, which I did not explain clearly. My application takes a table of data and sets up a dataframe dynamically based on the table which can have any number of columns. Then the user is given options to filter the table with equalities or ranges. So the method of setting up every condition in advance is not practical. I will have an array of column names and and array of column conditions, set by the user. How can I walk those areas and build the filter function dynamically in response to each user request. My lack of JavaScript experience leaves me not knowing how to dynamically generate the function statement for row.get(), which may have a long list of && elements. Thanks, David

On Mar 25, 2019 3:24 AM, "Guillaume Mousnier" notifications@github.com wrote:

Hi @dwoliver https://github.com/dwoliver,

I don't know if you are familiar with javascript but you can build it easily by using functions.

Few examples:

const PREDICATES = [ row => row.get("Tot_Travel_Time") === 9458, row => row.get("Tot_Travel_Time") >= 8738, row => row.get("Tot_Travel_Time") <= 11368, ];

Then you have multiple choices:

// With a loop df.filter(row => { for (const predicate of predicates) { if (!predicate(row)) { return false; } } return true; })

Or:

// With reduce df.filter(row => predicates.reduce((rowMatch, predicate) => rowMatch && predicate(row), true))

That was just few examples, you can do this by many many ways.

Guillaume

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Gmousse/dataframe-js/issues/76#issuecomment-476082619, or mute the thread https://github.com/notifications/unsubscribe-auth/AQP9P1iDwyZHNYeAQNXjO7may06qNEoqks5vaHmogaJpZM4bm7Nt .

Gmousse commented 5 years ago

Hi,

I understand your need. It's not quite difficult. One easy way is to create handlers (e.g. dictionnary of functions) which basically translate user interactions into js code.

Here a fast (and minimalist) example about user interactions:

index.html

<html>
    <head>
        <script src="https://gmousse.github.io/dataframe-js/dist/dataframe.js"></script>
        <style>
            #app {
                display: flex;
                flex-direction: column;
            }

            table,
            td {
                border: 1px solid #333;
            }

            thead,
            tfoot {
                background-color: #333;
                color: #fff;
            }

            #table {
                min-width: 40em;
            }
        </style>
    </head>
    <body>
        <h1>My titanic table application</h1>
        <section id="app">
            <form id="controls">
                <label for="controls__select__class">Class</label>
                <select id="controls__select__class" autocomplete="off">
                    <option value="all" selected>All</option>
                    <option value="1st class">First class</option>
                    <option value="2nd class">Second class</option>
                    <option value="3rd class">Third class</option>
                </select>
                <label for="controls__select__age">Age</label>
                <select id="controls__select__age" autocomplete="off">
                    <option value="all" selected>All</option>
                    <option value="adults">Adult</option>
                    <option value="child">Child</option>
                </select>
                <label for="controls__select__gender">Gender</label>
                <select id="controls__select__gender" autocomplete="off">
                    <option value="all" selected>All</option>
                    <option value="man">Man</option>
                    <option value="women">Woman</option>
                </select>
                <button id="controls__submit" type="submit">
                    Submit filters
                </button>
            </form>

            <table id="table">
                <thead id="table__header"></thead>
                <tbody id="table__body"></tbody>
            </table>
        </section>
        <script>
            const DataFrame = dfjs.DataFrame;
            DataFrame.fromCSV(
                "https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanic.csv"
            ).then(titanicDf => {
                titanicDf = titanicDf.rename("", "id");

                function renderColumns(df) {
                    const tableHeader = document.getElementById(
                        "table__header"
                    );
                    const tableColumns = document.createElement("tr");

                    df.listColumns().forEach(column => {
                        const colElement = document.createElement("th");
                        const colText = document.createTextNode(column);
                        colElement.appendChild(colText);
                        tableColumns.appendChild(colElement);
                    });

                    tableHeader.appendChild(tableColumns);
                }

                function renderRows(df) {
                    const tableBody = document.getElementById("table__body");
                    while (tableBody.firstChild) {
                        tableBody.removeChild(tableBody.firstChild);
                    }
                    df.toArray().forEach(row => {
                        const rowElement = document.createElement("tr");
                        row.forEach(cell => {
                            const cellElement = document.createElement("td");
                            const cellText = document.createTextNode(cell);
                            cellElement.appendChild(cellText);
                            rowElement.appendChild(cellElement);
                        });
                        tableBody.appendChild(rowElement);
                    });
                }

                function createBaseHandler(elementId, columnName) {
                    return row => {
                        const value = document.getElementById(elementId).value;
                        if (value === "all") return true;
                        return row.get(columnName) === value;
                    };
                }

                const FILTER_HANDLERS = [
                    createBaseHandler("controls__select__class", "class"),
                    createBaseHandler("controls__select__age", "age"),
                    createBaseHandler("controls__select__gender", "sex")
                ];

                document.getElementById("controls").onsubmit = function(event) {
                    event.preventDefault();
                    renderRows(
                        titanicDf.filter(row => {
                            for (const handler of FILTER_HANDLERS) {
                                if (!handler(row)) return false;
                            }
                            return true;
                        })
                    );
                };

                renderColumns(titanicDf);
                renderRows(titanicDf);
            });
        </script>
    </body>
</html>

The idea is:

Guillaume

PS: You can try the example in your browser.

dwoliver commented 5 years ago

Guillaume,

Thanks for the detailed example. I will look it over and try to implement for my project and then report back.

David

On Mon, Mar 25, 2019 at 6:14 PM Guillaume Mousnier notifications@github.com wrote:

Hi,

I understand your need. It's not quite difficult. One easy way is to create handlers (e.g. dictionnary of functions) which basically translate user interactions into js code.

Here a fast (and minimalist) example about user interactions:

index.html

My titanic table application

The idea is:

  • A user make interactions
  • You trigger handlers for each interactions (in order to translate in predicates)
  • You combine the predicates and you apply them on the data
  • You rerender

Guillaume

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Gmousse/dataframe-js/issues/76#issuecomment-476398172, or mute the thread https://github.com/notifications/unsubscribe-auth/AQP9P2PczbdI4EsB8jg1gmU0hkwoHAMBks5vaUpMgaJpZM4bm7Nt .

-- David Oliver Oliver Systems, LLC 366 6th Street, #2 Brooklyn, NY 11215 718 360-8285 oliversystems.net