Gmousse / dataframe-js

https://gmousse.gitbooks.io/dataframe-js/
MIT License

[QUESTION] How to reference dataframe after creating it? #94

Closed: starlingfire closed this issue 5 years ago

starlingfire commented 5 years ago

How do I reference a dataframe after creating it? Is it possible to reference a dataframe outside of the promise that created it, and if so, how do I accomplish this?

Additional context: My goal is to create a dataframe, populate it from a CSV, and then be able to reference it outside of the original promise. Reading https://github.com/Gmousse/dataframe-js/issues/50 pointed me in the right direction for populating the dataframe and creating a filtered dataframe with a subset of the data:

var DataFrame = require('dataframe-js').DataFrame;

DataFrame.fromCSV('/opt/test-app/test.csv')
  .then(df => {
       const myFilteredDf = df.filter(row => row.get("cost") > 1).select("name","sport","cost");
       myFilteredDf.show(3);
  }); 

Yields:

| name      | sport     | cost      |
------------------------------------
| Bat       | baseball  | 100       |
| Baseball  | baseball  | 200       |
| Racket    | tennis    | 300       |

That's great, and I can work with that, but I also want to be able to pull a record from the dataframe elsewhere in my code--many times--without reading from the CSV each time. I tried to create a new filtered dataframe after the above code block, but it did not work.

var DataFrame = require('dataframe-js').DataFrame;

DataFrame.fromCSV('/opt/test-app/test.csv')
  .then(df => {
       const myFilteredDf = df.filter(row => row.get("cost") > 1).select("name","sport","cost");
       myFilteredDf.show(3);
  }); 
myOtherFilteredDf = df.filter(row => row.get("cost") > 1).select("name","sport","cost");
myOtherFilteredDf.show(3);

Yields:

ReferenceError: df is not defined

Is my goal feasible, and if so, can you please point me in the right direction in accomplishing it? Many thanks for this tool!

System details: I am running node v10.16.0, and am executing the script as

$ node df_test.js

Gmousse commented 5 years ago

Hi @starlingfire, thank you for using this library. Your question is not specific to the library; it's about how JavaScript itself works. I guess you are just starting out with JavaScript.

I will answer point by point.

First, about the error ReferenceError: df is not defined.

That's not related to the asynchronous call itself; it's a matter of variable scope. From outside your function, you are trying to access a variable (or a constant) that was declared inside it. That can't be done because the variable is out of scope there. The same holds in most other programming languages.

Example:

var x = "hello";
function myfunc () {
    var y = "world";
}

console.log(x); // prints "hello"
console.log(y); // throws ReferenceError: y is not defined

That's not a big deal, but you need to understand how it works. See https://developer.mozilla.org/fr/docs/Web/JavaScript/Reference/Instructions/var for example.
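To illustrate the scope rule above, here is a minimal sketch of one way to get a value out of a function: return it to the caller. The names `fromFunc` and `message` are made up for this example.

```javascript
function myfunc() {
    var y = "world"; // y is visible only inside myfunc
    return y;        // returning the value hands it to the caller
}

var x = "hello";
var fromFunc = myfunc();

var message = x + " " + fromFunc;
console.log(message);  // prints "hello world"
console.log(typeof y); // prints "undefined": y itself is still out of scope
```

The variable `y` never leaves the function; only its value does.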

Second, about the asynchronous calls using promises.

As you know, JavaScript handles I/O asynchronously. When you fetch an external resource, the call is non-blocking: it returns immediately and delivers the result later, once it is ready. Promises are basically the same thing as callbacks, but with an alternative API.

See https://exploringjs.com/es6/ch_promises.html or https://developer.mozilla.org/fr/docs/Web/JavaScript/Reference/Objets_globaux/Promise for example.

This means that code outside a promise handler or a callback can't safely rely on a variable that is assigned inside it, because the handler runs later, out of sync with the surrounding code.

Example:

let myresult = null;

fetch("https://jsonplaceholder.typicode.com/photos").then(
    response => response.json()
).then(
    result => {
         myresult = result;
         console.log("The result is coming, you are sure the result is ready: ", myresult);
    }
);
console.log("You're not sure the result is ready: ", myresult);

The second console.log (the one outside the promise chain) runs before the first, because the .then callbacks only run once the current synchronous code has finished and the response has arrived. It's basically how JavaScript works. Once you start working in an asynchronous context, you must stay in it to be safe.
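The ordering can be checked without a network call. In this minimal sketch, `Promise.resolve` stands in for `fetch`, and the `order` array and the `{ photos: 3 }` value are made up for the illustration.

```javascript
// Records the order in which each step runs.
const order = [];

let myresult = null;

// Promise.resolve stands in for a network call such as fetch().
const pending = Promise.resolve({ photos: 3 }).then(result => {
    myresult = result;
    order.push("inside then");
});

// Runs first: the .then callback has not been called yet.
order.push("after promise, myresult is " + myresult);

// Chaining onto the same promise is the safe way to continue.
const done = pending.then(() => {
    order.push("chained, myresult is set: " + (myresult !== null));
    return order;
});
```

Even though the promise is already resolved, the synchronous `order.push` runs before the `.then` callback, so `myresult` is still `null` at that point.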

If you're not comfortable with promises, consider using async / await https://developer.mozilla.org/fr/docs/Web/JavaScript/Reference/Instructions/async_function (it's an alternative API for asynchronous JavaScript code).

Example of an implementation:

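var DataFrame = require('dataframe-js').DataFrame;

async function main() {
    // Wait for the CSV to be read and parsed before going on
    const df = await DataFrame.fromCSV('/opt/test-app/test.csv');
    const myFilteredDf = df.filter(row => row.get("cost") > 1).select("name","sport","cost");
    myFilteredDf.show(3);
    // Continue to work here
}

// Don't try to use df or myFilteredDf here: they only exist inside main

// Start the async function and report any failure (e.g. an unreadable file)
main().catch(console.error);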
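An async function can also return its result, in which case the caller has to await it too. The sketch below assumes a hypothetical `loadRows` function standing in for `DataFrame.fromCSV`, with made-up sample rows, so that it runs without the library or a CSV file.

```javascript
// Hypothetical stand-in for DataFrame.fromCSV: resolves with parsed rows.
function loadRows() {
    return Promise.resolve([
        { name: "Bat", sport: "baseball", cost: 100 },
        { name: "Sock", sport: "tennis", cost: 1 },
    ]);
}

// An async function always returns a promise, even when it returns a plain value.
async function loadExpensive() {
    const rows = await loadRows();
    return rows.filter(row => row.cost > 1);
}

async function main() {
    // The caller awaits the returned promise to get the filtered rows.
    const expensive = await loadExpensive();
    console.log(expensive.map(row => row.name)); // prints [ 'Bat' ]
    return expensive;
}

const finished = main();
```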

Tell me if that's not clear.

If you need to understand how JS works, consider these books:

starlingfire commented 5 years ago

Thanks for the reply! I'm not actually a complete beginner to JS, though my question does make it seem that way. Your explanations are great, and your example helped me solve the issue I was having.

I modified the code to declare a global variable "mydf" before creating the promise, assign it the value of "df" inside the promise, and finally reference it outside the promise after a timeout to ensure it was populated.

var DataFrame = require('dataframe-js').DataFrame;

let mydf = null;

DataFrame.fromCSV('/opt/streamlabs-socket-client/test.csv')
.then(
  df => {
    mydf = df;
  }
);

setTimeout(function() {
  const myOtherFilteredDf = mydf.filter(row => row.get("cost") > 1).select("name","sport","cost");
  myOtherFilteredDf.show(3);
}, 3000);

Now I can create filtered dataframes by referencing "mydf" whenever I need to, though I will probably put a better check in place that the dataframe was successfully created instead of using the timeout function. Thanks again for your help!
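One possible alternative to the timeout is to keep the promise itself and await it wherever the data is needed. This is only a sketch: `loadCsv`, `getData`, and `namesCostingMoreThan` are hypothetical names, `loadCsv` stands in for `DataFrame.fromCSV`, and the rows are made-up sample data, so the example runs without the library.

```javascript
// Hypothetical loader standing in for DataFrame.fromCSV(path).
// The counter only exists to show how many times the "file" is read.
let reads = 0;
function loadCsv(path) {
    reads += 1;
    return new Promise(resolve => {
        setTimeout(() => resolve([
            { name: "Bat", sport: "baseball", cost: 100 },
            { name: "Baseball", sport: "baseball", cost: 200 },
            { name: "Sock", sport: "tennis", cost: 1 },
        ]), 10);
    });
}

// Start loading once and keep the promise, not the eventual value.
let cached = null;
function getData() {
    if (cached === null) {
        cached = loadCsv("test.csv");
    }
    return cached;
}

// Any part of the program can await the same promise.
async function namesCostingMoreThan(limit) {
    const rows = await getData();
    return rows.filter(row => row.cost > limit).map(row => row.name);
}

// Two independent callers share a single read of the data.
const demo = Promise.all([
    namesCostingMoreThan(1),
    namesCostingMoreThan(150),
]).then(([a, b]) => ({ a, b, reads }));
```

Because every caller awaits the same promise, nothing runs before the data is ready, and the source is read only once no matter how many callers there are.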