data-forge / data-forge-ts

The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
http://www.data-forge-js.com/
MIT License
1.34k stars 77 forks source link

Renaming columns #6

Closed phrz closed 5 years ago

phrz commented 6 years ago

Is there a better way to rename columns in a DataFrame than the following code? I could not find any information for renaming columns in the Guide or API doc. In this code, I'm trying to rename columns "Like This" to "like_this".

slugify = (s) => s.toLowerCase().replace(' ','_').replace('/','_or_')

rawData = 
    DataForge.readFileSync('data.csv')
    .parseCSV({ dynamicTyping: true })

for(let name of rawData.getColumnNames()) {
    let s = rawData.getSeries(name)
    rawData = rawData.dropSeries(name).withSeries(slugify(name), s)
}
ashleydavis commented 6 years ago

Yeah there is! Are you able to use the function renameSeries?

const renamedDf = rawData.renameSeries({ "OldColumnName": "NewColumnName" });

You might be able to use it like this:

let rawData = 
    DataForge.readFileSync('data.csv')
    .parseCSV({ dynamicTyping: true })

let renameSpec = {};

for(let name of rawData.getColumnNames()) {
   renameSpec[name] = slugify(name);
}

rawData = rawData.renameSeries(renameSpec);

Otherwise I'll have to think about this, seems like there should be a better way to do it!

phrz commented 6 years ago

Not to overcomplicate things but I switched this little program over to TypeScript to see if it made things easier, so specifically the below code (analogous to your suggestion) did in fact work fine with all my strict tsconfig.json settings.

let renameSpec: {[index: string]: string;} = {}

for(let name of rawData.getColumnNames()) {
    renameSpec[name] = slugify(name)
}

rawData = rawData.renameSeries(renameSpec)

However, I'm curious: given the definition of IColumnRenameSpec, why does this function not accept a renameSpec that is a Map? For example, if I produced renameSpec as follows:

let renameSpec: Map<string, string> = 
    rawData.getColumnNames().map((s: string) => [s, slugify(s)])

I am by no means a TypeScript expert, so I'm honestly wondering why.

ashleydavis commented 6 years ago

You are correct! It should be using Map already. Thanks for leaving feedback and pointing it out.

I'll leave this issue open until I get it fixed.

ashleydavis commented 6 years ago

Hey I tried using Map but it didn't seem to work the way I hoped.

It makes the unit tests fail to compile.

Specifically I can't do this with map:

        var df = new DataFrame({
            columns: {
                A: [1, 2, 3, 4],
                B: ['a', 'b', 'c', 'd'],
            },

            index: [11, 12, 13, 14],
        });

Where the 'columns' field is a Map an initialized from a literal object.

I extracted the 'columns' initializer to a separately typed variable so I could more clearly see the error. Here is the extracted code:

        const columns: Map<string, Iterable<any> | ISeries<any, any>> = {
            A: [1, 2, 3, 4],
            B: ['a', 'b', 'c', 'd'],
        };

This is the error:

[ts]
Type '{ A: number[]; B: string[]; }' is not assignable to type 'Map<string, Iterable<any> | ISeries<any, any>>'.
  Object literal may only specify known properties, and 'A' does not exist in type 'Map<string, Iterable<any> | ISeries<any, any>>'.

What do you think the problem is?

Am I trying to use Map the wrong way?

phrz commented 6 years ago

Thanks for your patience! I did a lot of reading about this topic, and there seems to be no way in Typescript to use polymorphism or generics to treat Objects and Maps the same. Typescript does not have Generic specialization, or otherwise a way to have two functions with the same name accepting different types as parameters. As such, my proof of concept uses (abuses?) instanceof. Run-time type checking is ugly, but it's the only solution I could find for this specific issue.

I threw this PoC together quickly, but it boils down to some type definitions to make things more readable (only for the sake of example), followed by a function loadDataFrameSpec that uses instanceof to decide whether to pass the spec parameter onto loadDataFrameSpecMap or loadDataFrameSpecObject, which can be thought of as the "specialized" versions of the pseudo-generic loadDataFrameSpec. These dummy functions just iterate over their respective parameters to prove that you can get identical output through different iteration behavior.

I define key and column during Map iteration to prove that these elements can be extracted from the Map while maintaining type safety. I don't do this for the Object because I assume you are already successfully doing just that. All in all, sorry this PoC doesn't really tie into your code and use its internal types, I haven't had the time to pore over the source code very much.


// target: es2015

import * as DF from 'data-forge';

type ColumnType = Iterable<any> | DF.ISeries<any,any>;
interface StringIndexedObject<V> { [key: string]: V; };
type DataFrameSpecObject = StringIndexedObject<ColumnType>;

// this function checks type and calls
// the next two functions
function loadDataFrameSpec(spec: any) {
    console.log('\nCalled loadDataFrameSpec');
    if(spec instanceof Map) {
        loadDataFrameSpecMap(spec);
    } else if(spec instanceof Object) {
        loadDataFrameSpecObject(spec);
    } else {
        console.warn('Unexpected type! Do something...');
    }
}

function loadDataFrameSpecMap(spec: Map<string, ColumnType>) {
    console.log('loadDataFrameSpecMap running');
    for(let entry of spec.entries()) {
        const key: string = entry[0];
        const column: ColumnType = entry[1];
        console.log(`${key} => ${column}`);
    }
}

function loadDataFrameSpecObject(spec: DataFrameSpecObject) {
    console.log('loadDataFrameSpecObject running');
    for(let key in spec) {
        console.log(`${key} => ${spec[key]}`);
    }
}

// example
const plainOldObject = {
    A: [1, 2, 3, 4],
    B: ['a', 'b', 'c', 'd'],
};

const mapInstead = new Map<string,any>([
    ['A', [1,2,3,4]],
    ['B', ['a','b','c','d']]
]);

loadDataFrameSpec(plainOldObject);
loadDataFrameSpec(mapInstead);

The output


Called loadDataFrameSpec
loadDataFrameSpecObject running
A => 1,2,3,4
B => a,b,c,d

Called loadDataFrameSpec
loadDataFrameSpecMap running
A => 1,2,3,4
B => a,b,c,d
ashleydavis commented 6 years ago

Wow thanks for doing the research on this.

You are welcome to try and add this. I'll accept pull request for this as long as tests still work and it doesn't complicate the code too much.

ashleydavis commented 6 years ago

Hey Paul,

Just wondering if you've seen my new library yet?

It's called Data-Forge Plot and integrates plotting/charting with Data-Forge.

Please check out my blog post on it.

It's early days yet, but I'm trying to collect feedback on it.

phrz commented 6 years ago

Ashley, I'd be happy to take a look in my free time. I'm still waiting for the opportunity to sit down and work on this patch if possible.

ashleydavis commented 6 years ago

Awesome, let me know what you think!

grahamalama commented 6 years ago

Hey there. Don't know if this is useful in any way, but I found that a relatively simple solution to rename columns was to access the df.content.columnNames and df.content.values.columnNames properties of the dataframe and pass them a new array of column names. You could either just pass a new array like I did, or maybe apply some function (slugify, like @phrz 's example) to the array in place.

ashleydavis commented 6 years ago

Be careful with the content property!

This is internal, undocumented and it is marked as private in TypeScript.

I reserve the right to change the way this works under the hood in the future, which will break your code if you depend on it.

Also the content property is lazily generated once the datafame is evaluated, it's set to null otherwise so you can't rely on it.

If you'd like to suggest a better API for renaming columns that is more convenient, I'd be happy to look into implementing it.

ashleydavis commented 5 years ago

There doesn't seem to be an action that can be take for this issue, so I'm closing it.