adamFinastra commented 4 years ago

Hello,

Is there any more documentation (notebooks or examples) on transforming data into the format the model requires?

Say I have the following columns where Amount is the numerical column and tag is the categorical column:

userId | date | Amount | Tag
A0 | 2020-01-01 | 200 | green
A0 | 2020-01-02 | 300 | blue
A2 | 2020-01-01 | 218 | red
A2 | 2020-01-02 | 242 | pink
A3 | 2020-01-04 | 38 | red

What is the difference between data features and data attributes? I see that I would have to one-hot encode Tag into 4 columns [green, blue, red, pink] in this example, so the dataframe would become:

userId | date | Amount | green | blue | red | pink
A0 | 2020-01-01 | 200 | 1 | 0 | 0 | 0
A0 | 2020-01-02 | 300 | 0 | 1 | 0 | 0
A2 | 2020-01-01 | 218 | 0 | 0 | 1 | 0
A2 | 2020-01-02 | 242 | 0 | 0 | 0 | 1
A3 | 2020-01-04 | 38 | 0 | 0 | 1 | 0

How would I go about creating the data features/attributes? I am having a hard time understanding how I would transform some simple data as above to work with the model.

Thank you!

fjxmlzn commented 4 years ago

Sorry for the confusion about features and attributes. You can refer to the paper for detailed explanations.

DATA FORMULATION

A sample contains features and attributes. Attributes are the values associated with the entire sample. Features are the values that occur over time.

As for your data, you can treat the data from each user as one sample, the date of the user's first record as the attribute, and the amount and tag as the features.

So,

The attribute of A0 is:

0 (assuming 2020-01-01 is the first day so that we can use integers to represent the date)

The features of A0 are:

Amount, Tag
200, green
300, blue

The attribute of A2 is:

0

The features of A2 are:

Amount, Tag
218, red
242, pink

The attribute of A3 is:

3

The features of A3 are:

Amount, Tag
38, red

Here I assume that each user's records always fall on consecutive days, so we only need to model the first day. If that is not the case, you can add the day difference between consecutive records of a user as an additional feature.
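A minimal sketch of this split, using only the Python standard library. The names `records`, `origin`, and `samples` are illustrative and not part of DoppelGANger, and taking the earliest date in the dataset as day 0 is an assumption. The third value in each feature tuple is the optional day difference mentioned above.

```python
from collections import defaultdict
from datetime import date

# Records from the question: (userId, date, Amount, Tag).
records = [
    ("A0", "2020-01-01", 200, "green"),
    ("A0", "2020-01-02", 300, "blue"),
    ("A2", "2020-01-01", 218, "red"),
    ("A2", "2020-01-02", 242, "pink"),
    ("A3", "2020-01-04", 38, "red"),
]

# Assumption: day 0 is the earliest date anywhere in the dataset.
origin = min(date.fromisoformat(d) for _, d, _, _ in records)

# Group the records by user.
by_user = defaultdict(list)
for user, d, amount, tag in records:
    by_user[user].append((date.fromisoformat(d), amount, tag))

samples = {}
for user, rows in by_user.items():
    rows.sort(key=lambda r: r[0])  # chronological order within a user
    # Attribute: offset (in days) of the user's first record.
    start_day = (rows[0][0] - origin).days
    # Features: one (amount, tag, day_gap) tuple per record.
    features = []
    prev_day = rows[0][0]
    for day, amount, tag in rows:
        features.append((amount, tag, (day - prev_day).days))
        prev_day = day
    samples[user] = {"attribute": start_day, "features": features}
```

If the records really are on consecutive days, the day gap is 0 for the first record and 1 for every later one, so it can be dropped.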

NUMPY FORMAT

Store the data in data_train.npz so that DoppelGANger can read it:

The data_attribute field in this case is a 3x1 matrix of values [[0], [0], [3]].

The data_feature field is a 3x2x5 matrix of values

[[[200, 1, 0, 0, 0], [300, 0, 1, 0, 0]], (use one-hot encoding to represent tags)
 [[218, 0, 0, 1, 0], [242, 0, 0, 0, 1]],
 [[38, 0, 0, 1, 0], [0, 0, 0, 0, 0]]] (zero-padding after the time series ends)

The data_gen_flag is a 3x2 matrix, of values [[1, 1], [1, 1], [1, 0]].
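The three arrays above could be built with NumPy roughly as follows. This is a sketch, not code from the repository: the field names `data_attribute`, `data_feature`, and `data_gen_flag` come from this thread, while `tags`, `sequences`, and `save_dataset` are hypothetical helper names, and the `float32` dtype is an assumption.

```python
import numpy as np

tags = ["green", "blue", "red", "pink"]  # order of the one-hot columns

# One entry per user (A0, A2, A3): start-day attribute and feature sequence.
attributes = [0, 0, 3]
sequences = [
    [(200, "green"), (300, "blue")],
    [(218, "red"), (242, "pink")],
    [(38, "red")],
]

num_samples = len(sequences)
max_len = max(len(seq) for seq in sequences)  # longest time series
feature_dim = 1 + len(tags)                   # amount + one-hot tag

data_attribute = np.array(attributes, dtype=np.float32).reshape(num_samples, 1)
# Arrays start as zeros, so steps past the end of a series stay zero-padded.
data_feature = np.zeros((num_samples, max_len, feature_dim), dtype=np.float32)
data_gen_flag = np.zeros((num_samples, max_len), dtype=np.float32)

for i, seq in enumerate(sequences):
    for t, (amount, tag) in enumerate(seq):
        data_feature[i, t, 0] = amount
        data_feature[i, t, 1 + tags.index(tag)] = 1.0  # one-hot tag
        data_gen_flag[i, t] = 1.0                      # this step is real data


def save_dataset(path):
    """Write the three arrays to an .npz file (hypothetical helper)."""
    np.savez(
        path,
        data_attribute=data_attribute,
        data_feature=data_feature,
        data_gen_flag=data_gen_flag,
    )
```

Calling `save_dataset("data_train.npz")` then writes the file. The values are still raw at this point and would need to be normalized before saving.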

Here I use the raw values of amount and start date for illustration. You will need to normalize each of them into the range [0, 1] or [-1, 1]. The README also contains some explanations of these fields.
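One possible way to do the [0, 1] normalization is min-max scaling, sketched below on the example arrays. Min-max scaling is an assumption here (the thread does not prescribe a method), and `min_max_scale` is a hypothetical helper. The minimum and maximum are computed over real time steps only, and padded steps are kept at zero.

```python
import numpy as np


def min_max_scale(values, lo, hi):
    """Scale linearly so that lo maps to 0 and hi maps to 1."""
    return (values - lo) / (hi - lo)


# Raw example arrays (amount in column 0, one-hot tag in columns 1-4).
data_attribute = np.array([[0.0], [0.0], [3.0]])
data_feature = np.array([
    [[200.0, 1, 0, 0, 0], [300.0, 0, 1, 0, 0]],
    [[218.0, 0, 0, 1, 0], [242.0, 0, 0, 0, 1]],
    [[38.0, 0, 0, 1, 0], [0.0, 0, 0, 0, 0]],
])
data_gen_flag = np.array([[1, 1], [1, 1], [1, 0]], dtype=bool)

# Amount: take min/max over real (non-padded) steps only.
amounts = data_feature[:, :, 0]
amount_lo = amounts[data_gen_flag].min()
amount_hi = amounts[data_gen_flag].max()
scaled = min_max_scale(amounts, amount_lo, amount_hi)
# Write back the scaled amounts, keeping padded steps at zero.
data_feature[:, :, 0] = np.where(data_gen_flag, scaled, 0.0)

# Start-day attribute: scale over all samples.
attr_lo, attr_hi = data_attribute.min(), data_attribute.max()
data_attribute = min_max_scale(data_attribute, attr_lo, attr_hi)
```

Keeping `amount_lo`, `amount_hi`, `attr_lo`, and `attr_hi` around makes it possible to map generated samples back to the original units. The one-hot tag columns are already in [0, 1] and are left unchanged.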

Let me know if anything is still unclear.

adamFinastra commented 4 years ago

Thank you!

AlexPars commented 4 years ago

Question for @fjxmlzn

In the answer below you wrote The data_attribute field is a 3x2x5 matrix of values

Is there a typo, namely "The data_feature field is a 3x2x5 matrix of values" ?

fjxmlzn commented 4 years ago

@AlexPars You are right. Thanks for identifying that! Just fixed it.

fjxmlzn / DoppelGANger

Data Format [Question] #5
