AutoViML / Auto_ViML

Automatically Build Multiple ML Models with a Single Line of Code. Created by Ram Seshadri. Collaborators Welcome. Permission Granted upon Request.
Apache License 2.0

How should I stack the data of different samples into one single dataframe? #34

Closed JulesLiu closed 1 year ago

JulesLiu commented 1 year ago

Hello! My background is biology, so I'm a beginner in this field and a bit confused about how to put the whole training dataset into a single dataframe variable to load. I will first give a general introduction to what we want to do, then describe the two ways of stacking the data that I came up with. Please tell me which is the right way to do it, and please correct anything I have misunderstood.

In our project, we have accumulated long-term records for many samples, and we hypothesize that several parameters reflect the chance of an 'event' of interest. Most likely, the parameters in the 2 months before the 'event' are useful for indicating it, with the ~20 days before it carrying a higher weight. We don't know which model is best for fitting these parameters, so I think AutoML is the most suitable approach to find out, right? Now to the data: Table 1 is an example of what our data look like. I shortened the total time span and set the 'prediction period' to 5 days so that the tables won't be too long; this applies to all the example tables in this issue.

Table 1 Example of data from one sample


| Date | parameter1 | parameter2 | … | parameter8 | EventDetection |
| -- | -- | -- | -- | -- | -- |
| 20221001 | xx | xx | xx | xx | no |
| 20221002 | xx | xx | xx | xx | no |
| 20221003 | xx | xx | xx | xx | no |
| 20221004 | xx | xx | xx | xx | no |
| 20221005 | xx | xx | xx | xx | no |
| 20221006 | xx | xx | xx | xx | no |
| 20221007 | xx | xx | xx | xx | yes |
| 20221008 | xx | xx | xx | xx | no |
| 20221009 | xx | xx | xx | xx | no |
| 20221010 | xx | xx | xx | xx | no |
| 20221011 | xx | xx | xx | xx | yes |
| 20221012 | xx | xx | xx | xx | yes |
| 20221013 | xx | xx | xx | xx | yes |
| 20221014 | xx | xx | xx | xx | no |

I have data from many samples, so how should I put them together to let AutoViML know that these 10 entries are from sample 1 and the next 10 are from sample 2? Should I just concatenate them and add a column labelling which sample each row comes from, as in Table 2? Or should I split the records from each sample into windows of the prediction period, as in Table 3 (1_1: sample 1 from day 1 to day 5; 1_2: sample 1 from day 2 to day 6; 1_3: sample 1 from day 3 to day 7; ……; sample 100 from day 2 to day 6; sample 100 from day 3 to day 7; ……)?

Table 2 Data from samples in tandem


| Sample# | Date | parameters | EventDetection |
| -- | -- | -- | -- |
| 1 | 20221001 | xx | no |
| 1 | 20221002 | xx | no |
| 1 | 20221003 | xx | no |
| 1 | 20221004 | xx | no |
| 1 | 20221005 | xx | no |
| 1 | 20221006 | xx | no |
| 1 | 20221007 | xx | yes |
| 1 | 20221008 | xx | no |
| 1 | 20221009 | xx | no |
| 1 | 20221010 | xx | no |
| 1 | 20221011 | xx | yes |
| 1 | 20221012 | xx | yes |
| 1 | 20221013 | xx | yes |
| 1 | 20221014 | xx | no |
| 2 | 20221001 | xx | no |
| 2 | 20221002 | xx | no |
| 2 | 20221003 | xx | yes |
| 2 | 20221004 | xx | no |
| 2 | 20221005 | xx | no |
| 2 | 20221006 | xx | no |
| 2 | 20221007 | xx | no |
| 2 | 20221008 | xx | no |
| 2 | 20221009 | xx | no |
| 2 | 20221010 | xx | no |
| 2 | 20221011 | xx | yes |
| 2 | 20221012 | xx | no |
| 2 | 20221013 | xx | yes |
| 2 | 20221014 | xx | no |
| 3 | 20221001 | xx | no |
| 3 | 20221002 | xx | no |
| 3 | 20221003 | xx | yes |
| 3 | 20221004 | xx | yes |
| 3 | 20221005 | xx | yes |
| 3 | 20221006 | xx | no |
| 3 | 20221007 | xx | no |
| 3 | 20221008 | xx | no |
| 3 | 20221009 | xx | yes |
| 3 | 20221010 | xx | no |
| 3 | 20221011 | xx | no |
| 3 | 20221012 | xx | no |
| 3 | 20221013 | xx | yes |
| 3 | 20221014 | xx | no |
| … | … | … | … |
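The Table 2 layout can be sketched in pandas roughly as follows. This is only an illustration under assumptions: the per-sample frames, the `stack_samples` helper, and the placeholder values are hypothetical, and the column names simply mirror Table 1. It is not taken from AutoViML itself.

```python
import pandas as pd

# Hypothetical per-sample tables keyed by sample number.
# Values are placeholders, not real measurements.
sample_frames = {
    1: pd.DataFrame({
        "Date": ["20221001", "20221002", "20221003"],
        "parameter1": [0.1, 0.2, 0.3],
        "EventDetection": ["no", "no", "yes"],
    }),
    2: pd.DataFrame({
        "Date": ["20221001", "20221002", "20221003"],
        "parameter1": [0.4, 0.5, 0.6],
        "EventDetection": ["no", "yes", "no"],
    }),
}


def stack_samples(frames):
    """Concatenate per-sample frames, labelling each row with its sample id."""
    labelled = []
    for sample_id, frame in frames.items():
        out = frame.copy()                    # leave the caller's frame untouched
        out.insert(0, "Sample#", sample_id)   # the scalar is broadcast to every row
        labelled.append(out)
    # ignore_index=True gives the stacked table one continuous row index.
    return pd.concat(labelled, ignore_index=True)


train = stack_samples(sample_frames)
print(train)
```

The resulting `train` frame has one row per sample per day, with `Sample#` as the first column, which is the shape shown in Table 2.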

Table 3 Split one experiment sample's records into periods as independent samples for training


| Sample# | Date | parameters | EventDetection |
| -- | -- | -- | -- |
| 1_1 | 20221001 | xx | no |
| 1_1 | 20221002 | xx | no |
| 1_1 | 20221003 | xx | no |
| 1_1 | 20221004 | xx | no |
| 1_1 | 20221005 | xx | no |
| 1_2 | 20221002 | xx | no |
| 1_2 | 20221003 | xx | no |
| 1_2 | 20221004 | xx | no |
| 1_2 | 20221005 | xx | no |
| 1_2 | 20221006 | xx | no |
| 1_3 | 20221003 | xx | no |
| 1_3 | 20221004 | xx | no |
| 1_3 | 20221005 | xx | no |
| 1_3 | 20221006 | xx | no |
| 1_3 | 20221007 | xx | yes |
| 1_4 | 20221004 | xx | no |
| 1_4 | 20221005 | xx | no |
| 1_4 | 20221006 | xx | no |
| 1_4 | 20221007 | xx | yes |
| 1_4 | 20221008 | xx | no |
| … | … | … | … |
| 1_9 | 20221009 | xx | no |
| 1_9 | 20221010 | xx | no |
| 1_9 | 20221011 | xx | yes |
| 1_9 | 20221012 | xx | yes |
| 1_9 | 20221013 | xx | yes |
| … | … | … | … |
| 3_10 | 20221010 | xx | no |
| 3_10 | 20221011 | xx | no |
| 3_10 | 20221012 | xx | no |
| 3_10 | 20221013 | xx | yes |
| 3_10 | 20221014 | xx | no |
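For comparison, the windowed layout of Table 3 could be produced with a sketch like the one below. The `sliding_windows` helper, its `window` argument, and the placeholder frame are assumptions made for illustration; the `<sample>_<k>` labels follow the naming used in Table 3.

```python
import pandas as pd


def sliding_windows(frame, sample_id, window=5):
    """Cut one sample into overlapping `window`-row slices labelled `<sample>_<k>`.

    Assumes the frame has a sortable "Date" column and at least `window` rows.
    """
    ordered = frame.sort_values("Date").reset_index(drop=True)
    pieces = []
    for start in range(len(ordered) - window + 1):
        piece = ordered.iloc[start:start + window].copy()
        piece.insert(0, "Sample#", f"{sample_id}_{start + 1}")
        pieces.append(piece)
    return pd.concat(pieces, ignore_index=True)


# Placeholder data: 14 consecutive days, as in Table 1.
dates = pd.date_range("2022-10-01", periods=14).strftime("%Y%m%d")
one_sample = pd.DataFrame({
    "Date": dates,
    "parameter1": range(14),
    "EventDetection": ["no"] * 14,
})

windows = sliding_windows(one_sample, sample_id=1)
print(windows["Sample#"].nunique(), len(windows))
```

With 14 days and a 5-day window this yields 10 windows and 50 rows, because each original row is repeated in up to `window` slices. The Table 3 layout therefore needs several times more rows than the Table 2 layout for the same records.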

Any help would be appreciated, thanks in advance!

AutoViML commented 1 year ago

Table 2 looks good. AutoViML will be able to use the variables correctly.

Auto Vimal

(Sent by email in reply to JulesLiu's message of Friday, August 11, 2023 at 05:20:28 PM EDT.)

JulesLiu commented 1 year ago


Thanks! This is good news for me, since it makes the data much easier to load and saves memory.