You may be using an earlier version of the feature definitions. With the latest release of Featuretools (v0.3.0), older saved feature definitions are not compatible. Make sure you are using the latest version of the feature definitions, which can be found here and are called `features.txt`.
Hello, Will!
With the new `features.txt` and after upgrading Featuretools (v0.3.0), I split the data into 120 partitions (48 GB RAM, 24 cores). It only took 1120.01 seconds to run `b.compute()` (so it seems it takes only about 18 minutes to execute dfs on this dataset, right?). I do get `p1_fm.csv` through `p120_fm.csv`, but some errors may have occurred?

Besides, processing the first partition on a single core took only 126 s. Can I say that without Dask I would need 126 s * 120 partitions = 15120 s = 4.2 h on my machine? That seems different from your saying it "takes about 25 hours on an AWS EC2 machine".
Thanks for the update! We have noticed some of the same warning messages when we run dfs, but I don't think they affect the calculation of the features. We are working on fixing these issues. One way to check that the calculation was successful would be to join the individual feature matrices into one and make sure the size is as expected.
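A minimal sketch of that sanity check with pandas (the `p1_fm.csv` through `p120_fm.csv` file names are taken from the run described above; the expected row count is something you would fill in from your raw data):

```python
import pandas as pd

# Read each partition's feature matrix and stack them.
# File names follow the p{i}_fm.csv pattern from the run above.
parts = [pd.read_csv("p%d_fm.csv" % i) for i in range(1, 121)]
fm = pd.concat(parts, ignore_index=True)

# Every partition should have the same columns, and the combined
# matrix should have one row per instance in the original data.
assert all(p.shape[1] == parts[0].shape[1] for p in parts)
print(fm.shape)
```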
In regards to the amount of time, Featuretools v0.3.0 is much faster (at least 50% in most cases) than earlier versions. Below is a graph comparing the speedup of the newest version on a number of tests we run on each release!
It might be possible that running on a single core would only take ~ 4 hours. However, I still think there is a significant speedup with Dask because you are able to use all cores on your machine. Also, this dataset is not that large so running the entire calculation at once is possible. For larger datasets that can't fit in memory, using the partitioning and running in parallel approach is the only way to complete the calculation. Learning Dask (or another parallelization framework such as Spark) is a good time investment if you want to work with large datasets and use your hardware efficiently.
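As a rough sketch of that partition-and-parallelize pattern with Dask (the `partition_paths` list and the per-partition helper are hypothetical stand-ins; the actual notebook's setup differs):

```python
import dask
import pandas as pd

# Hypothetical paths to the 120 input partitions described above.
partition_paths = ["p%d.csv" % i for i in range(1, 121)]

@dask.delayed
def feature_matrix_for(path):
    # Placeholder for the per-partition work: load one partition
    # and compute its feature matrix (e.g. with ft.dfs).
    df = pd.read_csv(path)
    return df  # replace with the real feature matrix

# Build the task graph lazily, then compute all partitions in
# parallel across the available cores.
tasks = [feature_matrix_for(p) for p in partition_paths]
results = dask.compute(*tasks)
```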
I am very surprised to hear this news, thanks for your contribution!
Another question, Will. Can Featuretools extract features from a single table (like the Titanic data set)? If so, what kinds of features can it extract?
Yes, Featuretools can extract features from a single table using Transform primitives. You can see a list of all transform primitives in the docs. Transform primitives combine different columns, such as through arithmetic operations, or extract additional information from columns, such as the time of day or day of week. An example of using transform primitives can be found in the Loan Repayment notebook.
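Here is a minimal sketch on a single table (the `titanic.csv` file and its columns are assumptions based on the Titanic data set; the exact primitives available depend on your Featuretools version):

```python
import featuretools as ft
import pandas as pd

# A single-table Titanic-style dataframe (path is an assumption).
df = pd.read_csv("titanic.csv")

es = ft.EntitySet(id="titanic")
es = es.entity_from_dataframe(entity_id="passengers",
                              dataframe=df,
                              index="PassengerId")

# With only one entity, dfs can still build transform features
# by re-expressing or combining the existing columns.
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="passengers",
                                      trans_primitives=["absolute"],
                                      max_depth=1)
```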
You can also create additional entities from a single table by normalizing the table. There is an example of this in the Retail Spending notebook. I haven't used Featuretools on the Titanic dataset, but it should be possible to create more entities using the passenger class (`Pclass`) or the cabin (`Cabin`). Once you have created more entities, you can use aggregation primitives to make additional features.
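A sketch of that normalization step, continuing from the hypothetical EntitySet above:

```python
# Split off a new "pclass" entity so each passenger class becomes
# an instance that aggregation primitives can summarize over.
es = es.normalize_entity(base_entity_id="passengers",
                         new_entity_id="pclass",
                         index="Pclass")

# Aggregation features such as the mean fare or passenger count
# per class are now possible.
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="passengers",
                                      agg_primitives=["mean", "count"],
                                      max_depth=2)
```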
As a side note, a good place for posting general questions about Featuretools is Stack Overflow. Tag the question with Featuretools so we'll see it! We enjoy answering questions, and answers on a larger forum such as Stack Overflow can help out more people.
@pjgao Are you still having issues or did using the new features solve the problem?
I ran the notebook Featuretools on Dask.ipynb on my local machine; however, something went wrong when `b.compute()` ran. 10 feature matrices had been generated when the error happened. Here is the error info: