alteryx / featuretools

An open source python library for automated feature engineering
https://www.featuretools.com
BSD 3-Clause "New" or "Revised" License
7.25k stars 879 forks

GitHub link for Featuretools implementation on Spark not working #411

Closed sarvendras closed 5 years ago

sarvendras commented 5 years ago

The GitHub link below for the Featuretools implementation on Spark is not working. Please advise. https://github.com/Featuretools/predicting-customer-churn/blob/master/churn/Feature%20Engineering%20on%20Spark.ipynb

gsheni commented 5 years ago

@sarvendras hey, you're right, the link doesn't seem to be working. Where did you find this link?

sarvendras commented 5 years ago

@gsheni This link is given on the Featuretools site and on medium.com. My objective is to find an example of how to implement Featuretools feature creation on Spark. https://docs.featuretools.com/guides/performance.html https://medium.com/feature-labs-engineering/featuretools-on-spark-e5aa67eaf807

Please suggest any link with an end-to-end Featuretools implementation on Spark.

Thanks

gsheni commented 5 years ago

@sarvendras hey, the issue has been fixed. The docs are updated with the correct link, and the Medium post is also fixed with the correct link: https://github.com/Featuretools/predicting-customer-churn/blob/master/churn/4.%20Feature%20Engineering%20on%20Spark.ipynb

sarvendras commented 5 years ago

@gsheni Thanks a lot for providing the doc. Just one query: is it mandatory to read and write the partitioned files on AWS S3, or can we read and write them locally too, just for POC purposes?
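To make the local option concrete, here is a minimal sketch of writing and reading partitioned files on a local disk. It uses only the Python standard library and is not taken from the linked notebook: the column names (`customer_id`, `amount`), the `N_PARTITIONS` value, the `p<k>/transactions.csv` layout, and the helper names are all hypothetical, chosen only for illustration.

```python
import csv
import tempfile
import zlib
from pathlib import Path

N_PARTITIONS = 4  # hypothetical value; choose based on data size


def partition_of(customer_id: str) -> int:
    # Stable hash so a given customer always lands in the same partition.
    return zlib.crc32(customer_id.encode("utf-8")) % N_PARTITIONS


def write_partitions(rows, base_dir):
    """Write rows to <base_dir>/p<k>/transactions.csv, grouped by customer hash."""
    base_dir = Path(base_dir)
    buckets = {}
    for row in rows:
        buckets.setdefault(partition_of(row["customer_id"]), []).append(row)
    for k, bucket in buckets.items():
        part_dir = base_dir / f"p{k}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "transactions.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["customer_id", "amount"])
            writer.writeheader()
            writer.writerows(bucket)
    return sorted(buckets)  # partition numbers that were actually written


def read_partition(base_dir, k):
    """Read one partition back as a list of dicts (all values are strings)."""
    with open(Path(base_dir) / f"p{k}" / "transactions.csv", newline="") as f:
        return list(csv.DictReader(f))


# Made-up sample data, written to a temporary local directory instead of S3.
rows = [
    {"customer_id": "c1", "amount": "10.0"},
    {"customer_id": "c2", "amount": "5.0"},
    {"customer_id": "c1", "amount": "2.5"},
]
local_dir = tempfile.mkdtemp()
written = write_partitions(rows, local_dir)
```

Because every row for a customer ends up in the same partition, each partition can be processed independently, whether the base path points at local disk or at remote storage.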

One more query: as explained in the link below, suppose features (for example SUM, MEAN, etc.) were created on training data (say, transaction data for the last 2 months), and we want to calculate the same features (SUM, MEAN, etc.) for new data, say the new transactions from a single day. Will the new feature values be added on top of the values computed on the training data, i.e. SUM(train data) + SUM(test data)? Or will the features on test data be calculated only from the new test data?

https://docs.featuretools.com/guides/deployment.html#calculating-feature-matrix-for-new-data
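To make the two readings of this question concrete, here is a small sketch in plain Python. It does not use the Featuretools API; the sample transactions and the `sum_and_mean` helper are made up for illustration. It only shows how SUM and MEAN differ when computed from the new day alone versus from the training rows plus the new day.

```python
# Made-up (customer_id, amount) transactions.
train = [("c1", 10.0), ("c1", 20.0), ("c2", 5.0)]  # e.g. last 2 months
new_day = [("c1", 3.0), ("c2", 7.0)]               # e.g. one new day


def sum_and_mean(transactions):
    """Return {customer_id: (SUM(amount), MEAN(amount))} for the given rows."""
    totals, counts = {}, {}
    for cid, amount in transactions:
        totals[cid] = totals.get(cid, 0.0) + amount
        counts[cid] = counts.get(cid, 0) + 1
    return {cid: (totals[cid], totals[cid] / counts[cid]) for cid in totals}


# Reading 1: features computed from the new data only.
only_new = sum_and_mean(new_day)

# Reading 2: features computed from training data plus new data.
cumulative = sum_and_mean(train + new_day)

print(only_new["c1"])    # (3.0, 3.0)
print(cumulative["c1"])  # (33.0, 11.0)
```

In this sketch the result depends entirely on which rows are passed in, so the two readings correspond to supplying only the new transactions or supplying the historical and new transactions together.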

Thanks in advance :)

Sarvendra