Building-ML-Pipelines / building-machine-learning-pipelines

Code repository for the O'Reilly publication "Building Machine Learning Pipelines" by Hannes Hapke & Catherine Nelson
MIT License

Cannot reproduce dataset #34

Open festeh opened 3 years ago

festeh commented 3 years ago

Hi!

I downloaded the dataset from https://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data, applied the suggested preprocessing steps (including dropping rows with an empty consumer_complaint_narrative), and saved the result to a CSV file. It has 600718 rows, while your CSV file has only 66800 rows. Is the source link correct, or is a preprocessing step missing? Thanks!
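
For anyone comparing row counts the same way, here is a minimal sketch of the check, using only the standard library. It assumes the narrative column is named `consumer_complaint_narrative` as in the comment above; the function name `count_rows` and the sample data are hypothetical, and this is not the preprocessing the repository authors used.

```python
import csv
import io

# Assumption: the narrative column is named as in the comment above.
NARRATIVE_COLUMN = "consumer_complaint_narrative"


def count_rows(csv_file, column=NARRATIVE_COLUMN):
    """Return (total_rows, rows_with_nonempty_column) for an open CSV file."""
    reader = csv.DictReader(csv_file)
    total = 0
    kept = 0
    for row in reader:
        total += 1
        # Treat a missing, empty, or whitespace-only value as empty.
        value = row.get(column) or ""
        if value.strip():
            kept += 1
    return total, kept


# Tiny in-memory example with made-up rows; for a real file, pass
# open(path, newline="", encoding="utf-8") instead.
sample = io.StringIO(
    "product,consumer_complaint_narrative\n"
    "Mortgage,Payment was misapplied\n"
    "Credit card,\n"
    "Debt collection,   \n"
)
print(count_rows(sample))
```

Running this on both the raw download and the provided CSV would show whether the gap comes from the empty-narrative filter alone or from some additional step.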

hanneshapke commented 3 years ago

Hi @festeh ,

Did you use the download script as mentioned here -> https://github.com/Building-ML-Pipelines/building-machine-learning-pipelines ? We preprocessed the data set to make it easier for readers to work with. The entire data set requires more RAM, and we weren't sure whether readers would have access to instances that meet those requirements.

festeh commented 3 years ago

Yes, but I would also like to reproduce it, and the reproduction instructions in the README are currently incomplete. The full dataset has some rare pitfalls, such as a '+' sign in a zip code, that would break the pipeline. So it would be great to either list the full steps needed to obtain the final dataset or remove this confusing section.
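
A rough way to surface pitfalls like the '+' sign before they reach the pipeline is to scan the zip code values up front. The sketch below assumes that a valid zip code is exactly five digits, which may not match the conventions of the real data; the pattern, the function name `find_bad_zip_codes`, and the sample values are all hypothetical.

```python
import re

# Assumption: a valid zip code is exactly five digits. The real data may use
# other conventions, so adjust the pattern to match the file being checked.
ZIP_PATTERN = re.compile(r"\d{5}")


def find_bad_zip_codes(zip_codes, pattern=ZIP_PATTERN):
    """Return (row_index, value) pairs for zip codes that do not match pattern."""
    bad = []
    for index, value in enumerate(zip_codes):
        # None and empty strings are reported as bad rather than skipped.
        text = (value or "").strip()
        if not pattern.fullmatch(text):
            bad.append((index, value))
    return bad


# Hypothetical values, including a '+' sign like the one described above.
sample = ["94103", "9410+", "30301", ""]
print(find_bad_zip_codes(sample))
```

Whether such rows should be dropped or repaired is a separate decision; the point here is only to list them so the choice is explicit.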