Open festeh opened 3 years ago
Hi @festeh,
Did you use the download script mentioned here: https://github.com/Building-ML-Pipelines/building-machine-learning-pipelines ? We preprocessed the data set to make things easier for readers. The full data set requires more RAM, and we weren't sure whether readers would have access to instances that meet those requirements.
Yes, but I would also like to reproduce it myself, and the reproduction instructions in the README are currently incomplete. The full dataset has some rare pitfalls, such as a '+' sign in the zip code, that would break the pipeline. So it would be great to either list the full steps needed to obtain the final dataset or remove this confusing section.
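For anyone hitting the same pitfall, here is a minimal pandas sketch for locating such rows before they reach the pipeline. The column name `zip_code`, the "exactly five digits" rule, and the helper name `find_suspicious_zip_codes` are all assumptions made for illustration; the real export may use a different header or other valid formats, so the pattern would need adjusting.

```python
import pandas as pd


def find_suspicious_zip_codes(df, column="zip_code", pattern=r"\d{5}"):
    """Return rows whose zip code is present but does not fully match `pattern`.

    `column` and `pattern` are illustrative assumptions, not the book's rules.
    """
    values = df[column]
    present = values.notna()
    # Compare as strings so numeric or mixed-type columns are handled too.
    ok = values.astype(str).str.fullmatch(pattern)
    return df[present & ~ok]


# Hypothetical sample standing in for the full download.
sample_zips = pd.DataFrame({"zip_code": ["12345", "9021+", None, "60614"]})
bad_rows = find_suspicious_zip_codes(sample_zips)
print(bad_rows)
```

Missing zip codes are deliberately not flagged here; whether to drop or impute them is a separate decision.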
Hi!
I downloaded the dataset from https://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data, applied the suggested preprocessing steps (including removing rows with an empty `consumer_complaint_narrative`), and saved the result to a CSV file. It has 600718 rows, while your CSV file has only 66800 rows. Is the source link correct, or am I missing a preprocessing step? Thanks!
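For reference when comparing row counts, the narrative filter described above can be sketched roughly as follows. The column name `consumer_complaint_narrative` comes from this post; the helper name `drop_empty_narratives` and the small in-memory frame are hypothetical, and this is not necessarily the exact preprocessing used to produce the smaller file.

```python
import pandas as pd


def drop_empty_narratives(df, column="consumer_complaint_narrative"):
    """Keep only rows whose narrative is present and not blank."""
    text = df[column]
    # Missing values and whitespace-only strings both count as empty.
    keep = text.notna() & (text.astype(str).str.strip() != "")
    return df[keep].reset_index(drop=True)


# Hypothetical stand-in for the downloaded CSV; on the real file this
# frame would come from pd.read_csv.
sample = pd.DataFrame({
    "consumer_complaint_narrative": [
        "Card was charged twice", None, "   ", "Loan denied",
    ],
    "product": ["Credit card", "Mortgage", "Mortgage", "Student loan"],
})
cleaned = drop_empty_narratives(sample)
print(len(sample), "rows before,", len(cleaned), "rows after")
```

Printing the count before and after each step like this makes it easier to see which step accounts for a difference in row counts.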