Open andreashandel opened 9 months ago
I appreciate your feedback. I have two datasets at my disposal: a private dataset from a company consisting of 27,571 observations with three columns (Customer ID, Date, and Sales Amount), which is largely clean. On the other hand, I've found a more complex dataset on Kaggle, named ‘superstoredata’, with 541,909 observations across nine variables (Invoice No., StockCode, Description, Quantity, Invoice Date, Unit Price, Customer ID, Country, and Sales), offering a broader scope for analysis. I understand the project's emphasis on dealing with real-world data complexities. Could you advise if the private dataset is suitable for our project's objectives, or should I opt for the ‘superstoredata’ to meet the project requirements?
I suggest you give the more complex dataset a try. If it has too many observations for your code to run quickly enough, you can always down-sample it. It will be useful to have the extra variables.
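As a rough illustration of the down-sampling idea, here is a minimal sketch using only the Python standard library. The function name `downsample`, the target size of 50,000 rows, and the synthetic `rows` list are all assumptions made for illustration; they stand in for the real data, which is not shown in this thread.

```python
import random


def downsample(rows, n, seed=0):
    """Return n rows drawn uniformly at random, without replacement.

    If n is at least the number of rows, all rows are returned unchanged.
    A fixed seed keeps the subset reproducible between runs.
    """
    if n >= len(rows):
        return list(rows)
    rng = random.Random(seed)
    return rng.sample(rows, n)


# Hypothetical stand-in for the 541,909-observation dataset:
# each row is an (invoice_no, quantity) pair with made-up values.
rows = [(i, i % 7) for i in range(541_909)]

subset = downsample(rows, 50_000)
print(len(subset))
```

Simple random sampling like this keeps every observation equally likely to be retained. If the analysis depends on complete customer histories, sampling whole customers rather than individual rows may be a better fit.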
This might work. The data isn't described in detail, so I have a hard time assessing it. Make sure the project addresses topics we cover in class. That means, for at least part of it, using and processing real data. The goal is also to apply various models, such as the machine learning approaches we'll discuss, to the data. Developing and testing an EM algorithm is somewhat outside the scope. You can include that part if you want, but you should also cover some other algorithms and some of the upcoming topics, such as model comparison and train/test splits. You might be able to have that as part of the project. Just keep in mind that the project should at least partially demonstrate that you can apply some of the concepts we cover in class. You can of course go beyond that and do other things as well.
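For reference, the train/test idea mentioned above can be sketched in a few lines of standard-library Python. This is only an illustration under assumptions: the helper name `train_test_split`, the 20% test fraction, and the placeholder `rows` are hypothetical and not taken from the course materials.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows form the test set; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder observations; real rows would come from the chosen dataset.
rows = list(range(1000))

train, test = train_test_split(rows)
print(len(train), len(test))
```

Each candidate model would be fit on `train` only and then scored on `test`, so that the model comparison reflects performance on data the models have not seen.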