RubixML / ML

A high-level machine learning and deep learning library for the PHP language.
https://rubixml.com
MIT License

Predict market trend up or down, unlabeled? #38

Open BasvanH opened 5 years ago

BasvanH commented 5 years ago

Hello,

I'm starting with ML and trying to predict stock trend up or down based on stock history. There are two challenges which I cannot seem to solve at the moment.

I have my stock history, this is data containing the price, volume and amount of trades at a certain point of time. I think I need to class this as Unlabeled data as I have not labeled them what trend a certain datapoint is in. Am I correct in this? When training the history data I get a warning it's missing labels. So I'm kind of lost how to handle/train unlabeled data.

Secondly, a timeline is also in play. I do not know how to handle this in the library.

Any help is much appreciated.

Thanks, Bastiaan

andrewdalpino commented 5 years ago

Hi @BasvanH, thanks for the great question.

See this issue regarding time-series datasets: https://github.com/RubixML/RubixML/issues/35. In short, we do not directly support time-series data yet.

Since stock price is non-stationary, you will get the best results from an algorithm that directly supports time series.

That said ...

Supervised learners such as classifiers and regressors require a training signal in the form of labels.

Unsupervised learners such as clusterers and anomaly detectors do not require labels.

Your problem can be viewed as a classification problem, in which the prediction will be the trend 'up' or 'down', or as a regression problem, where the prediction is the direction (+/-) and degree of the trend from a baseline (e.g. 0).

Your problem could also be framed as a clustering problem, in which case you can try to isolate clusters of upward and downward trends. You can also use an anomaly detector to predict when a stock is abnormally trending up or down.

So there are multiple ways, and also combinations of methods, that you can go about building a stock predicting system. I would avoid the unsupervised methods for now and focus on the supervised methods to start. Again, you will need a good Labeled dataset.

Are you able to automate the labeling process in any way?

Can you discretize the 'price' variable such that, if it is above a rolling (windowed) average, the label will be 'up' and in contradistinction 'down' if it is below the moving average?
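A minimal sketch of the labeling idea above, written in plain Python rather than with any Rubix ML API. The function name `label_by_moving_average`, the `window` parameter, and the choice to include the current price in the window are all assumptions made for illustration; a price exactly equal to the average is treated as 'down' here, which is also an arbitrary choice.

```python
from collections import deque


def label_by_moving_average(prices, window=30):
    """Label each price 'up' or 'down' relative to a trailing moving average.

    The average covers the current price and the `window - 1` prices before it.
    Points that do not yet have a full window are labeled None so they can be
    dropped before training.
    """
    labels = []
    recent = deque(maxlen=window)  # holds at most the last `window` prices
    for price in prices:
        recent.append(price)
        if len(recent) < window:
            labels.append(None)
            continue
        average = sum(recent) / window
        labels.append('up' if price > average else 'down')
    return labels
```

Rows labeled `None` would be filtered out before building the labeled dataset, since a supervised learner needs a label for every sample.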

BasvanH commented 5 years ago

Hi @andrewdalpino,

Thank you for taking the time to write such a detailed answer, much appreciated!

I have PHP experience, so I chose your library, as I think it's the most advanced and complete one in PHP. Looking at libraries in other languages would mean much more time for me to learn ML. So I will stick with you despite not having a time-based algorithm yet :-) . You have already done a great job!

So labeling is the way to go. Yes, I can process the history with a moving average and determine the trend based on whether the price is above or below it. I will move ahead and write this part.

First I want to start relatively simple, so with a classifier. Do you have any advice on which one to use?

andrewdalpino commented 5 years ago

@BasvanH No problem, welcome to our community!

Are you able to obtain more features for your dataset or do you just have the 3 that you mentioned?

How many samples do you have?

I would recommend starting with either Logistic Regression or Random Forest.

Logistic Regression is a simple linear classifier that has an associated tutorial here. The nice thing about Logistic Regression is that it can be partially trained (implements the Online interface) - thus, you can train it with new data as soon as it comes in. This will help the model to compensate for the fact that the data is non-stationary.
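To illustrate what partial (online) training means in general, here is a small stand-alone sketch in plain Python. It is not Rubix ML's implementation or API: the class `OnlineLogisticRegression`, its method `partial_train`, and the learning rate are hypothetical names and values chosen for this example. Each call to `partial_train` continues from the current weights instead of starting over, which is what lets the model follow data whose distribution drifts.

```python
import math


class OnlineLogisticRegression:
    """Binary logistic regression updated one sample at a time (SGD)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, sample):
        """Estimated probability that the sample belongs to class 'up'."""
        z = self.bias + sum(w * x for w, x in zip(self.weights, sample))
        z = max(-60.0, min(60.0, z))  # clamp to keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-z))

    def partial_train(self, samples, labels):
        """Take one gradient step per sample, keeping the existing weights."""
        for sample, label in zip(samples, labels):
            target = 1.0 if label == 'up' else 0.0
            error = self.predict_proba(sample) - target
            self.weights = [
                w - self.learning_rate * error * x
                for w, x in zip(self.weights, sample)
            ]
            self.bias -= self.learning_rate * error

    def predict(self, sample):
        return 'up' if self.predict_proba(sample) >= 0.5 else 'down'
```

In use, the model would be trained on the history first and then fed each new batch of labeled minutes through `partial_train` as it arrives.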

Random Forest is a non-linear ensemble method that you can try if you need a more flexible model.

Once you have enough labeled data, make sure to set about 20% of it aside to use as testing data and validate your model. The F Beta metric will give you a good idea as to how well it performs.
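The hold-out and scoring steps can be sketched in plain Python as follows. This is not the Rubix ML API: `holdout_split` and `f_beta` are hypothetical helpers. Two assumptions are made here that the comment above does not state: the split is chronological (the most recent 20% is held out, which seems a reasonable choice for time-ordered data), and the score is computed for a single positive class ('up') rather than averaged over classes.

```python
def holdout_split(samples, labels, test_ratio=0.2):
    """Keep the most recent `test_ratio` share of the rows for testing."""
    n_test = round(len(samples) * test_ratio)
    cut = len(samples) - n_test
    return samples[:cut], labels[:cut], samples[cut:], labels[cut:]


def f_beta(predictions, labels, positive='up', beta=1.0):
    """F-beta score for one class: (1 + b^2) * TP / ((1 + b^2) * TP + b^2 * FN + FP)."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == positive and y == positive)
    fp = sum(1 for p, y in pairs if p == positive and y != positive)
    fn = sum(1 for p, y in pairs if p != positive and y == positive)
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0
```

With `beta=1.0` this is the familiar F1 score; a larger `beta` weights recall more heavily than precision.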

BasvanH commented 5 years ago

My dataset has a datapoint every full minute and contains the following features:

I calculate SMA on each datapoint based on 30 datapoints/minutes ahead.

I'm adding trend to my dataset:

I will use trend as my label, but I'm also interested in whether it would make sense to add the difference as a label to indicate how strong the up or down trend is.

I have my dataset ready, and I'm going to move on to reading up on the Logistic Regression classifier.
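One way to think about the "difference as a label" question, following the earlier classification-versus-regression framing: the trend is a class label for a classifier, while the signed difference from the SMA is a continuous target for a regressor. A hedged sketch in plain Python, assuming the price and SMA columns are already computed; `build_targets` is a hypothetical helper and not part of Rubix ML.

```python
def build_targets(prices, smas):
    """Derive both candidate targets from aligned price and SMA series.

    Returns (trends, differences), where each trend is 'up' or 'down' and each
    difference is price minus SMA (positive above the average, negative below).
    """
    trends = []
    differences = []
    for price, sma in zip(prices, smas):
        difference = price - sma
        differences.append(difference)
        trends.append('up' if difference > 0 else 'down')
    return trends, differences
```

A given model would be trained on one of the two targets, not both at once: `trends` for a classifier, or `differences` for a regressor.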

andrewdalpino commented 5 years ago

Looks like you are well on your way, @BasvanH.

Keep us updated on your progress, and don't hesitate to follow up with questions.

Also, given the recent interest (https://github.com/RubixML/RubixML/issues/40, https://github.com/RubixML/RubixML/issues/35), we may start implementing time series features if they will better serve our users

zenichanin commented 3 years ago

Hey @BasvanH did you have any luck with this? I'm trying to also use the RandomForest algo but the link above is broken and I did not find any examples using this algo on any demo pages.

andrewdalpino commented 3 years ago

Here is a link to the current Random Forest documentation

https://docs.rubixml.com/latest/classifiers/random-forest.html