alebj88 / Capstone-Next-Word-Predictor

Final project

How Can I Train My Own Data? #1

Open talatccan opened 6 years ago

talatccan commented 6 years ago

Hi, thank you for this amazing project.

I'm not familiar with the R language; I've mostly been working with Python. I want to train my own data, but I didn't understand how I can do that.

alebj88 commented 6 years ago

Hello!

I recommend using the caret package in R. With the train function you can fit different prediction models in a very easy way. It has many friendly functions and is widely used all over the world. Here is a little example:

library(caret)  # You can install it with install.packages("caret")
data(iris)

# Data partition
int <- createDataPartition(y = iris$Sepal.Length, p = 0.7, list = FALSE)
training <- iris[int, ]
testing  <- iris[-int, ]

# Random forest
modFit <- train(Species ~ ., method = "rf", data = training, prox = TRUE)

# Prediction
pred <- predict(modFit, testing)

# Evaluation
confusionMatrix(pred, testing$Species)
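Since you come from Python, here is a rough scikit-learn equivalent of the same workflow, just for orientation (this is a sketch, not part of the project; the split and model settings are only illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Load the same iris data and do a 70/30 split, like createDataPartition(p = 0.7)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0)

# Random forest, like train(method = "rf")
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Prediction and evaluation, like predict() and confusionMatrix()
pred = clf.predict(X_test)
cm = confusion_matrix(y_test, pred)
print(cm)
```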

In this video you can find some information about it: https://www.youtube.com/watch?v=z8PRU46I3NY


talatccan commented 6 years ago

Thank you for the fast reply.

I understood the logic of how it works. I want to train the Capstone Next Word Predictor on a different language (Turkish). I looked into the code and saw that in the tables.R file there are many text files used as datasets (texta, textb, textc, badwords, etc.).

I'm wondering whether I should split my data like the project's datasets.

alebj88 commented 6 years ago

I split my data because my computer didn't have enough memory (only 2 GB) and the original datasets had so many rows. If I had had more RAM, the predictor would have worked much better.

That was the reason why I didn't use a standard machine learning algorithm. They all failed because of low memory.

What you need to do is create the filters for that language and be aware that this project deals with Big Data. You might be interested in packages like "ff" and "ffbase"; they are good options for working with large datasets.
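Since you mostly work in Python, the divide-and-conquer idea can be sketched there too: process the corpus chunk by chunk and only keep the accumulated n-gram counts in memory, never the whole text. This is a minimal illustration of the approach, not the project's actual code (the sample sentences and chunk size are placeholders):

```python
from collections import Counter
from itertools import islice

def bigram_counts(lines, chunk_size=10000):
    """Accumulate bigram counts over an iterable of lines, one chunk at a time,
    so the full corpus never has to fit in RAM."""
    counts = Counter()
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        for line in chunk:
            words = line.lower().split()
            counts.update(zip(words, words[1:]))
    return counts

def predict_next(counts, word):
    """Return the word most often seen after `word`, or None if unseen."""
    candidates = [(n, w2) for (w1, w2), n in counts.items() if w1 == word]
    return max(candidates)[1] if candidates else None

counts = bigram_counts(["the cat sat", "the cat sat down", "the cat ran"])
print(predict_next(counts, "cat"))  # -> sat ("sat" follows "cat" twice, "ran" once)
```

With a real corpus you would pass an open file object as `lines`, so only one chunk of lines is in memory at a time; the same pattern extends to trigrams and higher-order tables.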

Divide and conquer. Greetings.
