bukka / php-fann

PHP wrapper for FANN (Fast Artificial Neural Network Library)

allow more data types #28

Closed ghost closed 7 years ago

ghost commented 8 years ago

Have been playing around with this and it looks really cool, but I notice nearly all of the data types are just files and/or resources which is kinda limiting. All of my data is in SQL or NoSQL and is only accessible via ORM. There is no convert to resource capability, and no way to directly pass a database resource that I can find.

Secondly, the last thing I want to do is query a database, write to a file (SLOOOOOOW) for 5000 concurrent users, each one getting their own file. Being able to train from array or object would be really awesome, but I couldn't figure out where to modify in the source and do it myself yet.

geekgirljoy commented 8 years ago

Hi @j88per, I agree the FANN library is really cool! PHP FANN is a wrapper for FANN. You might find this reference useful: http://php.net/manual/en/book.fann.php

Additionally, I have a set of tutorials you can find here: Getting Started with Neural Networks, and Pathfinding from Scratch.

And I will be posting some additional tutorials soon.

As far as your questions and points:

"I notice nearly all of the data types are just files and/or resources which is kinda limiting. All of my data is in SQL or NoSQL and is only accessible via ORM. There is no convert to resource capability, and no way to directly pass a database resource that I can find."

It sounds like what you want is for FANN to 'normalize' your data automatically. Yes, it would be nice if FANN could just connect to anything and prepare your data for you, but generally speaking normalization isn't too difficult to do yourself. You will usually be doing things with your data that are unique to your dataset or methodology, and if you wanted something different from how the devs designed automatic normalization to work, you'd end up doing it by hand anyway; six of one, half a dozen of the other.

Thankfully data normalization for a neural network is surprisingly simple when you think about it.

Here are the basics:

The ANN works with numbers in a small range, typically between -1 and 1, and there are effectively infinitely many values available in that range once you include floating point numbers, e.g. 0.1337.

Now, there are upper limits on what a variable can store on any given system, and PHP.net says this in regards to floating point numbers:

"The size of a float is platform-dependent, although a maximum of ~1.8e308 with a precision of roughly 14 decimal digits is a common value (the 64 bit IEEE format)."

This means a float occupies 64 bits (8 bytes). In the 64-bit IEEE 754 representation, 1 bit is the sign, 11 bits are the exponent, and the remaining 52 bits are the fraction, which is how a number as large as roughly 1.8e+308 can be stored.

That's 1.8 multiplied by 10 to the 308th power, which is a staggeringly large number!

As such, we can be certain that if we represent data this way we have plenty of numbers to work with between -1 and 1.
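You can check these limits directly in newer PHP versions (a quick sketch; the `PHP_FLOAT_*` constants require PHP 7.2 or later):

```php
<?php
// Inspect the float limits described above using PHP's
// built-in constants (PHP 7.2+).

echo PHP_FLOAT_MAX, "\n";     // largest representable float, ~1.8e308
echo PHP_FLOAT_DIG, "\n";     // decimal digits that round-trip exactly (15)
echo PHP_FLOAT_EPSILON, "\n"; // smallest x such that 1.0 + x != 1.0

// Plenty of distinct values fit between -1 and 1:
var_dump(0.1337 > 0.1336); // bool(true)
```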

So first, you have to normalize your data.

Let's say you have a group of 5 users: Jane Doe, Sally May, Mark Guy, Phil Simmons, Phil Buttler.

Each record has a unique ID of some sort, regardless of whether you are using SQL or a NoSQL system; the record or object has a UUID.

Instead of representing the UUID "User IDs" as 1, 2, 3, 4, 5

you can convert them to 0.1, 0.2, 0.3, 0.4, 0.5

Not too bad right? :-)

If you wanted to process something that might have duplicates like the people's names or a city name, or state etc. you first select all the unique values (names, states) from the data set, assign it a floating value, then apply the value to the original record.

Example:

First Names: Jane, Sally, Mark, Phil, Phil

Unique Names: Jane, Sally, Mark, Phil

Converted to float: 0.1, 0.2, 0.3, 0.4

Jane = 0.1, Sally = 0.2, Mark = 0.3, Phil = 0.4

Original Dataset: Jane, Sally, Mark, Phil, Phil
Normalized Dataset: 0.1, 0.2, 0.3, 0.4, 0.4

Still not too bad right? :-)

All you are doing is taking unique values and giving them a common number to share, the number just happens to be between -1 and 1.
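The steps above can be sketched in a few lines of plain PHP (a minimal sketch; the example names and the divide-by-10 scale are placeholders for your own data):

```php
<?php
// Sketch: map each unique value to a float, then apply the
// mapping back to the original dataset.

$names = ['Jane', 'Sally', 'Mark', 'Phil', 'Phil'];

// Assign each unique name the next float: 0.1, 0.2, 0.3, ...
$map = [];
foreach (array_values(array_unique($names)) as $i => $name) {
    $map[$name] = ($i + 1) / 10;
}

// Apply the mapping to every original record.
$normalized = array_map(function ($name) use ($map) {
    return $map[$name];
}, $names);

print_r($map);        // Jane => 0.1, Sally => 0.2, Mark => 0.3, Phil => 0.4
print_r($normalized); // 0.1, 0.2, 0.3, 0.4, 0.4
```

With more than ten unique values you would divide by the count of unique values instead of 10, so everything stays within the 0 to 1 range.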

So you connect to your database, retrieve some object or record set, convert it depending on your needs, and then store it back in your database in a way that correlates it with the original value or record, as your circumstances may require.

Then, when building ANNs, you create the number of inputs and outputs you need and feed them the processed values.

The data returned by the ANN can be stored, used immediately as-is, or converted back to its 'human readable' form and then used.

A lot of the details of what you do with the data and how you train the neural networks can depend on your project goals and what you need the neural networks to do.

As to your second point:

"Secondly, the last thing I want to do is query a database, write to a file (SLOOOOOOW) for 5000 concurrent users, each one getting their own file. Being able to train from array or object would be really awesome, but I couldn't figure out where to modify in the source and do it myself yet."

I totally agree! That would be a terrible methodology! :-P A better approach would be to pre-process the data as much as possible, as early as possible: convert the data into floating point numbers before you need them and save them in the database.

Then, when you need it, you pull it from the DB. No disk/file writes; everything is done in memory, which is significantly faster.

The only reason why the examples that are included with PHP FANN use external files is to keep the examples simple. Perhaps we can get a DB example put together to include with PHP FANN. :-)
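Such a DB example might look roughly like this (a sketch only: the PDO connection string and the `users` table with pre-normalized `input1`, `input2`, and `expected` columns are made up for illustration; per the `fann_create_train_from_callback()` manual page, the callback is invoked once per record and returns an array with `input` and `output` keys):

```php
<?php
// Sketch: pull pre-normalized rows from a database and build
// training data entirely in memory, with no intermediate file.
// Assumes ext-fann is installed and a hypothetical 'users'
// table holding already-normalized float columns.

$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$rows = $pdo->query(
    'SELECT input1, input2, expected FROM users'
)->fetchAll(PDO::FETCH_ASSOC);

$train = fann_create_train_from_callback(
    count($rows), // number of training records
    2,            // inputs per record
    1,            // outputs per record
    function ($num, $num_input, $num_output) use ($rows) {
        // Called once per record with its index $num.
        return [
            'input'  => [(float) $rows[$num]['input1'],
                         (float) $rows[$num]['input2']],
            'output' => [(float) $rows[$num]['expected']],
        ];
    }
);

$ann = fann_create_standard(3, 2, 3, 1); // 2 inputs, 3 hidden, 1 output
fann_train_on_data($ann, $train, 500, 0, 0.001);
```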

If you need specific help, feel free to share your code and I would be happy to offer advice.

ghost commented 8 years ago

Sorry I didn't get back to this sooner, been busy on other stuff and github is not my primary tool, I log in about once every 2 months. I appreciate you taking the time to respond.

I think I wasn't as clear as I should have been: I'm not asking for FANN to normalize my data, just to connect to the resource. I can't create a resource myself due to PHP restrictions; I only get things like SQL or file. PHP-FANN seems to favor files, as the whole ANN concept seems to favor files. I have the distinct impression that most ANN usage happens in a pre-processed or post-processed environment. So from my experimentation with PHP-FANN, I have to create a file with my test data, then re-load that file for the actual run.

Your explanations on normalizing data for ANNs are very helpful and I can and will use them; I'm just not there yet. I just want to store the output from the training function in a faster medium that makes unique storage easier, and load that in similarly to create_from_file, but obviously not from a file. I want to get away from the whole file thing because I want to build something real-time that dynamically assesses new data as well as historical data using back propagation.
Thanks

ghost commented 8 years ago

The note you added to this page is exactly what I am looking for, http://php.net/manual/en/function.fann-create-train-from-callback.php

The only thing I would add here is that the example doesn't work immediately inside classes, but that is easily remedied by changing the callback parameter to an array of [$this, 'name_of_callback_method'].

Thanks for your assistance. I think the challenge here is that the docs rely heavily on a training data resource object without ever really getting into how it's created or where it comes from. So nearly every example on the web is train_from_file, as that's the easiest way to demonstrate. Unfortunately, the links in the PHP docs always link resource to the general PHP resource definition, which is limited. Instead, linking to fann_create_train_from_callback, or even create_train_from_file since it returns a valid resource as well, would have made it easier for me. Thanks again.
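For anyone landing here later, the class-based fix described above might look like this (a sketch; the `Trainer` class and `provideTrainingRow` method names are made up for illustration, and ext-fann is assumed when `buildTrainData()` is actually called):

```php
<?php
// Sketch: using an instance method as the training-data callback.
// Passing just 'provideTrainingRow' fails inside a class; the
// [$this, 'method'] callable form works.

class Trainer
{
    /** @var array Pre-normalized rows, e.g. loaded from a DB. */
    private $rows;

    public function __construct(array $rows)
    {
        $this->rows = $rows;
    }

    // Invoked once per record; returns that record's data.
    public function provideTrainingRow($num, $num_input, $num_output)
    {
        return [
            'input'  => $this->rows[$num]['input'],
            'output' => $this->rows[$num]['output'],
        ];
    }

    public function buildTrainData()
    {
        // The array-of-object-and-method form makes the
        // instance method a valid PHP callable.
        return fann_create_train_from_callback(
            count($this->rows), 2, 1,
            [$this, 'provideTrainingRow']
        );
    }
}
```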

bukka commented 7 years ago

Yeah, currently fann_create_train_from_callback is just the way to do that...