cazala / synaptic

architecture-free neural network library for node.js and the browser

http://caza.la/synaptic

Normalising ids #221

Closed dan-ryan closed 7 years ago

dan-ryan commented 7 years ago

I'm new to neural networks, so I'm wondering if I have the right idea. My training data has IDs, which are 32-bit numbers. To normalise them, should I do: var normalisedId = id / Number.MAX_SAFE_INTEGER?

wagenaartje commented 7 years ago

No, I don't think you should do that. As I see it, you have two options:

1. Are the IDs really necessary? Do they have an effect on the output? If they don't, leave them out of the input. If you are working with some kind of time series, use an LSTM; you can then discard the IDs as well.
2. If you really do need an ID to get the expected output, don't divide it by the maximum integer; divide it by the highest ID in the training data.
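
As a rough illustration of the second option, the sketch below scans the training rows for the highest ID and divides every ID by it. The row shape and the field names (batterId, bowlerId) are made up for the example and are not part of synaptic.

```javascript
// Hypothetical training rows; the field names are illustrative only.
const trainingRows = [
  { batterId: 12, bowlerId: 40 },
  { batterId: 7, bowlerId: 25 },
  { batterId: 31, bowlerId: 8 },
];

// Find the largest id that occurs in the given fields of the training data.
function highestId(rows, fields) {
  let max = 0;
  for (const row of rows) {
    for (const field of fields) {
      if (row[field] > max) max = row[field];
    }
  }
  return max;
}

// Scale an id into the range [0, 1] by dividing by that maximum.
function normaliseId(id, maxId) {
  return id / maxId;
}

const maxId = highestId(trainingRows, ['batterId', 'bowlerId']);
const inputs = trainingRows.map((row) => [
  normaliseId(row.batterId, maxId),
  normaliseId(row.bowlerId, maxId),
]);

console.log(maxId, inputs[0]); // 40 [ 0.3, 1 ]
```

The largest ID in the data maps to exactly 1, so any ID that first appears after training and is larger than maxId would fall outside the range the network has seen.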

dan-ryan commented 7 years ago

Thanks for the answer. I'm predicting a batsman's score in a cricket game, so I feel IDs would make it more accurate (though I have yet to test this). I'm using an LSTM. The data I'm working with is ball-by-ball:

batter id, bowler id, opposite batter id, inning order, over order, points count, batting team points, is legal delivery, legal delivery order, first team id, second team id, total batter points.

I'm actually currently dividing by the highest ID in the training data. But to make this more practical, I won't know how high the IDs will go, as new players are created all the time. Would using the maximum 32-bit signed integer (2,147,483,647) be an issue? It is a lot less than Number.MAX_SAFE_INTEGER.
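
To put the candidate divisors in perspective, here is a small sketch of the arithmetic. The ID value and the training-set maximum are invented for the example; only the two integer limits are fixed facts.

```javascript
const id = 1234;               // hypothetical player id
const trainingMax = 20000;     // assumed highest id in the training data
const int32Max = 2147483647;   // 2^31 - 1

const byTrainingMax = id / trainingMax;
const byInt32Max = id / int32Max;
const bySafeMax = id / Number.MAX_SAFE_INTEGER; // 2^53 - 1

console.log(byTrainingMax); // 0.0617
console.log(byInt32Max);    // on the order of 1e-7
console.log(bySafeMax);     // on the order of 1e-13

// Gap between two neighbouring ids after normalisation.
const gapTrainingMax = (id + 1) / trainingMax - byTrainingMax;
const gapInt32Max = (id + 1) / int32Max - byInt32Max;
console.log(gapTrainingMax > gapInt32Max); // true
```

With the larger divisors, every realistic ID lands very close to zero and neighbouring IDs become nearly indistinguishable as inputs.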

wagenaartje commented 7 years ago

OK, I see, your IDs certainly matter then. I'm not sure whether choosing a very high maximum ID affects your network's ability to learn (the normalised inputs become very small numbers), but I suppose it will have some effect.

You should choose a maximum that you know will never be exceeded but that is not too large. For example, you know there will never be as many as 2,147,483,647 teams or players. I'm not really into cricket, but I assume there are no more than 1,000 teams and 20,000 players (at least among those that appear in your current or future training data).
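
One way this advice might look in code is a fixed cap per kind of ID. This is only a sketch: the caps reuse the rough guesses from the comment above, and the function name and the shape of the ball object are hypothetical.

```javascript
// Assumed upper bounds, taken from the rough estimates above; adjust to your data.
const MAX_TEAM_ID = 1000;
const MAX_PLAYER_ID = 20000;

// Divide an id by a fixed cap, rejecting ids outside the expected range.
function normaliseWithCap(id, cap) {
  if (!Number.isInteger(id) || id < 0 || id > cap) {
    throw new RangeError(`id ${id} is outside the expected range 0..${cap}`);
  }
  return id / cap;
}

// One ball of (hypothetical) data, reduced to its id fields.
const ball = { batterId: 1520, bowlerId: 18400, firstTeamId: 12, secondTeamId: 250 };

const idInputs = [
  normaliseWithCap(ball.batterId, MAX_PLAYER_ID),
  normaliseWithCap(ball.bowlerId, MAX_PLAYER_ID),
  normaliseWithCap(ball.firstTeamId, MAX_TEAM_ID),
  normaliseWithCap(ball.secondTeamId, MAX_TEAM_ID),
];

console.log(idInputs); // [ 0.076, 0.92, 0.012, 0.25 ]
```

Using separate caps for teams and players keeps both kinds of ID spread over the 0 to 1 range, instead of squeezing the team IDs into the bottom few percent of a shared scale.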

dan-ryan commented 7 years ago

Thanks for the advice. I'll choose a number that I don't believe can be reached. If the IDs get close to it, the code will emit warnings so that someone can fix it later.
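
A minimal sketch of that plan, assuming a cap of 20,000 and a warning threshold at 90% of the cap. Both numbers and the function name are arbitrary choices for illustration.

```javascript
const ID_CAP = 20000;    // assumed limit that ids should never reach
const WARN_RATIO = 0.9;  // warn once ids pass 90% of the cap (arbitrary)

// Normalise an id against the cap, warning when the cap is getting close.
// The warn callback defaults to console.warn but can be swapped out.
function normaliseAndCheck(id, cap = ID_CAP, warn = console.warn) {
  if (id > cap) {
    throw new RangeError(`id ${id} exceeds the cap ${cap}`);
  }
  if (id >= cap * WARN_RATIO) {
    warn(`id ${id} is within ${cap - id} of the cap ${cap}`);
  }
  return id / cap;
}

console.log(normaliseAndCheck(1500));  // 0.075, no warning
console.log(normaliseAndCheck(19500)); // 0.975, preceded by a warning
```

Raising the cap later changes every normalised value, so a network trained with the old cap would probably need to be retrained.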