DDMAL / jSymbolic2

2nd Version of jSymbolic

"NaN" and "Infinity" values still appearing in some features (especially the vertical ones) #20

Closed codaich closed 7 years ago

codaich commented 8 years ago

After extracting all current features from the SLAC and Bodhidharma data, there are still a few NaN and Infinity values in both the multi-dimensional and the one-dimensional features, although far fewer than before.

The causes of all of these need to be investigated and eliminated, as they cause major problems during machine learning.

In cases where there is no reasonable value (e.g. the "Minor Major Triad Ratio" feature when there are 0 major triads), the feature value must be automatically set to a reasonable default. I will make a list of the features where this behaviour occurs and will come up with appropriate default values.
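To illustrate the kind of guard this implies, here is a minimal sketch of a ratio helper that substitutes a default whenever the division yields NaN (0/0) or Infinity (x/0). The class and method names are hypothetical and not part of jSymbolic, and the fallback is left to the caller because the appropriate default for each feature has not been decided yet.

```java
// Hypothetical helper for illustration only; not part of jSymbolic's API.
final class SafeRatio {
    private SafeRatio() {}

    /**
     * Returns numerator / denominator, or the supplied fallback when the
     * result is NaN (e.g. 0/0) or infinite (e.g. 3/0).
     */
    static double ratioOrDefault(double numerator, double denominator, double fallback) {
        double result = numerator / denominator;
        if (Double.isNaN(result) || Double.isInfinite(result)) {
            return fallback; // caller decides what is musically reasonable
        }
        return result;
    }
}
```

For example, `SafeRatio.ratioOrDefault(minorTriads, majorTriads, 0.0)` would never emit NaN or Infinity; the `0.0` here is only a placeholder, not a recommended default.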

All such default values should be documented in the manual and in the class docs and description field for each feature. Some features already have defaults that are not documented, so documentation must be added for them as well.

dinamix1 commented 8 years ago

I will look into this piece by piece as the bugs may be edge cases for certain pieces. Are there any in particular that you noticed?

codaich commented 8 years ago

Just do a batch extraction of the 250 SLAC files I sent you earlier with all features enabled and then look at the CSV file. You'll see where the NaNs and infinities are occurring (the CSV files now include instance file name identifiers in the first column, so you'll be able to tell which files they belong to).
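Rather than scanning the CSV by eye, a small throwaway scanner along these lines could list the offending cells. This is a sketch under stated assumptions: the first row is a header of feature names, the first column holds the file name, cells contain no quoted commas, and non-finite values are written the way Java prints them ("NaN", "Infinity", "-Infinity"). The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical utility for illustration only; not part of jSymbolic.
final class NonFiniteScanner {
    private NonFiniteScanner() {}

    /** Returns one "file: feature = value" entry per NaN or infinite cell. */
    static List<String> findNonFinite(List<String> csvLines) {
        List<String> hits = new ArrayList<>();
        if (csvLines.isEmpty()) {
            return hits;
        }
        // Assumption: row 0 is the header, column 0 is the file identifier.
        String[] header = csvLines.get(0).split(",", -1);
        for (int row = 1; row < csvLines.size(); row++) {
            String[] cells = csvLines.get(row).split(",", -1);
            for (int col = 1; col < cells.length; col++) {
                double value;
                try {
                    value = Double.parseDouble(cells[col]);
                } catch (NumberFormatException e) {
                    continue; // skip empty or non-numeric cells
                }
                if (Double.isNaN(value) || Double.isInfinite(value)) {
                    String feature = col < header.length ? header[col] : "column " + col;
                    hits.add(cells[0] + ": " + feature + " = " + cells[col].trim());
                }
            }
        }
        return hits;
    }
}
```

The lines can be loaded with `java.nio.file.Files.readAllLines(...)` on the extracted CSV and passed straight to `findNonFinite`.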

codaich commented 8 years ago

As a related side note, all features should default to -1 (for 1-D features) or null (for multi-dimensional features) if there is an error in the input or no input data (e.g. sequence_info == null). We can also use this when there is no other reasonable default value, but it should generally be avoided if a more musically reasonable default can be found.
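A rough sketch of that convention follows, using two made-up features rather than real jSymbolic ones. The class, constant, and method names are hypothetical. Treating an empty array or an all-zero histogram as "no input data" is an assumption of this sketch, not something stated above.

```java
// Hypothetical illustration of the -1 / null convention; not jSymbolic code.
final class InputGuard {
    /** Default for 1-D features when the input is missing or invalid. */
    static final double MISSING_INPUT_1D = -1.0;

    private InputGuard() {}

    /** Made-up 1-D feature: mean of the values, or -1 if there is no input. */
    static double meanOrDefault(double[] values) {
        if (values == null || values.length == 0) { // assumption: empty == no input
            return MISSING_INPUT_1D;
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Made-up multi-dimensional feature: counts normalised to sum to 1, or null. */
    static double[] normalisedOrNull(int[] counts) {
        if (counts == null) {
            return null;
        }
        long total = 0;
        for (int c : counts) {
            total += c;
        }
        if (total == 0) {
            return null; // assumption: dividing by zero here would produce NaN
        }
        double[] result = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            result[i] = (double) counts[i] / total;
        }
        return result;
    }
}
```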

codaich commented 8 years ago

The two most problematic features in this sense (though there are others too) are Note_Density_Variability and Minor_Major_Triad_Ratio.

codaich commented 8 years ago

Don't worry about this Issue, @dinamix, I'll take care of it for now, since it meshes in with the general feature checking / fixing that I'm doing.

codaich commented 7 years ago

Fixed in Commit [d8aaeea], as well as in a few earlier commits. Issue closed.