accord-net / framework

Machine learning, computer vision, statistics and general scientific computing for .NET
http://accord-framework.net
GNU Lesser General Public License v2.1

Learn without codification #840

Open ConductedClever opened 7 years ago

ConductedClever commented 7 years ago


Issue description

Hi. My learning data is a matrix like this:

[
  [
    "REPLACEMENT",
    "AORB",
    "%",
    "*",
    "25",
    "25",
    "0",
    "0",
    "0",
    "0",
    "105",
    "null",
    "BlockStatementContext-StatementContext-ParExpressionContext-ExpressionContext-ExpressionContext-PrimaryContext-ExpressionContext-PrimaryContext-LiteralContext-",
    "BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-ParExpressionContext-ExpressionContext-ExpressionContext-PrimaryContext-ExpressionContext-PrimaryContext-LiteralContext-",
    "StatementContext-BlockContext-BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-ParExpressionContext-ExpressionContext-ExpressionContext-PrimaryContext-ExpressionContext-PrimaryContext-LiteralContext-",
    "BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-ParExpressionContext-ExpressionContext-ExpressionContext-PrimaryContext-ExpressionContext-PrimaryContext-LiteralContext-",
    "MethodBodyContext-BlockContext-BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-ParExpressionContext-ExpressionContext-ExpressionContext-PrimaryContext-ExpressionContext-PrimaryContext-LiteralContext-",
    "ClassBodyDeclarationContext-MemberDeclarationContext-MethodDeclarationContext-MethodBodyContext-BlockContext-BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-ParExpressionContext-ExpressionContext-ExpressionContext-PrimaryContext-ExpressionContext-PrimaryContext-LiteralContext-",
    "ClassBodyContext-ClassBodyDeclarationContext-MemberDeclarationContext-MethodDeclarationContext-MethodBodyContext-BlockContext-BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-ParExpressionContext-ExpressionContext-ExpressionContext-PrimaryContext-ExpressionContext-PrimaryContext-LiteralContext-",
    "CompilationUnitContext-TypeDeclarationContext-ClassDeclarationContext-ClassBodyContext-ClassBodyDeclarationContext-MemberDeclarationContext-MethodDeclarationContext-MethodBodyContext-BlockContext-BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-ParExpressionContext-ExpressionContext-ExpressionContext-PrimaryContext-ExpressionContext-PrimaryContext-LiteralContext-",
    "null"
  ],
  [
    "REPLACEMENT",
    "AORB",
    "%",
    "/",
    "25",
    "25",
    "0",
    "0",
    "0",
    "0",
    "105",
    "null",
    "BlockStatementContext-StatementContext-ParExpressionContext-ExpressionContext-ExpressionContext-PrimaryContext-ExpressionContext-PrimaryContext-LiteralContext-",
    "BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-ParExpressionContext-ExpressionContext-ExpressionContext-PrimaryContext-ExpressionContext-PrimaryContext-LiteralContext-",
    "StatementContext-BlockContext-BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-ParExpressionContext-ExpressionContext-ExpressionContext-PrimaryContext-ExpressionContext-PrimaryContext-LiteralContext-",
    "BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-ParExpressionContext-ExpressionContext-ExpressionContext-PrimaryContext-ExpressionContext-PrimaryContext-LiteralContext-",
    "MethodBodyContext-BlockContext-BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-ParExpressionContext-ExpressionContext-ExpressionContext-PrimaryContext-ExpressionContext-PrimaryContext-LiteralContext-",
    "ClassBodyDeclarationContext-MemberDeclarationContext-MethodDeclarationContext-MethodBodyContext-BlockContext-BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-ParExpressionContext-ExpressionContext-ExpressionContext-PrimaryContext-ExpressionContext-PrimaryContext-LiteralContext-",
    "ClassBodyContext-ClassBodyDeclarationContext-MemberDeclarationContext-MethodDeclarationContext-MethodBodyContext-BlockContext-BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-ParExpressionContext-ExpressionContext-ExpressionContext-PrimaryContext-ExpressionContext-PrimaryContext-LiteralContext-",
    "CompilationUnitContext-TypeDeclarationContext-ClassDeclarationContext-ClassBodyContext-ClassBodyDeclarationContext-MemberDeclarationContext-MethodDeclarationContext-MethodBodyContext-BlockContext-BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-StatementContext-BlockContext-BlockStatementContext-StatementContext-ParExpressionContext-ExpressionContext-ExpressionContext-PrimaryContext-ExpressionContext-PrimaryContext-LiteralContext-",
    "null"
  ],
...
]

When I codify the data, each whole string gets its own integer. For example, BlockStatementContext-StatementContext-ParExpressionContext-ExpressionContext- is converted to 1, ExpressionContext- to 2, and BlockStatementContext-StatementContext-ParExpressionContext- to 3 (just examples). The relation between the values is lost: although 1 and 3 are different, they are more similar to each other than either is to 2.
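To make the concern concrete, here is a minimal Python sketch. It is not Accord.NET code; `codify_whole` and `jaccard` are hypothetical helpers, and assigning labels in first-seen order starting at 1 is an assumption made only to mirror the numbers in the example. It shows that whole-string labels carry no information about shared words, while a simple token-overlap measure does.

```python
def codify_whole(values):
    """Assign one integer label (starting at 1) per distinct string, in first-seen order."""
    codebook = {}
    for v in values:
        codebook.setdefault(v, len(codebook) + 1)
    return codebook


def jaccard(a, b, sep="-"):
    """Token-set overlap of two separator-joined paths (1.0 means identical token sets)."""
    ta = {t for t in a.split(sep) if t}
    tb = {t for t in b.split(sep) if t}
    return len(ta & tb) / len(ta | tb)


p1 = "BlockStatementContext-StatementContext-ParExpressionContext-ExpressionContext-"
p2 = "ExpressionContext-"
p3 = "BlockStatementContext-StatementContext-ParExpressionContext-"

codes = codify_whole([p1, p2, p3])  # labels 1, 2 and 3: arbitrary identifiers
sim_13 = jaccard(p1, p3)            # the two paths share three of four distinct tokens
sim_12 = jaccard(p1, p2)            # the two paths share one of four distinct tokens
```

The integer labels only say that the three strings differ; the overlap scores are what reflect that the first and third paths are closer to each other than the first and second.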

Is there any way to take this into account (for example in NaiveBayesianLearning)?

Thanks in advance.

cesarsouza commented 7 years ago

Hi @ConductedClever,

You might have to pre-parse your input to make it more recognizable to the codification algorithm. For example, instead of passing long sequences like "BlockStatementContext-StatementContext-ParExpressionContext-ExpressionContext" ... as single values, consider first splitting them on the "-" separator. That way each individual word receives an integer label, and the same word gets the same label wherever it appears, rather than each entire sequence of words receiving a separate label. If you do it this way, you should end up with sequences of symbols of varying length (rather than the fixed 2D matrix of symbols that would be expected in a normal feature classification problem).
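The splitting step can be sketched in a few lines of Python. This is an illustration only and does not use the Accord.NET Codification filter; `tokenize`, `build_vocabulary` and `encode` are hypothetical names, and numbering labels from 0 in first-seen order is an assumption.

```python
def tokenize(path, sep="-"):
    """Split a path on the separator and drop the empty piece left by the trailing '-'."""
    return [t for t in path.split(sep) if t]


def build_vocabulary(paths):
    """Map each distinct word to an integer label, in first-seen order."""
    vocab = {}
    for path in paths:
        for tok in tokenize(path):
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def encode(path, vocab):
    """Turn one path into a variable-length sequence of integer labels."""
    return [vocab[t] for t in tokenize(path)]


paths = [
    "BlockStatementContext-StatementContext-ParExpressionContext-ExpressionContext-",
    "ExpressionContext-",
    "BlockStatementContext-StatementContext-ParExpressionContext-",
]
vocab = build_vocabulary(paths)
encoded = [encode(p, vocab) for p in paths]
# encoded == [[0, 1, 2, 3], [3], [0, 1, 2]]
```

Because the vocabulary is shared, the first and third sequences now visibly share a prefix, and the sequences have different lengths.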

If you do this, you will end up not with an int[][] matrix of symbols but with an int[][][] array. I am under the impression that the Codification filter might support this case out of the box, but I am not completely sure right now. If it does not, please let me know. In the worst case, it should be possible to apply the codification filter to each of the sentences in your dataset individually, so that in the end you can retrieve the data as an int[][][] instead of the int[][] that would be expected in a simple classification problem.
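One possible reading of the int[][][] layout is rows × columns × tokens, sketched below in plain Python with nested lists. This is an assumption about the intended shape rather than a description of what the Codification filter produces. `encode_rows` is a hypothetical helper, and the two rows are shortened stand-ins built from values that appear in the sample data.

```python
def encode_rows(rows, sep="-"):
    """Encode a table of strings as nested lists indexed by [row][column][token]."""
    vocab = {}
    encoded = []
    for row in rows:
        encoded_row = []
        for cell in row:
            tokens = [t for t in cell.split(sep) if t]
            # setdefault returns the existing label or registers the next free one
            encoded_row.append([vocab.setdefault(t, len(vocab)) for t in tokens])
        encoded.append(encoded_row)
    return encoded, vocab


rows = [
    ["REPLACEMENT", "*", "BlockStatementContext-StatementContext-"],
    ["REPLACEMENT", "/", "StatementContext-BlockContext-"],
]
encoded, vocab = encode_rows(rows)
# encoded[0] == [[0], [1], [2, 3]]
# encoded[1] == [[0], [4], [3, 5]]
```

The sketch uses a single vocabulary for every column for brevity. A separate vocabulary per column may be preferable when the same text means different things in different columns.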

Once you have this int[][][] representation, you might want to take a look at Dynamic Time Warp SVMs, Hidden Markov Model Classifiers or Hidden Conditional Random Fields to classify those sequences of symbols into different class labels, if that is your end goal. Please take a look at the bottom of the documentation pages for these methods for examples of how this could be done.
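To give a feel for why a sequence-aware method helps, here is a minimal sketch of the textbook dynamic time warping distance between two label sequences. It is not Accord.NET's implementation and not an SVM kernel; `dtw_distance` is a hypothetical helper, and the 0/1 mismatch cost is an assumption chosen because the labels are categorical.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two symbol sequences with a 0/1 mismatch cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # d[i][j] holds the cheapest cost of aligning a[:i] with b[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else 1.0
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


long_path = [0, 1, 2, 3]   # four distinct words
short_path = [3]           # only the last word of long_path
prefix_path = [0, 1, 2]    # the first three words of long_path

near = dtw_distance(long_path, prefix_path)  # 1.0
far = dtw_distance(long_path, short_path)    # 3.0
```

With sequences of labels, the path that shares a three-word prefix ends up closer than the one that shares a single word, which is the relation that whole-string labels could not express.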

I hope it helps!

Regards, Cesar