We have two possibilities:
For each "step" of the oracleSoFar, consider all the datapoints that refer to all the possible values of tokenClass (e.g., Arithmetical Operator, Parenthesis, Class, etc.). In this case, each datapoint must have a label (0 or 1) indicating whether that tokenClass is the correct one for the next token or not (as in the second model, only one datapoint has the value 1, while all the others have the value 0).
For each "step" of the oracleSoFar, consider a single datapoint, i.e., discard from the dataset all the tokenClasses whose label is not 1, keeping only the datapoint of the tokenClass corresponding to the token to be added to the oracleSoFar. In this case we don't need the label.
The first case considers both negative and positive examples: the label is concatenated to the input, and the result is a boolean value.
In the second case, instead, the training is performed only on positive examples: the label is not concatenated to the input, and the output corresponds to the correct token class.
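To make the two dataset constructions concrete, here is a minimal sketch. The field names (`input`, `label`, `target`) and the example oracle prefix are hypothetical, chosen only to illustrate the shape of the datapoints:

```python
def build_approach_1(oracle_so_far, candidate_classes, correct_class):
    """Approach 1: one datapoint per candidate tokenClass, labeled 1 only
    for the correct class (all others get 0)."""
    return [
        {"input": (oracle_so_far, token_class),
         "label": int(token_class == correct_class)}
        for token_class in candidate_classes
    ]

def build_approach_2(oracle_so_far, correct_class):
    """Approach 2: a single positive datapoint; the correct tokenClass
    itself is the prediction target, so no label is needed."""
    return {"input": oracle_so_far, "target": correct_class}

# One "step" of the oracleSoFar, with three candidate token classes:
step1 = build_approach_1(
    "assertTrue(",
    ["Arithmetical Operator", "Parenthesis", "Class"],
    "Class",
)
step2 = build_approach_2("assertTrue(", "Class")
```

Note how approach 1 multiplies the dataset size by the number of candidate token classes, while approach 2 keeps exactly one datapoint per step.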
The first training approach can, in principle, help the model generalize better, since it shows both positive and negative examples. However, the ratio of negative to positive examples should be balanced: with a highly imbalanced dataset (few positive examples and many negative ones), the model may become biased toward predicting the majority class. In such cases, techniques like oversampling or undersampling can be used to address the class imbalance.
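As a sketch of the oversampling idea, the following duplicates minority-class datapoints until both classes appear equally often. It assumes the hypothetical datapoint format above (a dict with a binary `label`); in practice a library such as imbalanced-learn offers more principled resampling:

```python
import random

def random_oversample(datapoints, seed=0):
    """Duplicate random minority-class datapoints until the two classes
    are balanced, then shuffle."""
    rng = random.Random(seed)
    pos = [d for d in datapoints if d["label"] == 1]
    neg = [d for d in datapoints if d["label"] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = datapoints + extra
    rng.shuffle(balanced)
    return balanced

# 1 positive vs. 9 negatives, as in approach 1 with ten candidate classes:
data = [{"label": 1}] + [{"label": 0}] * 9
balanced = random_oversample(data)
```

Undersampling would instead drop negatives; oversampling is usually preferred when positive examples are scarce, as here.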
The second approach is more common (in computer vision as well as in sentiment analysis, for example).
We can try both models and see which one performs better.
As for the first attempts, it is better to remove the sources of information that contribute the most to the length of the input datapoints (classJavadoc and classSourceCode).
Another option is to consider only the relevant tokens within the class (no Javadoc, only the source code, or only the signatures of the methods of the class). In this way we reduce the impact on the final length while keeping the most relevant information.
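As an illustration of the "signatures only" option, here is a rough regex-based sketch (not a full Java parser, so it will miss some edge cases) that keeps only method signatures from classSourceCode, dropping Javadoc and method bodies:

```python
import re

# Matches modifier(s) + return type + method name + parameter list.
# A real implementation should use a proper Java parser (e.g. JavaParser).
SIGNATURE_RE = re.compile(
    r"(?:public|protected|private|static|\s)+[\w<>\[\]]+\s+\w+\s*\([^)]*\)"
)

def extract_signatures(class_source_code):
    """Return only the method signatures, discarding Javadoc and bodies."""
    return [m.group(0).strip() for m in SIGNATURE_RE.finditer(class_source_code)]

src = """
public class Foo {
    /** Adds two ints. */
    public int add(int a, int b) { return a + b; }
    private static boolean isEven(int n) { return n % 2 == 0; }
}
"""
sigs = extract_signatures(src)
```

The input shrinks from the full class body to two short signature strings, which is exactly the length reduction the option aims for.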
The last option is to consider the hierarchical model (see the corresponding issue).
After the first preliminary generation of the tokens dataset, we have too many tokens: about 5 million, even without considering empty oracles. This is most likely due to the fact that some projects have many classes (e.g., Apache Collections and Apache Math have about 300 and 1000 classes, respectively), and class names are considered as possible tokens quite often.
This may be too much data to train the ML model, and noisy data at that, since class names are seldom tokens that actually belong to oracles.
A possible solution to this problem is to train two ML models:
The first ML model would be used to filter out many tokens for the second ML model, which would only consider tokens of the predicted type. Training both models is feasible with the current format of the tokens dataset; no changes are required.
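The two-stage pipeline could look like the sketch below. The model interfaces (`class_model`, `token_model`) and the candidate list are hypothetical stand-ins for the trained models; the point is only to show where the filtering happens:

```python
def predict_next_token(oracle_so_far, candidate_tokens, class_model, token_model):
    """Two-stage prediction: first predict the tokenClass of the next token,
    then rank only the candidate tokens belonging to that class."""
    # Stage 1: predict which tokenClass comes next (e.g. "Parenthesis").
    predicted_class = class_model(oracle_so_far)
    # Filter: keep only candidates of the predicted class. This is where the
    # millions of class-name tokens get discarded whenever the first model
    # predicts a different tokenClass.
    filtered = [tok for tok, cls in candidate_tokens if cls == predicted_class]
    # Stage 2: pick the best token among the (much smaller) filtered set.
    return token_model(oracle_so_far, filtered)

# Toy stand-ins for the two trained models:
class_model = lambda oracle: "Parenthesis"
token_model = lambda oracle, candidates: candidates[0]

candidates = [
    ("(", "Parenthesis"),
    ("ArrayList", "Class"),
    ("+", "Arithmetical Operator"),
]
next_token = predict_next_token("assertTrue", candidates, class_model, token_model)
```

Because stage 2 only ever sees tokens of one class, the class-name noise affects it only on the (relatively rare) steps where the first model actually predicts the Class type.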