ASSERT-KTH / CodRep

58069 Java source code diffs. http://arxiv.org/pdf/1807.03200

Participant #4: @tdurieux, INRIA #16

Closed. tdurieux closed this 1 year ago.

tdurieux commented 6 years ago

Hi all,

I just did a quick naive solution based on string distance:

| Dataset | Perfect Match | In Top 10 | Recall | Loss |
|---|---|---|---|---|
| Bench 1 | 3791 | 4322 (98%) | 0.86 | 0.13615878141899027 |
| Bench 2 | 9910 | 10805 (97%) | 0.89 | 0.10263617900182995 |

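
A string-distance ranker of this kind can be sketched in a few lines. This is an illustrative reconstruction using Python's stdlib `difflib`, not tdurieux's actual code; the choice of similarity function and the 1-based indexing are assumptions:

```python
from difflib import SequenceMatcher

def rank_lines(patch_line, file_lines):
    """Rank the lines of a file by textual similarity to the proposed
    replacement line (higher ratio = more similar)."""
    scored = [
        (SequenceMatcher(None, patch_line, line).ratio(), i)
        for i, line in enumerate(file_lines, start=1)
    ]
    # Most similar line first; its 1-based index is the predicted position.
    return sorted(scored, reverse=True)

lines = ["int a = 0;", "int b = 1;", "return a + b;"]
best_score, best_line = rank_lines("int b = 2;", lines)[0]
```

The top of the ranking gives the "Perfect Match" prediction; the first ten entries give the "In Top 10" column.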
monperrus commented 6 years ago

CodRep is like CandyCrush, there are plenty of easy tasks at the beginning but it is super hard at the end :-)

tdurieux commented 6 years ago

Hey @monperrus,

The ranking does not contain my latest score (see the Google group); you put the old results with this commit: 4b3af7a5261d669c8de025bdb9e7ae83329c3d2e

Dataset 1: 0.1143165305556653 Dataset 2: 0.08536293667171425

monperrus commented 6 years ago

could you do a pull request?

monperrus commented 6 years ago

Thanks for the pull request to update your score!

monperrus commented 6 years ago

the README is now up to date with your top scores!

tdurieux commented 6 years ago

@chenzimin @monperrus my result for the new bench:

| Dataset | Perfect Match | In Top 10 | Loss |
|---|---|---|---|
| Dataset 3 | 17410/18633 (93%) | 18438 (98%) | 0.06407823026301902 |

chenzimin commented 6 years ago

@tdurieux Good job, I have updated the ranking accordingly.

tdurieux commented 6 years ago

I did a small update to my project; here are my results. (I have to keep the best score, haha :smile:)

| Dataset | Perfect Match | In Top 10 | Loss |
|---|---|---|---|
| Dataset 1 | 3934 | 4334 (98%) | 0.10400918447628195 |
| Dataset 2 | 10123 | 10850 (98%) | 0.08405196035050681 |
| Dataset 3 | 17520 | 18454 (99%) | 0.05824092991175515 |
| Dataset 4 | 15786 | 16936 (98%) | 0.07700153400317002 |

@cesarsotovalero is still better on dataset 4

Execution time: < 1min per dataset

Interesting fact: when the same line appears several times in the file, it is better to select the last occurrence. I have no clue why. WDYT?
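
The heuristic is easy to implement: break score ties toward the later line. A minimal sketch, where the `score` callback is a placeholder for any similarity function (not the actual tool's code):

```python
def predict_line(patch_line, file_lines, score):
    """Return the 1-based index of the best-scoring line; among equally
    scored candidates, prefer the LAST occurrence (empirically better
    on CodRep, per this thread)."""
    best_idx, best_score = None, float("-inf")
    for i, line in enumerate(file_lines, start=1):
        s = score(patch_line, line)
        if s >= best_score:  # '>=' keeps the later duplicate on ties
            best_idx, best_score = i, s
    return best_idx
```

With an exact-match scorer and a file containing the same line twice, this picks the second occurrence.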

monperrus commented 6 years ago

> Interesting fact: when the same line appears several times in the file, it is better to select the last occurrence.

Indeed, fun.

> I have no clue why. WDYT?

One idea: because code tends to be added at the end of a file, and new code tends to be buggier than old code.

tdurieux commented 6 years ago

My final results:

| Dataset | Perfect Match | In Top 10 | Loss |
|---|---|---|---|
| Dataset 1 | 3967 | 4346 (98%) | 0.09654287550022056 |
| Dataset 2 | 11069 | 10908 (98%) | 0.07930336605281313 |
| Dataset 3 | 17753 | 18498 (99%) | 0.04631417984476575 |
| Dataset 4 | 16074 | 17031 (99%) | 0.06053843584998035 |

monperrus commented 6 years ago

not bad!

tdurieux commented 6 years ago

I don't think we can do much better with my approach: I achieved more than 90% perfect predictions on the 4 benches. I think it is possible to do better with ML using the features I used.

jjhenkel commented 6 years ago

Hi @tdurieux,

Congrats on the smart features and overall technique! After you released your code I went ahead and made some small modifications so that I could test the performance of a hybrid model that uses your features (plus some more simple text-distance features) and our learning-to-rank approach.

I also did a small grid-search over some hyper-parameters of this model to get a better understanding of what parameters work well.

A model with the following hyper-parameters trained for a relatively short time over your features can produce an incremental improvement on the 5th dataset:

```
# Hyper-parameters / Model
# Leaves       128
# Trees        2000
# Min. Support 8
# Stop At      30
# Model        LambdaMART
```
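
For reference, these settings roughly correspond to a RankLib command line like the following; the feature-file names are hypothetical, and `-ranker 6` selects LambdaMART:

```shell
# Hypothetical RankLib invocation matching the settings above.
# -tree/-leaf set model capacity; -mls is minimum leaf support;
# -estop is the number of early-stopping rounds on the validation set.
java -jar RankLib.jar -train features-train.txt -validate features-val.txt \
     -ranker 6 -tree 2000 -leaf 128 -mls 8 -estop 30 -save model.txt
```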

```
root@442d3980f24a:/> python3 /app/src/guesser.py /tmp/scores.txt /data/Dataset5/Tasks | python3 /app/src/evaluate.py -d /data/Dataset5
Total files: 18366
Average line error: 0.07180536507565463 (the lower, the better)
Recall@1: 0.9273657846019819 (the higher, the better)
```
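
The two numbers printed by `evaluate.py` can be sketched as follows. The per-task error here is `tanh` of the absolute distance between predicted and correct line numbers, which is my reading of the CodRep paper's loss; treat the exact formula as an assumption:

```python
import math

def average_line_error(predicted, correct):
    """Mean per-task error; a task's error grows with the distance between
    the predicted and the correct line (tanh keeps it in [0, 1))."""
    errs = [math.tanh(abs(p - c)) for p, c in zip(predicted, correct)]
    return sum(errs) / len(errs)

def recall_at_1(predicted, correct):
    """Fraction of tasks whose top-ranked line is exactly the correct one."""
    return sum(p == c for p, c in zip(predicted, correct)) / len(predicted)
```

A perfect predictor has loss 0.0 and Recall@1 of 1.0; a prediction one line off already costs about 0.76 on that task.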

I think this could be pushed even further by offering more features (essentially all the features we both used) and perhaps increasing the capacity of the model further. (From my grid-search it seems that, in general, incremental increases in the number of leaves allowed in the tree improve performance on the validation set. To avoid overfitting I've switched from training/validating on datasets 1,2,3, and 4 to training just on datasets 1,2,3 and validating on the 4th set.)

tdurieux commented 6 years ago

Hi @jjhenkel,

It is really nice that you looked at my features (sorry, they are a little bit messy) and succeeded in creating a model from them. It is impressive.

Is there a way for you to know which features are more important for the prediction?

For feature extraction, I extract a high-level AST of each line (basically the type of each code element: VARIABLE, CLASS, STRING, COMMENT, NUMBER, TOKEN, KEYWORD). Do you think it could be used in a model?
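
A crude per-line categorization along these lines can be sketched with regexes; this is only an illustration of the idea (the category names come from the comment above, but the patterns are my own guesses, not tdurieux's extractor):

```python
import re

# Ordered alternation: earlier categories win on ties at the same position.
CATEGORIES = [
    ("COMMENT",  r"//.*|/\*.*?\*/"),
    ("STRING",   r'"(?:\\.|[^"\\])*"'),
    ("NUMBER",   r"\b\d+(?:\.\d+)?\b"),
    ("KEYWORD",  r"\b(?:class|public|private|static|return|if|else|for|while|new|int|void)\b"),
    ("VARIABLE", r"\b[a-z]\w*\b"),
    ("CLASS",    r"\b[A-Z]\w*\b"),
    ("TOKEN",    r"[{}()\[\];=+\-*/<>.,]"),
]

def line_shape(line):
    """Map a Java line to its sequence of high-level token categories."""
    pattern = "|".join(f"(?P<{name}>{rx})" for name, rx in CATEGORIES)
    return [m.lastgroup for m in re.finditer(pattern, line)]
```

Two lines with the same "shape" are then comparable even when their identifiers differ, which is exactly the kind of signal a ranking model could consume.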

jjhenkel commented 6 years ago

Hi @tdurieux,

Feature importance is a good question! I'm not entirely sure yet; I think RankLib has a set of tools for looking at the importance of individual features (for certain supported models), but I haven't tried them yet.

The approximate parse you do for each line is a nice asset! It could be used to create some set-similarity-based features. There are also techniques that could take advantage of that information while avoiding any hand-designed features/filters. I haven't investigated any of that seriously yet, but there has been some interesting work on embedding edits that could maybe be applied to this challenge.
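
As a concrete example of a set-similarity feature: assuming each line has been reduced to a set of token categories (or tokens), a Jaccard score between the patch line and a candidate line is a one-liner; the names here are purely illustrative:

```python
def jaccard(a, b):
    """Set similarity between two token collections (1.0 = identical sets)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0
```

Each candidate line's Jaccard score against the patch line would become one more column in the learning-to-rank feature matrix.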

monperrus commented 6 years ago

Thanks Jordan for the follow-up, as Thomas said, that's indeed really interesting.

> there's been some interesting work on embedding edits

Which papers are you thinking of?

jjhenkel commented 6 years ago

This paper is the one I'm thinking of: https://arxiv.org/abs/1810.13337. The embedded edits could be fed directly to the learning-to-rank algorithms, or to some other neural architecture, and in doing so any hand-made features could be avoided. No idea if this would work, but it's an interesting approach I've wanted to try.

monperrus commented 6 years ago

This paper is on my radar! We need an implementation of it for Java patches.

tdurieux commented 5 years ago

@monperrus you should update the ranking with the new best score ;)

chenzimin commented 5 years ago

@tdurieux Done!