tdurieux closed this issue 1 year ago
CodRep is like CandyCrush, there are plenty of easy tasks at the beginning but it is super hard at the end :-)
Hey @monperrus,
The ranking does not contain my latest score (see the Google group); you put the old results back with this commit: 4b3af7a5261d669c8de025bdb9e7ae83329c3d2e
Dataset 1: 0.1143165305556653
Dataset 2: 0.08536293667171425
could you do a pull request?
Thanks for the pull request to update your score!
The README is now up to date with your top scores!
@chenzimin @monperrus my result for the new bench:
Dataset | Perfect Match | In Top 10 | Loss |
---|---|---|---|
Dataset 3 | 17410/18633 (93%) | 18438 (98%) | 0.06407823026301902 |
@tdurieux Good job, I have updated the ranking accordingly.
I did a small update on my project; here are my results. (I have to keep the best score, haha :smile:)
Dataset | Perfect Match | In Top 10 | Loss |
---|---|---|---|
Dataset 1 | 3934 | 4334 (98%) | 0.10400918447628195 |
Dataset 2 | 10123 | 10850 (98%) | 0.08405196035050681 |
Dataset 3 | 17520 | 18454 (99%) | 0.05824092991175515 |
Dataset 4 | 15786 | 16936 (98%) | 0.07700153400317002 |
@cesarsotovalero is still better on dataset 4.
Execution time: < 1min per dataset
Interesting fact: when the same line appears several times in the file, it is better to select the last occurrence. I have no clue why. WDYT?
> Interesting fact: when the same line appears several times in the file, it is better to select the last occurrence.

Indeed fun!

> I have no clue why. WDYT?

One idea: because code tends to be added to the end of the file, and new code tends to be more buggy than old code.
My final results:
Dataset | Perfect Match | In Top 10 | Loss |
---|---|---|---|
Dataset 1 | 3967 | 4346 (98%) | 0.09654287550022056 |
Dataset 2 | 11069 | 10908 (98%) | 0.07930336605281313 |
Dataset 3 | 17753 | 18498 (99%) | 0.04631417984476575 |
Dataset 4 | 16074 | 17031 (99%) | 0.06053843584998035 |
not bad!
I don't think we can do much better with my approach; I achieved more than 90% perfect prediction on the 4 benchmarks. It may be possible to do better with ML using the features I used.
Hi @tdurieux,
Congrats on the smart features and overall technique! After you released your code I went ahead and made some small modifications so that I could test the performance of a hybrid model that uses your features (plus some more simple text-distance features) and our learning-to-rank approach.
I also did a small grid-search over some hyper-parameters of this model to get a better understanding of what parameters work well.
A model with the following hyper-parameters trained for a relatively short time over your features can produce an incremental improvement on the 5th dataset:
- Leaves: 128
- Trees: 2000
- Min. Support: 8
- Stop At: 30
- Model: LambdaMART
```
root@442d3980f24a:/> python3 /app/src/guesser.py /tmp/scores.txt /data/Dataset5/Tasks | python3 /app/src/evaluate.py -d /data/Dataset5
Total files: 18366
Average line error: 0.07180536507565463 (the lower, the better)
Recall@1: 0.9273657846019819 (the higher, the better)
```
I think this could be pushed even further by offering more features (essentially all the features we both used) and perhaps increasing the capacity of the model further. (From my grid-search it seems that, in general, incremental increases in the number of leaves allowed in the tree improve performance on the validation set. To avoid overfitting I've switched from training/validating on datasets 1,2,3, and 4 to training just on datasets 1,2,3 and validating on the 4th set.)
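For readers who want to try something similar without RankLib, here is a rough sketch of an analogous setup using LightGBM's lambdarank objective in place of RankLib's LambdaMART. The feature matrix `X`, the labels `y` (1 for the correct line, 0 otherwise), and the per-task group sizes are assumed to be built elsewhere from the hand-made features discussed in this thread; the hyper-parameters are only loose analogues of the ones listed above.

```python
# A stand-in for the RankLib LambdaMART setup above, using LightGBM's
# lambdarank objective. X, y, and group are assumed to be prepared from
# per-candidate-line features; this is a sketch, not the original setup.
import lightgbm as lgb
import numpy as np

def train_ranker(X: np.ndarray, y: np.ndarray, group: list[int]) -> lgb.LGBMRanker:
    ranker = lgb.LGBMRanker(
        objective="lambdarank",
        num_leaves=128,       # ~ "Leaves: 128"
        n_estimators=2000,    # ~ "Trees: 2000"
        min_child_samples=8,  # ~ "Min. Support: 8"
    )
    # `group` lists how many candidate lines each task contributes,
    # so the ranker only compares lines within the same task.
    ranker.fit(X, y, group=group)
    return ranker
```

At prediction time, the highest-scoring candidate line in each task would be taken as the answer, so Recall@1 corresponds to the "Perfect Match" column in the tables above.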
Hi @jjhenkel,
It is really nice that you looked at my features (sorry, they are a little messy) and managed to build a model from them. It is impressive.
Is there a way for you to know which features are more important for the prediction?
To extract the features, I compute a high-level AST of each line (basically the type of each code element: VARIABLE, CLASS, STRING, COMMENT, NUMBER, TOKEN, KEYWORD). Do you think it could be used in a model?
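As a very rough illustration of that idea (regex-based, so only an approximation of the AST-based extraction described above), per-line token categories could be computed like this:

```python
# Crude, regex-based approximation of the per-line "high-level AST":
# each token of a Java line is mapped to a coarse category. The real
# extractor works on an AST; this sketch only illustrates the idea.
import re

JAVA_KEYWORDS = {"public", "private", "class", "return", "if", "else",
                 "for", "while", "new", "static", "void", "int"}

def line_categories(line: str) -> list[str]:
    cats = []
    for tok in re.findall(r'"[^"]*"|//.*|\d+|\w+|[^\s\w]', line):
        if tok.startswith('"'):
            cats.append("STRING")
        elif tok.startswith("//"):
            cats.append("COMMENT")
        elif tok.isdigit():
            cats.append("NUMBER")
        elif tok in JAVA_KEYWORDS:
            cats.append("KEYWORD")
        elif tok[0].isalpha() or tok[0] == "_":
            cats.append("VARIABLE")
        else:
            cats.append("TOKEN")
    return cats

# >>> line_categories('int count = 42; // total')
# ['KEYWORD', 'VARIABLE', 'TOKEN', 'NUMBER', 'TOKEN', 'COMMENT']
```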
Hi @tdurieux,
The feature importance is a good question! I'm not entirely sure yet; I think RankLib has a set of tools for looking at the importance of individual features (for certain supported models), but I haven't tried doing anything with that yet.
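As a hedged aside (not part of the RankLib tooling mentioned above): if the LightGBM stand-in sketched earlier were used, per-feature importances are available directly on the trained model. Here `feature_names` is assumed to be the list of labels used when the feature matrix was built.

```python
# Rank features by importance on a trained LGBMRanker (split counts by
# default). `feature_names` is a hypothetical list of feature labels.
def top_features(ranker, feature_names: list[str], k: int = 10):
    importances = ranker.feature_importances_
    ranked = sorted(zip(feature_names, importances), key=lambda p: -p[1])
    return ranked[:k]
```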
The approximate parse you do for each line is a nice asset! It could maybe be used to create some set-similarity-based features. There are some techniques that could take advantage of that information and also avoid any hand-designed features/filters. I haven't investigated any of that seriously yet, but there's been some interesting work on embedding edits that could maybe be applied to this challenge.
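One hypothetical way to turn the per-line categories into such a set-similarity feature is a simple Jaccard similarity between the category sets of the changed line and a candidate line; the category lists would come from an extractor like the `line_categories` sketch earlier.

```python
# Jaccard similarity over token-category sets, as one possible
# set-similarity feature. Inputs are category lists such as those
# produced by the hypothetical line_categories() sketch above.
def jaccard_similarity(cats_a: list[str], cats_b: list[str]) -> float:
    a, b = set(cats_a), set(cats_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```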
Thanks Jordan for the follow-up, as Thomas said, that's indeed really interesting.
> there's been some interesting work on embedding edits

Which papers are you thinking of?
This paper is the one I'm thinking of: https://arxiv.org/abs/1810.13337. The embedded edits could be fed directly to the learning-to-rank algorithms or to some other neural architecture and, in doing so, any hand-made features could be avoided. No idea if this would work, but it's an interesting approach I've wanted to try.
This paper is on my radar! We need an implementation of it for Java patches.
@monperrus you should update the ranking with the new best score ;)
@tdurieux Done!
Hi all,
I just did a quick naive solution based on string distance: