ASSERT-KTH / CodRep

58069 Java source code diffs. http://arxiv.org/pdf/1807.03200

Participant #7: Team CSV, Universidad Central "Marta Abreu" de Las Villas #20

Open chenzimin opened 6 years ago

chenzimin commented 6 years ago

Created for Team CSV (@cesarsotovalero) from the Universidad Central "Marta Abreu" de Las Villas for discussions. Welcome!

monperrus commented 6 years ago

Excellent, welcome! What's your score on Dataset1?

cesarsotovalero commented 6 years ago

My current scores, using just a very naive approach based on string comparison:

Score on dataset1: 0.1236735
Score on dataset2: 0.1096176

No machine learning yet.
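
For readers unfamiliar with the task, a naive string-comparison baseline could look roughly like the sketch below. It is only an illustration, not the code behind the scores above: it assumes each task provides a replacement line plus the lines of the program, and that the goal is to predict the 1-based number of the line that was replaced. The function name `predict_replaced_line` and the use of `difflib.SequenceMatcher` as the similarity measure are choices made for this example.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def predict_replaced_line(replacement: str, program_lines: list[str]) -> int:
    """Return the 1-based index of the program line most similar to `replacement`.

    Hypothetical helper: the similarity measure (SequenceMatcher ratio on
    whitespace-stripped lines) is an assumption made for this sketch.
    """
    target = replacement.strip()
    best_index, best_ratio = 1, -1.0
    for index, line in enumerate(program_lines, start=1):
        ratio = SequenceMatcher(None, target, line.strip()).ratio()
        # Keep the first line that reaches the highest similarity so far.
        if ratio > best_ratio:
            best_index, best_ratio = index, ratio
    return best_index


# Toy example: the replacement changes `int` to `long` on line 2.
program = [
    "public class Counter {",
    "    private int count = 0;",
    "    public void increment() { count++; }",
    "}",
]
predicted = predict_replaced_line("private long count = 0;", program)
```

The idea is simply that a one-line replacement usually stays textually close to the line it replaces, so the most similar line is a reasonable first guess.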

monperrus commented 6 years ago

Yes. The first 0.8 are easy to get (purely due to the data).

The remaining points are super hard.

Best score seen so far:

cesarsotovalero commented 6 years ago

My last scores:

| Dataset | Perfect Match | Score |
|---|---|---|
| Dataset 1 | 3867 | 0.11842962430821 |
| Dataset 2 | 9833 | 0.108660931336428 |
| Dataset 3 | 17197 | 0.0753167732657934 |

My current approach: string matching + parse checking

A related paper: A comparison of code similarity analysers

chenzimin commented 6 years ago

Thanks, I have updated the rankings

monperrus commented 6 years ago

good scores, getting quite close to @tdurieux :-)

cesarsotovalero commented 6 years ago

Hi everyone, I want to give an update on my scores for the preliminary ranking:

| Dataset | Perfect Match | Score |
|---|---|---|
| Dataset1 | 3900 | 0.1111243868013270 |
| Dataset2 | 9948 | 0.0995737723246198 |
| Dataset3 | 17438 | 0.0631975953292782 |
| Dataset4 | 15773 | 0.0769219481612277 |

My current approach is: string matching + parse checking + decision rules + heuristics

monperrus commented 6 years ago

It seems that you beat @tdurieux!! Congrats.

It's too late to be considered in the intermediate ranking, but it's really remarkable.

cesarsotovalero commented 6 years ago

Thanks @monperrus!! However, my approach has some performance issues. For instance, it takes almost 2 hours on Dataset1, which is far slower than @tdurieux's approach. Also, I think the accuracy (in terms of the loss function) needs to improve much more to really win the competition. I'll continue working on that.
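
To make "accuracy in terms of the loss function" concrete, here is a minimal sketch of how a score and a perfect-match count could be computed from predictions. It assumes the competition loss is the mean of tanh(|predicted line - correct line|) over all tasks, which is how the CodRep rules are understood here; this thread does not restate the formula, so treat it as an assumption. The names `codrep_loss` and `perfect_matches` are made up for this example.

```python
import math


def codrep_loss(predicted, actual):
    """Mean of tanh(|p - a|) over all tasks; lower is better, 0 is perfect.

    Assumed loss definition (not stated in this thread).
    """
    return sum(math.tanh(abs(p - a)) for p, a in zip(predicted, actual)) / len(actual)


def perfect_matches(predicted, actual):
    """Number of tasks where the predicted line equals the correct line."""
    return sum(1 for p, a in zip(predicted, actual) if p == a)


# Toy data: three exact hits, one near miss, one far miss.
predicted = [12, 40, 7, 101, 3]
actual = [12, 40, 7, 102, 300]
loss = codrep_loss(predicted, actual)
hits = perfect_matches(predicted, actual)
```

Under this assumed loss, a miss by a single line already costs about 0.76 and anything further off costs close to 1, so the score is driven almost entirely by the fraction of tasks that are not perfect matches.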

tdurieux commented 6 years ago

Strangely, my technique is still better on Dataset 2 but worse on the others.

I still have some room for improvement, but I am very happy with the performance of my technique. It takes less than 10 minutes to get the results on all datasets. That helps a lot when trying new improvements.