fdac15 / OSSFinder

Open Source Software Recommendation Engine
1 star, 1 fork

Analysis I: Feature extraction/search #2

Open alexklibisz opened 8 years ago

alexklibisz commented 8 years ago

As with the Data Discovery issue, use this issue to discuss ideas and tasks for feature extraction.

Some questions/ideas/options raised so far:

  1. Feature search vs. feature extraction
    • Dr. Mockus said it would be more reliable/robust to set up a text-based search for features, instead of extracting features and letting users choose among them.
  2. Options for feature search
    • A rough version of the pattern Dr. Mockus described: get the readme for every project, clean the text (capitalization, stop words, etc.), use word2vec to generate vectors that can be used for searching.
    • Someone will have to dig deeper into the actual process and how we will incorporate it into the web app for searching.
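
The cleaning step in that pattern could look roughly like the sketch below. It is only an illustration: the stop-word list and the `clean_readme` helper are hypothetical, and the word2vec training itself is not shown. The resulting token lists are what would be handed to a word2vec implementation.

```python
import re

# Assumption: a tiny illustrative stop-word list; a real one would be much larger.
STOP_WORDS = {"a", "an", "and", "the", "for", "of", "to", "is", "in", "with"}

def clean_readme(text):
    """Lowercase the text, keep alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

# Hypothetical README sentence used only to show the output shape.
tokens = clean_readme("A Graphing Library for the Web, released under the MIT license.")
print(tokens)
```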
alexklibisz commented 8 years ago

@audrism @MBenkhayal @inthesunset, another search-related question:

If we do the feature search option like Dr. Mockus suggested, will it be possible to search for multiple different features in one search query? For example: "graphing library" and "MIT license" in one query.

audrism commented 8 years ago

My suggestion would be to do text search, where any text string can be provided by the user.

audrism commented 8 years ago

I'd like you to also focus on a more specific search use case. Here are some that may make sense from your project's perspective:

a) I want to use a library/framework: which one is best for me?
b) I want to learn a new technology, and the best way is to contribute to a project that uses it: which project actually uses it, which one is willing to accept contributions and answer questions, and which one is best for seeing how the tech is used but maybe not the best for participating?
c) I want to be hired by Facebook, etc.: what should I do? E.g., which technologies do people from Facebook use and contribute to, which ones are obsolete, and which are rising?

One of these or some mod of it would be plenty.

alexklibisz commented 8 years ago

@inthesunset The first task will be to take some readme.md files (maybe 25-50 different ones) and see if you can use Word2Vec to make them searchable.

inthesunset commented 8 years ago

About using word2vec on Readme.md:

alexklibisz commented 8 years ago

@inthesunset Thanks for doing that research. Can you answer a couple things just to make sure we're on the same page:

  1. The way I understand it, the collection of strings you posted in your fourth point is a mapping from a term to the matching README files? For example, if I search "widget", the mapping you created with word2vec will give me the readme files that match that word? Is that correct?
  2. Is the "model" something that can be stored in MongoDB as a static copy for us to access when doing a search, or do we have to actually run word2vec every time a search is done?
  3. Can you describe more about how you trained the model? Or even better, add the code that you used to the repository. Maybe make another directory called "feature-search".

@audrism maybe you can give some insight regarding his last two points?

audrism commented 8 years ago

1) word2vec gives a numeric vector for each word, so you combine these vectors for the query and for each document, and then find the documents most similar to the query.
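
A minimal sketch of that idea, assuming the simplest possible way of combining vectors (a plain average) and cosine similarity for ranking. The three-dimensional vectors, file names, and helper functions are all hypothetical; a trained model would supply real vectors with many more dimensions.

```python
import math

# Hypothetical toy word vectors standing in for a trained word2vec model.
WORD_VECTORS = {
    "graph": [0.9, 0.1, 0.0],
    "plot": [0.8, 0.2, 0.1],
    "library": [0.5, 0.5, 0.0],
    "email": [0.0, 0.1, 0.9],
    "server": [0.1, 0.3, 0.8],
}

def average_vector(tokens):
    """Average the vectors of known tokens; return None if no token is known."""
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(query_tokens, docs):
    """Score every document against the query, most similar first."""
    q = average_vector(query_tokens)
    scored = []
    for name, tokens in docs.items():
        d = average_vector(tokens)
        if q is not None and d is not None:
            scored.append((name, cosine(q, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical tokenized READMEs.
docs = {
    "plotting.md": ["plot", "graph", "library"],
    "mailer.md": ["email", "server"],
}
ranking = rank(["graph", "library"], docs)
print(ranking)
```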

2) The simplest search tool (it does tfidf+lsi) is here: http://radimrehurek.com/gensim/simserver.html

While word2vec may provide a bit more precision, this or similar tools should suffice for this project.
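
To make the tfidf part of that approach concrete, here is a rough pure-Python sketch. It is not the gensim simserver API, and it leaves out the LSI projection entirely; the function names, the smoothed IDF formula, and the sample documents are assumptions chosen for illustration.

```python
import math
from collections import Counter

def weigh(tokens, idf):
    """Turn a token list into a sparse TF-IDF vector (term -> weight)."""
    counts = Counter(t for t in tokens if t in idf)
    return {term: tf * idf[term] for term, tf in counts.items()}

def build_index(docs):
    """docs maps a name to a token list. Returns (idf table, TF-IDF vectors)."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    # Assumption: a smoothed IDF; other weighting schemes are equally valid.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    vectors = {name: weigh(tokens, idf) for name, tokens in docs.items()}
    return idf, vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def search(query_tokens, idf, vectors):
    """Return (name, score) pairs with a non-zero score, best first."""
    q = weigh(query_tokens, idf)
    scored = [(name, cosine(q, vec)) for name, vec in vectors.items()]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: s[1], reverse=True)

# Hypothetical tokenized READMEs.
docs = {
    "jsframework.md": ["javascript", "framework", "web"],
    "pymailer.md": ["python", "email", "library"],
    "jsgarden.md": ["javascript", "documentation"],
}
idf, vectors = build_index(docs)
results = search(["javascript", "framework"], idf, vectors)
print(results)
```

Note that a document scores above zero as soon as it shares any one term with the query, so a multi-word query matches documents containing only some of the words.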

inthesunset commented 8 years ago
  1. I haven't done the search part yet. I'm stuck on how to use this model. The collection described above includes all the words that make up the model. Each word in the model is a vector.
audrism commented 8 years ago

I actually suggest dropping word2vec for search, as it may be nontrivial to combine word vectors well; see, e.g., http://eng.kifi.com/from-word2vec-to-doc2vec-an-approach-driven-by-chinese-restaurant-process/

Consequently, let's just go with approach 2) for doing search. word2vec may be interesting not for search but for synonym detection, though that would need many more documents.

inthesunset commented 8 years ago

Yes, it's very easy to look for a synonym using a word2vec model.
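
As a rough illustration of why that is easy: a synonym lookup is just a nearest-neighbour query over the word vectors. The vectors below are hypothetical toy values, and `most_similar` here is a hand-written helper, not a call into any particular library.

```python
import math

# Hypothetical toy vectors; a trained model would supply real ones.
VECTORS = {
    "graph": [0.9, 0.1, 0.0],
    "chart": [0.85, 0.15, 0.05],
    "email": [0.0, 0.1, 0.9],
    "mail": [0.05, 0.1, 0.85],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(word, topn=1):
    """Return the topn other words whose vectors are closest to `word`."""
    target = VECTORS[word]
    others = [(w, cosine(target, vec)) for w, vec in VECTORS.items() if w != word]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:topn]

nearest = most_similar("graph")
print(nearest)
```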

alexklibisz commented 8 years ago

@inthesunset Can you take a look at the second option Dr. Mockus mentioned and report back the results? Muhammed and I have started on the data discovery and the relationship calculations.

inthesunset commented 8 years ago

OK, I will handle that part.

inthesunset commented 8 years ago

I tried the second approach under Python 2.7.6. It gave some of the expected results, but also some error output that seems unrelated to the specific case. I think it's related to the multi-threading safety of sqlite.

code:

I searched for this problem but couldn't find a solution for the error, although the result is correct and useful. @audrism do you have any suggestions?

alexklibisz commented 8 years ago

From the other issue thread:

I retrieved all the Readme files of the repos (22560), tested the lsi method with the input 'javascript framework', and got the result below. It seems plausible.

[('1766718plivoplivoframework.md', 0.6083398461341858, None),
 ('1446771kstenerudiOS-Universal-Framework.md', 0.5893081426620483, None),
 ('581371reflexreflex-framework.md', 0.5881208181381226, None),
 ('640868seancorfieldfw1.md', 0.5805084705352783, None),
 ('1618539xp-frameworkxp-framework.md', 0.5725337266921997, None),
 ('302794901orgappframework.md', 0.5648699998855591, None),
 ('2293158rapid7metasploit-framework.md', 0.5638223886489868, None),
 ('141320gabrielyajl-objc.md', 0.5631961226463318, None),
 ('3973122leemasonNHP-Theme-Options-Framework.md', 0.5497069358825684, None),
 ('137013tylerhallsimple-php-framework.md', 0.5433249473571777, None),
 ('112633inet-frameworkinet.md', 0.5379834771156311, None),
 ('12110738zenorochasublime-javascript-snippets.md', 0.5274854898452759, None),
 ('785509xitrum-frameworkxitrum.md', 0.525899350643158, None),
 ('1219595BonsaiDenJavaScript-Garden.md', 0.5160852670669556, None),
 ('640868framework-onefw1.md', 0.5117881298065186, None),
 ('1665787LiftUXUpThemes-Framework.md', 0.5100019574165344, None),
 ('1665787UpThemesUpThemes-Framework.md', 0.5100019574165344, None),
 ('1439738actor-frameworkactor-framework.md', 0.5037848353385925, None),
 ('5226339githubRebel.md', 0.4991424083709717, None),
 ('6498492airbnbjavascript.md', 0.4921983480453491, None),
 ('97936parmanoirjscocoa.md', 0.4817003309726715, None),
 ('61760eczarnyxmlrpc.md', 0.46824800968170166, None),
 ('9252851daniellmbJavaScript-Scope-Context-Coloring.md', 0.46554285287857056, None),
 ('1860938UnionOfRADlithium.md', 0.4551745057106018, None),
 ('3685302AlloyTeamJX.md', 0.44039466977119446, None),
 ('398515datafolklabscement.md', 0.43713828921318054, None),
 ('4161594bitaVersioncpfthw.md', 0.43209388852119446, None),
 ('2284119iKreativWorkless.md', 0.4295537769794464, None),
 ('94056spazprojectspazcore.md', 0.42076489329338074, None),
 ('2651389TapQuolungo.thirdparties.md', 0.41885727643966675, None),
 ('1723225Khankhan-exercises.md', 0.4096054434776306, None),
 ('4465091advanced-jssyllabus.md', 0.409490168094635, None)]

Each tuple starts with a file name built from the repo id and full name. For example, in the first tuple the id is 1766718 and the full name is plivo/plivoframework.

alexklibisz commented 8 years ago

Maybe try

inthesunset commented 8 years ago

@alexklibisz I've put the 9 test examples in OSSFinder/Readme_retrieve.

MBenkhayal commented 8 years ago

@inthesunset @alexklibisz I began looking at the test examples for the "Python MVC" test, and here is what I have found so far. I only looked at the repos with a score over 0.5 (I assume the number is a correlation). Before I give the results, here are some things I noticed:

1) The search seems to be doing an OR search, i.e. it searches for "Python" OR "MVC". This led to results related to either MVC or Python, but not both.
2) Some of these are quite old and have not been updated in a few years.
3) One specifically said in the readme that it was broken; perhaps this is something we can filter by?

If the algorithm could be used with an AND instead of an OR, that would be good. Also, we may have to do some raw text analysis on the readmes ourselves to make sure the projects aren't broken/irrelevant.
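
One possible way to get AND behaviour without changing the similarity search itself is to post-filter its results, keeping only the READMEs that contain every query term. The sketch below assumes we keep a set of tokens per README; the file names, scores, and the `require_all_terms` helper are hypothetical.

```python
def require_all_terms(results, readme_tokens, query_terms):
    """Keep only results whose README contains every query term."""
    wanted = {t.lower() for t in query_terms}
    return [
        (name, score)
        for name, score in results
        if wanted <= readme_tokens.get(name, set())
    ]

# Hypothetical similarity results and per-README token sets.
results = [("mvc-php.md", 0.73), ("py-sandbox.md", 0.52), ("py-mvc.md", 0.51)]
readme_tokens = {
    "mvc-php.md": {"mvc", "php", "module"},
    "py-sandbox.md": {"python", "sandbox"},
    "py-mvc.md": {"python", "mvc", "framework"},
}
filtered = require_all_terms(results, readme_tokens, ["Python", "MVC"])
print(filtered)
```

The same filter could take an extra set of words to exclude (for example "broken" or "deprecated"), though deciding on that list would need a closer look at the actual readmes.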

Here are the results I got, I gave a small explanation of each repo:

2163316donalmvc.md,0.734551489353 -lightweight MVC module for teaching purposes, all done in PHP

14895315stoneniqiuFindLover.md,0.699222207069 -Some sort of MVC project using C# and KendoUI, in a different language

1088864andrewdaveypostal.md,0.580159902573 -Email sending library for ASP.NET MVC, done in C#

1843047spring-projectsspring-mvc-showcase.md,0.55679410696 -demonstrates what the Spring MVC web framework can do

1843047SpringSourcespring-mvc-showcase.md,0.55679410696 -redirects to above repo

5407062kivypyjnius.md,0.552711009979 -Python module to access Java classes using JNI

1242729smsohanMvcMailer.md,0.53835362196 -ASP.NET MVC email composer

1364268srkirklandDataAnnotationsExtensions.md,0.528128385544 -Server Side validation attributes that can be used in .NET projects w/o MVC dependency

4579552erichextertwitter.bootstrap.mvc.md,0.524285972118 -twitter’s bootstrap library for ASP.NET MVC 4 applications

645978TroyGoodeMembershipStarterKit.md,0.524157881737 -Sample skeleton ASP.NET MVC project to build off of

498035haypopysandbox.md,0.52396440506 -Python sandbox to run potentially unsafe code, currently broken

5637647AndreyAkinshinknockout-mvc.md,0.501231968403 -MVC wrapper library for the Knockout JS library

inthesunset commented 8 years ago

Yeah. For the three-dimensional visualization test, there are only 5 repos. None of them focuses on three-dimensional visualization. Only two are relevant to 3D libraries. The rest just include the word 'three'. Similarity ranges from 0.44 to 0.54 (no high similarity).