Open alexklibisz opened 8 years ago
@audrism @MBenkhayal @inthesunset, another search-related question:
If we do the feature search option like Dr. Mockus suggested, will it be possible to search for multiple different features in one search query? For example: "graphing library" and "MIT license" in one query.
My suggestion would be to do text search, where any text string can be provided by the user.
I'd like you to also focus on a more specific search use case. Here are some that may make sense from your project's perspective:
a) I want to use a library/framework: which one is best for me?
b) I want to learn a new technology, and the best way is to contribute to a project that uses it: which project actually uses it, which one is willing to accept contributions and answer questions, and which one is best for seeing how to use the tech but maybe not the best for participating?
c) I want to be hired by Facebook, etc.: what should I do? E.g., what technologies do people from Facebook use and contribute to, and which ones are obsolete and which are rising?
One of these, or some modification of it, would be plenty.
@inthesunset The first task will be to take some readme.md files (maybe 25 to 50 different ones) and see whether you can use Word2Vec to make them searchable.
@inthesunset Thanks for doing that research. Can you answer a couple things just to make sure we're on the same page:
@audrism maybe you can give some insight regarding his last two points?
1) word2vec gives a numeric vector for each word, so you combine these vectors for the query and for each document, and then find the documents most similar to the query.
2) The simplest search tool is here (it does tfidf+lsi): http://radimrehurek.com/gensim/simserver.html
While word2vec may provide a bit more precision, this or similar tools should suffice for this project.
I actually suggest dropping word2vec for search, as it may be nontrivial to combine word vectors well; see, e.g., http://eng.kifi.com/from-word2vec-to-doc2vec-an-approach-driven-by-chinese-restaurant-process/
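To make approach 1) concrete, here is a minimal sketch of the naive way to combine word vectors: average them for the query and for each document, then rank by cosine similarity. The three-dimensional vectors, the document names, and the helper names are all made up for illustration; a real word2vec model would supply the vectors. Plain averaging also ignores word order and word importance, which is part of why combining vectors well is nontrivial.

```python
import math

# Toy 3-d "word vectors" (hypothetical values, not from a trained model).
TOY_VECTORS = {
    "graphing": [0.9, 0.1, 0.0],
    "plot": [0.8, 0.2, 0.1],
    "library": [0.1, 0.9, 0.1],
    "license": [0.0, 0.1, 0.9],
    "mit": [0.1, 0.0, 0.8],
}


def average_vector(text, vectors):
    """Average the vectors of all known words in text; None if no word is known."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs, vectors):
    """Return (name, score) pairs sorted from most to least similar."""
    q = average_vector(query, vectors)
    scored = []
    for name, text in docs.items():
        d = average_vector(text, vectors)
        score = cosine(q, d) if q is not None and d is not None else 0.0
        scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical two-document corpus.
docs = {"plots.md": "plot library", "legal.md": "mit license"}
ranked = rank("graphing library", docs, TOY_VECTORS)
```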
Consequently, let's just go with approach 2) for doing search. word2vec may be interesting not for search but for synonym detection, but it would need many more documents.
Yes, it's very easy to look for a synonym using a word2vec model.
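As a rough illustration of what approach 2) does, the sketch below ranks documents by TF-IDF cosine similarity using only the standard library. It deliberately leaves out the LSI projection step that the gensim tool adds on top, and it is not the simserver API; the function names and the three tiny documents are assumptions made up for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a simplification)."""
    return text.lower().split()


def weight(tokens, idf):
    """TF-IDF weights for the tokens that appear in the idf table."""
    tf = Counter(tokens)
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def build_index(docs):
    """docs maps name -> text. Returns (idf table, per-document TF-IDF vectors)."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    n_docs = len(tokenized)
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = {name: weight(tokens, idf) for name, tokens in tokenized.items()}
    return idf, vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, idf, vectors):
    """Return (name, score) pairs sorted from most to least similar."""
    q = weight(tokenize(query), idf)
    scored = [(name, cosine(q, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical README snippets, for illustration only.
docs = {
    "a.md": "a javascript framework for building web apps",
    "b.md": "python module to access java classes",
    "c.md": "javascript style guide",
}
idf, vectors = build_index(docs)
ranked = search("javascript framework", idf, vectors)
```

Note that a document matching only one query term still gets a nonzero score here, which is the OR-like behaviour that a plain similarity ranking produces.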
@inthesunset Can you take a look at the second option Dr. Mockus mentioned and report back the results? Muhammed and I have started on the data discovery and the relationship calculations.
OK, I will handle that part.
I tried the second approach under Python 2.7.6. It did give the expected results, but also some error output that seems unrelated to the specific query. I think it's related to the multi-threading safety of sqlite.
I searched for this problem but couldn't find a solution for the error, although the results are correct and useful. @audrism do you have any suggestions on this problem?
From the other issue thread:
I retrieved the Readme files of all the repos (22560), tested the lsi method with the input 'javascript framework', and got the result below. It seems plausible.
[('1766718plivoplivoframework.md', 0.6083398461341858, None), ('1446771kstenerudiOS-Universal-Framework.md', 0.5893081426620483, None), ('581371reflexreflex-framework.md', 0.5881208181381226, None), ('640868seancorfieldfw1.md', 0.5805084705352783, None), ('1618539xp-frameworkxp-framework.md', 0.5725337266921997, None), ('302794901orgappframework.md', 0.5648699998855591, None), ('2293158rapid7metasploit-framework.md', 0.5638223886489868, None), ('141320gabrielyajl-objc.md', 0.5631961226463318, None), ('3973122leemasonNHP-Theme-Options-Framework.md', 0.5497069358825684, None), ('137013tylerhallsimple-php-framework.md', 0.5433249473571777, None), ('112633inet-frameworkinet.md', 0.5379834771156311, None), ('12110738zenorochasublime-javascript-snippets.md', 0.5274854898452759, None), ('785509xitrum-frameworkxitrum.md', 0.525899350643158, None), ('1219595BonsaiDenJavaScript-Garden.md', 0.5160852670669556, None), ('640868framework-onefw1.md', 0.5117881298065186, None), ('1665787LiftUXUpThemes-Framework.md', 0.5100019574165344, None), ('1665787UpThemesUpThemes-Framework.md', 0.5100019574165344, None), ('1439738actor-frameworkactor-framework.md', 0.5037848353385925, None), ('5226339githubRebel.md', 0.4991424083709717, None), ('6498492airbnbjavascript.md', 0.4921983480453491, None), ('97936parmanoirjscocoa.md', 0.4817003309726715, None), ('61760eczarnyxmlrpc.md', 0.46824800968170166, None), ('9252851daniellmbJavaScript-Scope-Context-Coloring.md', 0.46554285287857056, None), ('1860938UnionOfRADlithium.md', 0.4551745057106018, None), ('3685302AlloyTeamJX.md', 0.44039466977119446, None), ('398515datafolklabscement.md', 0.43713828921318054, None), ('4161594bitaVersioncpfthw.md', 0.43209388852119446, None), ('2284119iKreativWorkless.md', 0.4295537769794464, None), ('94056spazprojectspazcore.md', 0.42076489329338074, None), ('2651389TapQuolungo.thirdparties.md', 0.41885727643966675, None), ('1723225Khankhan-exercises.md', 0.4096054434776306, None), 
('4465091advanced-jssyllabus.md', 0.409490168094635, None)]
Each filename consists of the repo id followed by the fullname with the slash removed. For example, in the first tuple the id is 1766718 and the fullname is plivo/plivoframework.
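Because the id, owner, and repo name are concatenated without separators, a filename cannot be split back into its parts reliably (the owner/repo boundary is lost, and an owner name that starts with a digit would blur the end of the id). A minimal sketch of one way around this, assuming a list of known (id, fullname) pairs is available from the data retrieval step; `readme_filename`, `resolve`, and `known_repos` are hypothetical names:

```python
def readme_filename(repo_id, fullname):
    """Build the README filename the way the results above appear to be named."""
    owner, repo = fullname.split("/", 1)
    return f"{repo_id}{owner}{repo}.md"


def resolve(results, known_repos):
    """Map (filename, score, payload) tuples back to ((id, fullname), score).

    Filenames that are not in known_repos resolve to None.
    """
    by_filename = {readme_filename(i, name): (i, name) for i, name in known_repos}
    return [(by_filename.get(filename), score) for filename, score, _ in results]


# The first result from the output above, plus the pair stated in the thread.
results = [("1766718plivoplivoframework.md", 0.6083398461341858, None)]
known_repos = [(1766718, "plivo/plivoframework")]
resolved = resolve(results, known_repos)
```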
Maybe try
@alexklibisz I've put the 9 test examples on OSSFinder/Readme_retrieve.
@inthesunset @alexklibisz So I began to look at the test examples for the "Python MVC" test, and here is what I have found so far. I only looked at the repos with a correlation over 0.5 (I assume that's what the number is). Before I give the results, here are some things I noticed:
1) The search seems to be doing an OR search, i.e. it searches for "Python" OR "MVC". This led to results related to either MVC or Python, but not both.
2) Some of these are quite old and have not been updated in a few years.
3) One specifically said in its readme that it was broken; perhaps this is something we can filter by?
If the algorithm could be used with an AND instead of an OR, that would be good. Also, we may have to do some raw text analysis on the readmes ourselves to make sure the repos aren't broken or irrelevant.
Here are the results I got, with a short explanation of each repo:
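One possible way to get AND behaviour without changing the similarity search itself is to post-filter its results: keep only hits above a score threshold whose README text contains every query term. This is only a sketch under assumptions; the function names, the 0.5 default threshold (taken from the cutoff used above), and the placeholder README texts are illustrative, not part of the existing tool.

```python
import re


def words_in(text):
    """Set of lowercase alphanumeric words in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def contains_all_terms(text, query):
    """True if every word of the query occurs somewhere in the text."""
    return words_in(query) <= words_in(text)


def and_filter(results, readmes, query, min_score=0.5):
    """Keep (name, score) hits above min_score whose README has all query terms."""
    return [
        (name, score)
        for name, score in results
        if score >= min_score and contains_all_terms(readmes.get(name, ""), query)
    ]


# Placeholder data, for illustration only.
readmes = {
    "x.md": "A lightweight MVC module written in PHP.",
    "y.md": "A small MVC web framework for Python.",
    "z.md": "Python bindings for a Java library.",
}
results = [("x.md", 0.73), ("y.md", 0.61), ("z.md", 0.55)]
filtered = and_filter(results, readmes, "Python MVC")
```

The same post-filter could be extended to drop READMEs that mention being broken or unmaintained, though the wording to look for would have to be chosen by inspecting the data.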
2163316donalmvc.md,0.734551489353 -lightweight MVC module for teaching purposes, all done in PHP
14895315stoneniqiuFindLover.md,0.699222207069 -Some sort of MVC project using C# and KendoUI, in a different language
1088864andrewdaveypostal.md,0.580159902573 -Email sending library for ASP.NET MVC, done in C#
1843047spring-projectsspring-mvc-showcase.md,0.55679410696 -demonstrates what the Spring MVC web framework can do
1843047SpringSourcespring-mvc-showcase.md,0.55679410696 -redirects to above repo
5407062kivypyjnius.md,0.552711009979 -Python module to access Java classes using JNI
1242729smsohanMvcMailer.md,0.53835362196 -ASP.NET MVC email composer
1364268srkirklandDataAnnotationsExtensions.md,0.528128385544 -Server Side validation attributes that can be used in .NET projects w/o MVC dependency
4579552erichextertwitter.bootstrap.mvc.md,0.524285972118 -twitter’s bootstrap library for ASP.NET MVC 4 applications
645978TroyGoodeMembershipStarterKit.md,0.524157881737 -Sample skeleton ASP.NET MVC project to build off of
498035haypopysandbox.md,0.52396440506 -Python sandbox to run potentially unsafe code, currently broken
5637647AndreyAkinshinknockout-mvc.md,0.501231968403 -MVC wrapper library for the Knockout JS library
Yeah. For the three dimensional visualization query, there are only 5 repos, and none of them focuses on three dimensional visualization. Only two are relevant to 3D libraries; the rest just include the word 'three'. Similarity varies from 0.44 to 0.54 (no high similarity).
Similar to the Data Discovery issue, use this issue to discuss ideas and tasks for feature extraction.
Some questions/ideas/options raised so far: