Very cool, great idea. Some random observations:
@andrewcmyers This is just a proof-of-concept, and it's good to see that something very similar to Emery's ACM SIG curation can be obtained automatically from data already in CSRankings' dataset.
There are many alternative ideas one could use for this analysis (e.g., other semantic analysis methods or weighted schemes, along with branching into graph-based formulations).
@fycus-tree thanks for doing this analysis. This looks very interesting to me. I particularly like that this is entirely data-driven and that (as you point out) it produces many clusters that seem to be consistent with expectations and also with what Emery created. I.e., the data seems to support many of the decisions that were made to create the current areas.
Since my research area is medical image analysis, I can give you a little extra context/interpretation for your clustering results in this regard:
As far as your method is concerned: when you say "build a matrix of authors (R1 faculty included in CSRankings) by venues (4326 x 1360)", is this really just a binary matrix indicating whether an author published in a venue or not? Or did you create a weighted matrix (i.e., where the entries are the numbers of publications, or maybe even the co-author-adjusted scores that Emery uses)? Is the second option what you mean by a weighted scheme in your post above?
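For concreteness, the distinction could look like this in code. This is a minimal sketch, not the actual implementation; `pubs` as a list of `(author, venue, adjusted_count)` records is a hypothetical stand-in for whatever the extraction from the CSRankings data produces.

```python
# Sketch of the two matrix variants; `pubs` is assumed, not the real data format.
import numpy as np

pubs = [("A. Author", "SIGMOD", 0.5), ("A. Author", "PVLDB", 1.0)]  # toy example

authors = sorted({a for a, v, c in pubs})
venues = sorted({v for a, v, c in pubs})
a_idx = {a: i for i, a in enumerate(authors)}
v_idx = {v: j for j, v in enumerate(venues)}

# Option 1: binary -- entry is 1 iff the author ever published at the venue.
M_bin = np.zeros((len(authors), len(venues)))
# Option 2: weighted -- accumulate counts (or co-author-adjusted scores).
M_wgt = np.zeros((len(authors), len(venues)))
for a, v, c in pubs:
    M_bin[a_idx[a], v_idx[v]] = 1.0
    M_wgt[a_idx[a], v_idx[v]] += c
```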
I like your comment regarding stability under perturbation. Any ranking will, of course, always depend on how scores are weighted and categories are chosen, but I think some notion of stability under perturbation is a good idea. Is it easily possible with the code you have now to compute what would happen to the rankings if one were to change the areas as you suggest? (And maybe use the geometric mean Emery uses, as well as the simple additive approach we discussed before.) That would be very interesting to see.
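As I understand it, the two aggregation schemes would differ roughly as follows. This is a hedged sketch: `area_scores` mapping each institution to its per-area adjusted counts is a hypothetical structure, not CSRankings' actual internals.

```python
# Sketch of the two aggregation schemes; structures here are assumptions.
import math

def geometric_mean_score(scores):
    # CSRankings-style (as I understand it): geometric mean of (count + 1).
    return math.prod(s + 1.0 for s in scores) ** (1.0 / len(scores))

def additive_score(scores):
    # Simple additive alternative: just sum the per-area counts.
    return sum(scores)

def rank(area_scores, aggregate):
    # Order institutions by the chosen aggregate of their per-area scores.
    return sorted(area_scores,
                  key=lambda inst: aggregate(area_scores[inst]),
                  reverse=True)
```

Re-ranking under a perturbed set of areas would then just be a matter of recomputing `area_scores` with the new areas and calling `rank()` under both schemes.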
Thanks for this work. Very nice!
okay okay @andrewcmyers @marcniethammer
With the bug fix above, more LDA dimensions (64), more iterations, and k-means++ clustering in the semantic space (so I have to pick a cluster count), picking 26 clusters (the same number CSRankings has now), I get this (a sketch of the pipeline appears after the table). The area labels are just my own curation and probably imperfect, but the groups themselves are automatic.
Area | Venue 1 | Venue 2 | Venue 3 | Venue 4 | Venue 5 |
---|---|---|---|---|---|
AI & ML | NIPS | AAAI | ICML | IJCAI | AISTATS |
CV | CVPR | ICCV | ECCV | PAMI | WACV |
Data Mining | KDD | ICDM | SDM | CIKM | IEEE-TKDE |
NLP | ACL | EMNLP | HLT-NAACL | LREC | GECCO |
Medical Imaging | MICCAI | IEEE-TIM | NeuroImage | IEEE-TMI | Neurocomputing |
- | - | - | - | - | - |
Arch & Design | DAC | DATE | ISCA | ASPLOS | MICRO |
Network & Mobile | ACM/IEEE TON | SIGCOMM | NSDI | IMC | MobiCom |
Communication | INFOCOM | IEEE-TMC | ICDCS | Computer Networks | IEEE-JSAC |
Security | CCS | USENIX Security | IEEE S&P | NDSS | ACSAC |
HPC | IPDPS | SC | J. Parallel Distrib. Comput. | ICPP | ICCS |
Cloud & Cluster | CCPE | CLUSTER | CCGRID | HPDC | MASCOTS |
Database & IR | PVLDB | WWW | SIGMOD | Commun. ACM | ICDE |
PL | PLDI | OOPSLA | POPL | ICFP | ECOOP |
OS | USENIX ATC | DSN | EuroSys | SOSP | OSDI |
Software Engineering | ICSE | ISSTA | ASE | SIGSOFT FSE | IEEE-TSE |
- | - | - | - | - | - |
Algorithms | SODA | STOC | FOCS | SIAM J. Comput. | ICALP |
Theory | Theor. Comput. Sci. | PODS | WINE | PODC | J. Comput. Syst. Sci. |
Crypto | CRYPTO | TCC | EUROCRYPT | ASIACRYPT | J. Cryptology |
Information Theory | IEEE-TIT | IEEE-TSP | Allerton | CISS | IEEE-TCOM |
- | - | - | - | - | - |
Comp Bio | Journal of Computational Biology | PLoS Computational Biology | AMIA | BCB | RECOMB |
Graphics | SIGGRAPH | Comput. Graph. Forum | IEEE-TVCG | SIGGRAPH Asia | IEEE CG&A |
HCI | CHI | CSCW | ICWSM | UIST | UbiComp |
Robotics | ICRA | IROS | IJRR | RSS | IEEE T-RO |
- | - | - | - | - | - |
Control | CDC | ACC | IEEE-TACON | HSCC | ROBIO |
Smart Grid | IEEE-TPDS | IEEE-TC | IEEE-TIFS | IEEE-TSG | SmartGridComm |
Wireless | GLOBECOM | ICC | ICCCN | IEEE-TWC | MASS |
Clearly there's a lot of IEEE stuff in the dataset. And with the same number of categories as now, some CS stuff gets squished together to make space for the IEEE stuff. Again, these clusters are in the full 64-dimensional space; the 2D image is just for visualization. However, you can see some of the "challenging" groupings being close to each other: TCS and Algorithms, Medical Imaging & Vision, Control & Robotics, Crypto & Security, Distributed Computing & HPC, Graphics & Visualization.
The full dataset, with all the venues, is attached below: semantic_clusters3.csv.txt
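For reference, the pipeline described above could be reproduced roughly like this. A minimal sketch only: whether the original run used scikit-learn, and the `max_iter`/seed values, are assumptions; `M_wgt` and `venues` are from the matrix sketch earlier in the thread.

```python
# Sketch: 64-topic LDA over the author-by-venue count matrix,
# then k-means++ on the venues' topic vectors, with 26 clusters.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

# LDA treats each venue as a "document" whose "words" are authors,
# hence the transpose of the (authors x venues) matrix.
lda = LatentDirichletAllocation(n_components=64, max_iter=500,
                                random_state=0)
venue_topics = lda.fit_transform(M_wgt.T)   # (venues x 64) semantic space

km = KMeans(n_clusters=26, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(venue_topics)       # one cluster id per venue
```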
Another cool property is that you can get "meta-clusters" by simply asking for, say, 6 clusters. For example, I got:
It's interesting how these "groups" compare to the current groups of "AI, Systems, Theory, Interdisciplinary".
However, because there's no clear delineation into 6 groups, it's easy to get some other alternative with a random re-initialization. (Finding the "best" clustering in such a case is NP-hard, so we're not going there.) Here, the Design and Vision groups got merged into CS, while Theory and Security emerged as two new subgroups. But in both cases, databases/data-mining and the two communication clusters remain isolated, suggesting that they're pretty strong groupings.
Additionally, I redid the above using only venues with 35 or more R1 publications (instead of the 12-or-more threshold above). This brings me down to 349 venues (instead of around 1,200).
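The filter itself is just a column-sum threshold on the matrix; sketched here under the same assumptions as the earlier snippets (`M_wgt` and `venues` from above):

```python
# Sketch of the venue filter: keep only venues (columns) with at
# least `min_pubs` R1 publications before re-running the pipeline.
min_pubs = 35                               # was 12 in the run above
keep = M_wgt.sum(axis=0) >= min_pubs        # per-venue column sums
M_filtered = M_wgt[:, keep]
venues_kept = [v for v, k in zip(venues, keep) if k]
```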
There are about 16 clusters here, and repeating the clustering process doesn't change the results much. With more clusters you tend to get more stable results (e.g., if the number of clusters equals the input size, you always get the same result), but searching for the minimum stable clustering seems desirable. There were two "leftover" clusters with a hodge-podge of remaining stuff.
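One way to search for that minimum stable clustering: re-run k-means with several seeds at each k and measure pairwise agreement with the adjusted Rand index. A sketch, assuming scikit-learn and the `venue_topics` matrix from the pipeline sketch above; the 0.9 agreement threshold is my own arbitrary choice.

```python
# Sketch: find the smallest k whose clusterings agree across random seeds.
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability(X, k, n_runs=5):
    # Mean pairwise adjusted Rand index over n_runs differently-seeded runs.
    runs = [KMeans(n_clusters=k, n_init=10, random_state=s).fit_predict(X)
            for s in range(n_runs)]
    pairs = list(combinations(runs, 2))
    return sum(adjusted_rand_score(a, b) for a, b in pairs) / len(pairs)

for k in range(6, 40):
    if stability(venue_topics, k) > 0.9:    # threshold is an assumption
        print("smallest stable k:", k)
        break
```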
AI & ML & NLP: NIPS, AAAI, ICML, IJCAI, ACL, EMNLP
Vision: CVPR, ICCV, MICCAI, ECCV
Data: KDD, ICDM, SDM, WWW, CIKM, SIGIR
Arch & Design: DAC, DATE, ISCA, ASPLOS, MICRO, HPCA, ICCAD
HPC: IPDPS, SC, ICPP, ICCS, HPDC
Databases: PVLDB, SIGMOD, ICDE, FAST
Communication: SIGCOMM, NSDI, IMC, MobiCom, MobiSys
Networks: INFOCOM, TON, TPDS, ICDCS, ICCCN
Security & OS: CCS, USENIX Security, IEEE S&P, USENIX ATC, NDSS, OSDI, SOSP
SE & PL: ISCE, PLDI, OOPSLA, POPL, CAV, ISSTA, ASE
Theory: SODA, STOC, FOCS, ICALP, EC, SPAA, ITCS, ESA, PODS
Biology: J. Computational Biology, PLoS CB, BCB, RECOMB, TCBB
Graphics: SIGGRAPH, CGF, TVCG, SIGGRAPH Asia
Robotics: ICRA, IROS, IJRR, RSS
HCI: CHI, CSCW, UIST, UbiComp, HCI
EE Leftovers1: TIT, CDC, CRYPTO, ACC, TCC, EUROCRYPT, EuroSys, SIGCSE
EE Leftovers2: TWC, TSP, TSG, TC
It's too bad that all the crypto venues are lumped into "EE leftovers" here. Seems like the granularity is a little too coarse.
@andrewcmyers This is 16 clusters (compared to the 26 above, which is how many CSRankings currently has), so it is a little coarse. But "EE leftovers" is just a label I chose. IEEE TIT, CDC, ACC, EUROCRYPT: they're all kind of applied-mathematics fields (information theory, control, optimization, cryptography), so maybe "applied math" would have been a better label. You can see in the plot that crypto is its own little cluster, but it gets the pale-blue label that applies to leftovers all over the image; in this case the theme seems to be applied mathematics.
Edit: This is just one instance of a decent clustering with 16 groups. All it says is that controls, information theory, and cryptography see cross-authorship (among R1 universities, over the last 10 years, in Emery's list of CS faculty) about as often as Software Engineering & Programming Languages, Architecture & Design, or Artificial Intelligence & Machine Learning & Natural Language Processing, with each of those groups keeping some distance from the others. There are many other, equally valid groupings.
As this is just a 16-way label assignment, it doesn't carry any other information. The 2D embedding plots give you another view of field distances: you can see that cryptography is often independent. The 2D embedding also shows, for example, that graphics & vision are pretty close, even though they usually form separate clusters. The goal is simply to provide some quantitative data to help Emery form more equal groups.
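For what it's worth, a 2D view like the ones described could be produced along these lines. A sketch only: whether the original plots used t-SNE or some other embedding method is an assumption, and `venue_topics`, `labels`, and `venues` come from the earlier sketches.

```python
# Sketch: embed the 64-dim topic vectors in 2D and color by cluster.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

xy = TSNE(n_components=2, random_state=0).fit_transform(venue_topics)
plt.scatter(xy[:, 0], xy[:, 1], c=labels, cmap="tab20", s=12)
for name, (x, y) in zip(venues, xy):
    plt.annotate(name, (x, y), fontsize=5)   # label each venue point
plt.title("Venues in LDA topic space (2D embedding)")
plt.show()
```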
@fycus-tree Some human feedback.
I think applied math is closer to the algorithms conferences (STOC, FOCS, SODA); TIT in particular feels closer to STOC/FOCS/SODA to me.
But TCC, CRYPTO, and EUROCRYPT are closer to "applied math" applied to the specific topic of cryptography.
And EuroSys should stay together with USENIX ATC, SOSP, OSDI.
Still, this is really good work!
Currently, CSRankings uses ACM SIGs as the basis of its categories. However, Emery is left making many editorial decisions, and then only the top 3 conferences from each area are allowed. There are many issues with this, from the inclusion of fields (#856, #762, #367, #238, #147) to conferences (#427, #366, #361, #263) to metrics (#562).
As suggested by @marcniethammer, "Do R1 universities publish here?" seems a sensible criterion for a field, so perhaps CSRankings should consider re-organizing the categories based on actual publication information instead of just ACM SIGs (and a few select IEEE friends). @andrewcmyers has also highlighted some issues with the current categories.
I attempted to build some quantitative evaluation. Here's a summary of the basic technical approach:
EDIT: Table & Graph removed. I had a bug. Fixed it. See below for the new table & graphic
But it does make a case that data mining should be split (#238), medical imaging is separate from vision (#611), numerical & scientific computing may be covered by HPC & Theory (#856), distributed systems should be added (#423), and logic & econ are perhaps not yet notable enough "groups" (#857), etc.