emeryberger / CSrankings

A web app for ranking computer science departments according to their research output in selective venues, and for finding active faculty across a wide range of areas.
http://csrankings.org

Categories based on quantitative measures #859

Closed: fycus-tree closed this issue 4 years ago

fycus-tree commented 6 years ago

Currently, CSRankings uses ACM SIGs as the basis of its categories; however, Emery is left making many editorial decisions, and only the top 3 conferences from each area are included. There are many issues with this, from the inclusion of fields (#856 #762 #367 #238 #147) to conferences (#427 #366 #361 #263) to metrics (#562).

As suggested by @marcniethammer, "Do R1 universities publish here?" seems a sensible criterion for a field, so perhaps CSRankings should consider re-organizing the categories based on actual publication information instead of just ACM SIGs (and a few select IEEE friends). @andrewcmyers has also highlighted some issues with the current categories.

I attempted to build a quantitative evaluation. Here's a summary of the basic technical approach (a rough code sketch follows the list):

  1. Collect all venues with at least 12 (the current minimum) R1 universities publishing in them over the last 10 years (and remove arXiv).
  2. Build a matrix of authors (R1 faculty included in CSRankings) by venues (4326 x 1360).
  3. Perform LDA to get lower-dimensional (32) vectors for each venue, then perform tSNE to get 2D embeddings.
  4. Run DBSCAN on the 2D embeddings to get clusters.
  5. Plot and list! I get 30 clusters.
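
For concreteness, here is a minimal sketch of that pipeline in Python with scikit-learn. It is illustrative only, not the actual code: the `pubs` DataFrame (one row per author/venue/count triple derived from the CSRankings/DBLP data) and the DBSCAN parameters are assumptions.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

def cluster_venues(pubs):
    """pubs: hypothetical DataFrame with columns (author, venue, count),
    already restricted to venues with >= 12 R1 universities over 10 years."""
    # Author x venue publication-count matrix (R1 faculty x venues).
    matrix = pubs.pivot_table(index="author", columns="venue",
                              values="count", aggfunc="sum", fill_value=0)

    # LDA with authors as "documents" and venues as "words";
    # each venue gets a 32-d topic vector from the fitted components.
    lda = LatentDirichletAllocation(n_components=32, random_state=0)
    lda.fit(matrix.values)
    venue_vectors = lda.components_.T

    # tSNE down to 2D, then DBSCAN on the embedding to form venue clusters.
    embedding = TSNE(n_components=2, random_state=0).fit_transform(venue_vectors)
    labels = DBSCAN(eps=2.0, min_samples=4).fit_predict(embedding)
    return matrix.columns, embedding, labels
```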

EDIT: Table & Graph removed. I had a bug. Fixed it. See below for the new table & graphic

But it does make a case that data mining should be split out (#238), medical imaging is separate from vision (#611), numerical & scientific computing may be covered by HPC & Theory (#856), distributed systems should be added (#423), and logic & econ are perhaps not yet notable enough to be their own groups (#857), etc.

andrewcmyers commented 6 years ago

Very cool, great idea. Some random observations:

  1. This seems to make a case for including OOPSLA as the #3 conference in the PL group, which would make sense to me.
  2. The size of a venue, and lack of selectivity, seem to be boosting a number of venues that I am familiar with. It might make sense to rank based on the number of submissions rather than the number of papers, i.e., compensate for acceptance rate.
  3. Some of these clusters don't make complete sense. It is weird to me that OSDI and SOSP are being grouped with PLDI and OOPSLA rather than with USENIX ATC and EuroSys. And TVCG + SPAA seems incoherent.
fycus-tree commented 6 years ago

@andrewcmyers This is just a proof of concept, and it's good to see that something very similar to the Emery/ACM-SIG curation can be obtained automatically from data already in CSRankings' dataset.

There are a lot of alternative ideas one could use for this analysis (e.g., other semantic analysis methods or weighted schemes, along with branching into graph-based formulations).

  1. I wouldn't read into the ordering too much. Venues are ranked by the number of R1 universities publishing there, and there are ~1,000 conferences grouped into 30 clusters; I only report the top handful in that table. Handling how selective conferences are needs to be added or handled separately, as this process doesn't have that data.
  2. I don't have that data unfortunately. I'm simply using the CS Rankings data and DBLP, which list published papers. Getting selectivity data automatically isn't currently possible.
  3. Yeah. It's stochastic and based on co-authorship, and I'm only using American R1 universities. I'm somewhat unhappy that I can't seem to get good clusterings in the latent space instead of the tSNE space; until that's resolved, I think there's more room for improvement. The bioinformatics mess in the misc categories is still a problem, and I suspect rare venues or the scale differences between venues are still causing some issues. The TVCG + SPAA cluster seems to be "large-scale scientific data processing and visualization" if I had to give it a long name. Look at recent TVCG issues: many of those papers are about distributed and parallel computation algorithms.
  4. This is probably "too many clusters". If you use "too few" clusters, you tend to get things like ML & CV & AI & Robotics colliding in various ways (e.g. CV & Graphics and then ML & Robotics, or AI & Robotics). I suspect this is because, as Emery's nice pie charts on CSRankings show, most top faculty publish across a few different "fields", so finding an algorithmic tipping point where clusters are neither too big nor too small may not be easy. I suspect we'll get either merging, spurious categories, or both (as I obtained above). It may be quite hard to find something that's efficient to compute and doesn't suffer from this issue.
  5. Following some soft notion of stability under perturbation: across a lot of different clustering hyper-parameters, you still see the same types of trends, no matter the settings. That is, all of the clusterings are wrong in small ways, but the things they all agree on may be right (a sketch of one way to quantify that agreement follows below). For example: (1) you'll usually get independent categories for some fields that aren't independent right now (bioinformatics, medical imaging, machine learning, data mining, distributed & parallel computing); (2) when ML merges with another category, it's almost always CV or AI, not data mining (even though ML & data mining are currently combined on CSRankings); (3) you never get something like "combine all SIAM journals" into a single field (which means my math in #856 is misleading); (4) certain areas always cluster together (like all the HCI conferences), so perhaps they shouldn't be split into sub-fields (#147 may be wrong).
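
One way to make that "agreement under perturbation" idea concrete is to re-run the clustering under different seeds (or hyper-parameters) and count how often each pair of venues lands in the same cluster. A minimal sketch, not the actual code, assuming the per-venue vectors and names from the pipeline sketch above:

```python
from collections import Counter
from itertools import combinations

from sklearn.cluster import KMeans

def co_assignment_frequency(venue_vectors, venue_names, n_runs=20, n_clusters=30):
    """Fraction of runs in which each venue pair ends up in the same cluster."""
    together = Counter()
    for seed in range(n_runs):
        # Perturb by re-seeding; one could also vary the cluster count,
        # subsample authors, or swap in a different clustering algorithm.
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed).fit_predict(venue_vectors)
        for i, j in combinations(range(len(venue_names)), 2):
            if labels[i] == labels[j]:
                together[(venue_names[i], venue_names[j])] += 1
    return {pair: count / n_runs for pair, count in together.items()}
```

Pairs with a frequency near 1.0 are the groupings the different settings agree on; pairs that flip around are the ones where the hyper-parameters matter.
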
marcniethammer commented 6 years ago

@fycus-tree thanks for doing this analysis. This looks very interesting to me. I particularly like that this is entirely data-driven and that (as you point out) it produces many clusters that seem to be consistent with expectations and also with what Emery created. I.e., the data seems to support many of the decisions that were made to create the current areas.

Since my research area is medical image analysis I can give you a little bit of extra context/interpretation for your clustering results in this regard:

  1. What is interesting about your clustering result is that it indeed automatically finds MICCAI and IPMI, which are the two best conferences in our field that I have been arguing for (#611).
  2. It also identifies journals that fit very well into medical imaging (or maybe more precisely medical image computing/medical image analysis): NeuroImage and IEEE TMI. Both journals are common places to publish for our community and have pretty high impact factors (NeuroImage: 5.8 (5 year impact factor, 6.9); IEEE TMI: 3.9). This points to an issue with respect to the current policy of only including conference publications in csrankings. While it is my understanding that in many CS fields the primary way of publishing these days is indeed via conference publications, this is only partially the case for medical image analysis. Journals are still very important for us. That being said, even if one just takes MICCAI (and ignores IPMI and the journals) this already creates a bigger area than quite a few areas that are already included in csrankings (#611). I think your data correctly identifies that medical image analysis would get shortchanged by the conference-paper-only policy were it to be included. I suspect this may also affect other areas.
  3. I am a bit surprised that IEEE TIP got put into the mix, as this is a journal focusing more generally on image processing. But I suppose it is possible that TIP is popular with many researchers in our area as well. What I would have expected to see instead is the journal Medical Image Analysis (MedIA).
  4. It is very interesting to see (as you also point out) that medical image analysis is not simply part of CV, but forms its own separate cluster. This is entirely in line with my expectations. There are quite a few people in our community that also publish in CV and ML venues, but the areas are largely separate. This was also pointed out in #762.

As far as your method is concerned: when you say "build a matrix of authors (R1 faculty included in CS Rankings) by venues (4326 x 1360)", is this really just a binary matrix indicating whether an author published in a venue or not? Or did you create a weighted matrix (i.e., one where the entries are the number of publications, or maybe even the co-author-adjusted scores that Emery uses)? Is the second option what you mean by a weighted scheme in your post above?

I like your comment regarding stability under perturbation. Any ranking will, of course, always depend on how scores are weighted and categories are chosen, but I think some notion of stability under perturbation is a good idea. Is it easily possible, with the code you have now, to compute what would happen to the rankings if one were to change the areas as you suggest? (And maybe use both the geometric mean Emery uses and a simple additive approach, as we discussed before.) That would be very interesting to see.

Thanks for this work. Very nice!

fycus-tree commented 6 years ago

@marcniethammer

  1. To be clear: this is 30 clusters over ~1,200 venues (conferences and journals), so there's an average of 40 venues assigned to each label. The size of the dots in the scatter plot is scaled by the number of universities that publish in that venue. I'm simply reporting the top handful as measured by R1 universities (the image has the unedited top 3; the table has some minor curating on my part. For example, in Networks, I cut IEEE/ACM Transactions on Networking to make sure I had space for SIGCOMM and ICDCS in my "top" list, as they were "next up" and have been discussed in #423).
  2. If you want the top, say, 12 for "medical", which is really medical/IEEE image stuff: MICCAI, IEEE-TIP, NeuroImage, IEEE-TMI, IEEE-TBME, IEEE-TMM, IEEE-TCSVT, Medical Image Analysis, IPMI, ICME, IEEE Access, IEEE-SPM. But there are 30 other venues assigned to this label (including, for example, ACCV Workshops). To order within a label, I'm going by the number of R1 university publications in the last 10 years; that's how I got that ordered list of 12: they're the top 12 of 40, in order (see the sketch after this list). There are, of course, other metrics one could use to order within a label class.
  3. The matrix stuff is all fairly standard topic modeling (brief lecture notes). The elements are author publication counts in that venue, so anywhere from 0 to 134. LDA is typically done on document-term matrices with word counts; I'm just doing the analysis assuming that conferences are words and authors are documents.
  4. I think this was my error: I should have flipped it so that conferences are documents and authors are words. I just fixed it and did some quick analysis ... I can now do clustering in semantic space and get good clusters! I get much "blobbier" results as well, which is a good sign. Clustering works in either space now, but I think semantic clustering is the way to go. Some of the clusters make much more sense. See the new post below.
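
Regarding point 2 above: the within-label ordering is just a group-and-sort over the publication data. A minimal sketch, not the actual code, assuming the hypothetical `pubs` DataFrame from the earlier sketch and a `venue_labels` Series mapping each venue name to its cluster label:

```python
def top_venues_in_cluster(pubs, venue_labels, cluster_id, n=12):
    """Order the venues assigned to one cluster by total R1 faculty publications."""
    strength = pubs.groupby("venue")["count"].sum()            # publications per venue
    members = venue_labels.index[venue_labels == cluster_id]   # venues with this label
    return strength.reindex(members).sort_values(ascending=False).head(n)
```
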
fycus-tree commented 6 years ago

okay okay @andrewcmyers @marcniethammer

With the bug fix above, a few more LDA dimensions (64), more iterations, and k-means++ clustering in semantic space (so I have to pick a cluster number) with 26 clusters (the same number CSRankings has now), I get the table below. The area labels are just my own curation, and probably imperfect, but the groups themselves are automatic.
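
For reference, a minimal sketch of this revised pipeline (illustrative, not the exact code): venues as documents and authors as words, a 64-topic LDA with more iterations, and k-means++ directly in the topic space.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

def cluster_venues_semantic(matrix, n_topics=64, n_clusters=26):
    """matrix: the author x venue count matrix from the first sketch."""
    venue_author = matrix.T.values      # flip: venues are documents, authors are words

    lda = LatentDirichletAllocation(n_components=n_topics, max_iter=50, random_state=0)
    venue_topics = lda.fit_transform(venue_author)   # one 64-d semantic vector per venue

    # k-means++ in the semantic space (so the cluster count must be chosen up front);
    # 26 matches the current number of CSRankings areas.
    labels = KMeans(n_clusters=n_clusters, init="k-means++", n_init=20,
                    random_state=0).fit_predict(venue_topics)
    return venue_topics, labels
```

The same function with a smaller `n_clusters` (e.g. 6) gives the coarser "meta-clusters" discussed a couple of comments below.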

| Area | Venue 1 | Venue 2 | Venue 3 | Venue 4 | Venue 5 |
| --- | --- | --- | --- | --- | --- |
| AI & ML | NIPS | AAAI | ICML | IJCAI | AISTATS |
| CV | CVPR | ICCV | ECCV | PAMI | WACV |
| Data Mining | KDD | ICDM | SDM | CIKM | IEEE-TKE |
| NLP | ACL | EMNLP | HLT-NAACL | LREC | GECCO |
| Medical Imaging | MICCAI | IEEE-TIM | NeuroImage | IEEE-TMI | Neurocomputing |
| - | - | - | - | - | - |
| Arch & Design | DAC | DATE | ISCA | ASPLOS | MICRO |
| Network & Mobile | ACM/IEEE TON | SIGCOMM | NSDI | IMC | MobiCom |
| Communication | INFOCOM | IEEE-TMC | ICDCS | Computer Networks | IEEE-JSAC |
| Security | CCS | USENIX Security | IEEE S&P | NDSS | ACSAC |
| HPC | IPDPS | SC | J. Parallel Distrib. Comput. | ICPP | ICCS |
| Cloud & Cluster | CCPE | CLUSTER | CCGRID | HPDC | MASCOTS |
| Database & IR | PVLDB | WWW | SIGMOD | Commun. ACM | ICDE |
| PL | PLDI | OOPSLA | POPL | ICFP | ECOOP |
| OS | USENIX ATC | DSN | EuroSys | SOSP | OSDI |
| Software Engineering | ICSE | ISSTA | ASE | SIGSOFT FSE | IEEE-TSE |
| - | - | - | - | - | - |
| Algorithms | SODA | STOC | FOCS | SIAM J. Comput. | ICALP |
| Theory | Theor. Comput. Sci. | PODS | WINE | PODC | J. Comput. Syst. Sci. |
| Crypto | CRYPTO | TCC | EUROCRYPT | ASIACRYPT | J. Cryptology |
| Information Theory | IEEE-TIT | IEEE-TSP | Allerton | CISS | IEEE-TCOM |
| - | - | - | - | - | - |
| Comp Bio | Journal of Computational Biology | PLoS Computational Biology | AMIA | BCB | RECOMB |
| Graphics | SIGGRAPH | Comput. Graph. Forum | IEEE-TVCG | SIGGRAPH Asia | IEEE CG&A |
| HCI | CHI | CSCW | ICWSM | UIST | UbiComp |
| Robotics | ICRA | IROS | IJRR | RSS | IEEE T-RO |
| - | - | - | - | - | - |
| Control | CDC | ACC | IEEE-TACON | HSCC | ROBIO |
| Smart Grid | IEEE-TPDS | IEEE-TC | IEEE-TIFS | IEEE-TSG | SmartGridComm |
| Wireless | GLOBECOM | ICC | ICCCN | IEEE-TWC | MASS |

[Figure: kmeans_clusters2]

Clearly there's a lot of IEEE stuff in the dataset, and with the same number of categories as now, some CS stuff gets squished together to make space for the IEEE stuff. Again, these clusters are computed in the full 64-dimensional space; the 2D image is just for visualization. However, you can see some of the "challenging" groupings being close to each other: TCS and Algorithms, Medical Imaging & Vision, Control & Robotics, Crypto & Security, Distributed Computing & HPC, Graphics & Visualization.

The full dataset, with all the venues, is attached below: semantic_clusters3.csv.txt

fycus-tree commented 6 years ago

Another cool property is that you can get "meta-clusters" by simply asking for, say, 6 clusters. For example, I got:

  1. Vision: Computer Vision, Medical Imaging, Computational Photography, etc.
  2. CompSci: Theory, AI, OS, PL, Software Engineering, ML, Robotics, HCI, NLP, Graphics, etc.
  3. CS Comm: Networking, Mobile, Metrics
  4. EE Comm: Information Theory, Distributed Computing, Wireless
  5. Data: Data-mining, Information Retrieval, WWW, Databases
  6. Design: Design automation, CAD, VLSI

It's interesting how these "groups" compare to the current groups of "AI, Systems, Theory, Interdisciplinary".

[Figure: 5_v1]

However, because there's no clear delineation into 6 groups, it's easy to get some other alternative with random re-initialization (and in that setting, finding the "best" clustering is NP-hard, so we're not going there). In the alternative below, the Design and Vision groups got merged into CS, while Theory and Security emerged as two new subgroups. But in both cases, databases/data-mining and the two communication clusters remain isolated, suggesting that they're pretty strong groupings.

[Figure: 5_v2]

fycus-tree commented 6 years ago

Additionally, I redid the above using only venues with 35 or more R1 publications (instead of the 12-or-more threshold above). This brings me to 349 venues (instead of around 1,200).

There are about 16 clusters here, and repeating the clustering process doesn't change the results much. With more clusters you'll tend to get more stable results (e.g., if the number of clusters equals the input size, you'll always get the same result), but searching for the minimum stable clustering seems desirable (a sketch of one way to do that search is below). There were two "leftover" clusters with a hodge-podge of remaining venues.
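
A sketch of one way to do that search, not the actual code: sweep the cluster count, repeat the clustering with different seeds, and take the smallest count whose runs agree well. Here k-means, the adjusted Rand index, the range of k, and the stability threshold are all assumptions.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def smallest_stable_k(venue_topics, k_range=range(10, 31), n_seeds=5, threshold=0.8):
    """Return the smallest cluster count whose results agree across random restarts."""
    for k in k_range:
        runs = [KMeans(n_clusters=k, n_init=10, random_state=s).fit_predict(venue_topics)
                for s in range(n_seeds)]
        mean_ari = np.mean([adjusted_rand_score(a, b)
                            for a, b in combinations(runs, 2)])
        if mean_ari >= threshold:
            return k, mean_ari
    return None  # nothing in the range met the threshold
```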

[Figure: 16_or_bust] Attachment: 16_or_bust.csv.txt

andrewcmyers commented 6 years ago

It's too bad that all the crypto venues are lumped into "EE leftovers" here. Seems like the granularity is a little too coarse.

fycus-tree commented 6 years ago

@andrewcmyers This is 16 clusters (compared to the 26 above, which is how many CSRankings currently has), so it is a little coarse. But "EE leftovers" is just a label I chose. IEEE TIT, CDC, ACC, EUROCRYPT: they're all kind of applied-mathematics fields (information theory, control, optimization, cryptography), so maybe "applied math" would have been a better label. You can see in the plot that crypto is its own little cluster, but it gets that pale-blue label that applies to some leftovers all over the image; in this case the theme seems to be applied mathematics.

Edit: This is just one instance of a decent clustering with 16 groups. All it says is that controls, information theory, and cryptography see cross-authorship (among R1 universities, in the last 10 years, within Emery's list of CS faculty) about as often as Software Engineering & Programming Languages, Architecture & Design, or Artificial Intelligence & Machine Learning & Natural Language Processing, with each of those groups having some sense of distance from the others. There are many other, equally valid groupings.

As this is just a 16-way label assignment, it doesn't carry any other information. The 2D embedding plots give you another idea of field distances: you can see that cryptography is often independent. With the 2D embedding you get more information, for example that graphics & vision are pretty close, even though they usually form separate clusters. The goal is simply to provide some quantitative data to help Emery form more equal groups.

weikengchen commented 5 years ago

@fycus-tree Some human feedback.

I think applied math is closer to the algorithms conferences STOC, FOCS, and SODA. TIT is closer to STOC/FOCS/SODA, I feel.

But TCC, CRYPTO, and EUROCRYPT are closer to applying "applied math" to the specific topic of cryptography.

And EuroSys should stay together with USENIX ATC, SOSP, OSDI.

Still, this is really good work!

github-actions[bot] commented 4 years ago

Stale issue message