UoMResearchIT / RSESkillsGraph

A Python web application for visualising the skills of RSEs in ResearchIT
https://rseskillsgraph.itservices.manchester.ac.uk/
Apache License 2.0

Wikipedia redirect pages get their own entries in the skills graph #101

Open ianhinder opened 2 months ago

ianhinder commented 2 months ago

If the skills "VS Code" and "Visual Studio Code" both appear in the list of skills, we get a node for each, even though "VS Code" is a Wikipedia redirect page.

We should prevent people from including skills which are redirects, and instead require them to use the target page title.
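A minimal sketch of such a check, assuming a hypothetical `resolve` callable that returns the title of the page Wikipedia ultimately serves for a given title. Neither `find_redirect_skills` nor `resolve` is taken from the RSESkillsGraph code; the resolver here is a fixed table so the example runs offline.

```python
from typing import Callable, Dict, Iterable


def find_redirect_skills(
    skills: Iterable[str], resolve: Callable[[str], str]
) -> Dict[str, str]:
    """Return {skill: target title} for every skill that is not its own target.

    `resolve` is a hypothetical lookup (for example, a wrapper around a
    Wikipedia query) that returns the final page title for a given title.
    """
    failures = {}
    for skill in skills:
        target = resolve(skill)
        if target != skill:
            # The skill name is a redirect (or otherwise non-canonical).
            failures[skill] = target
    return failures


# Stand-in resolver backed by a fixed table (an assumption for illustration).
known_redirects = {"VS Code": "Visual Studio Code"}
failures = find_redirect_skills(
    ["VS Code", "Visual Studio Code"],
    lambda title: known_redirects.get(title, title),
)
print(failures)
```

Returning the target title alongside each offending skill means the failure message can tell the contributor which title to use instead.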

ianhinder commented 2 months ago

The problem with adding this check is that it has generated 76 failures on the existing data:

Canonicalised titles:
  .Net -> .net (disambiguation)
  ARM -> Arm (disambiguation)
  Air Quality -> Air pollution
  Angular JS -> AngularJS
  Applications Architecture -> Applications architecture
  Artifical Intelligence -> Artificial intelligence
  Artifical intelligence -> Artificial intelligence
  Artificial neural network -> Neural network (machine learning)
  Association Rule Learning -> Association rule learning
  C++ (programming language) -> C++
  CAD -> Computer-aided design
  Cascading Style Sheets -> CSS
  Climate Change -> Climate change
  Code optimization -> Program optimization
  Cyber-physical system -> Cyber–physical system
  Data Mining -> Data mining
  Data Science -> Data science
  Data cluster -> Disk sector
  Data transformation -> Data transformation (computing)
  Data visualization -> Data and information visualization
  Database Design -> Database design
  Design patterns -> Design pattern
  Desktop app -> Application software
  Environmental Science -> Environmental science
  Estimation Theory -> Estimation theory
  GLSL -> OpenGL Shading Language
  Geospatial data -> Geographic data and information
  Github -> GitHub
  Haskell (programming language) -> Haskell
  HoloLens -> Microsoft HoloLens
  Igor Pro -> IGOR Pro
  Java Spring -> Spring Framework
  Javascript -> JavaScript
  Jquery -> JQuery
  Jupyter Notebook -> Project Jupyter
  Link Analysis -> Link analysis
  Machine Learning -> Machine learning
  Mathematical modelling -> Mathematical model
  Microcontrollers -> Microcontroller
  Mobile Development -> Mobile app development
  Model View Controller -> Model–view–controller
  Molecular dynamics simulation -> Molecular dynamics
  Monte Carlo methods -> Monte Carlo method
  NVIDIA CUDA -> CUDA
  NVIDIA Quadro Plex -> Nvidia Quadro Plex
  Neo4J -> Neo4j
  Neural networks -> Neural network
  Neuromorphic engineering -> Neuromorphic computing
  Numerical methods -> Numerical analysis
  Numerical software -> Numerical analysis
  OpenMPI -> Open MPI
  Optimisation -> Mathematical optimization
  Optimization -> Mathematical optimization
  Powershell -> PowerShell
  Product lifecycle management -> Product lifecycle
  Real-time networking -> Real-time computing
  Recommender Systems -> Recommender system
  Representational state transfer -> REST
  SASS -> Sass
  Scripting -> Script
  Sensor networks -> Wireless sensor network
  Sequential Pattern Mining -> Sequential pattern mining
  Shaders -> Shader
  Software Testing -> Software testing
  Solar Energy -> Solar energy
  Statistical physics -> Statistical mechanics
  Unit Testing -> Unit testing
  VS Code -> Visual Studio Code
  Verification and Validation -> Verification and validation
  Version Control -> Version control
  Wearable computing -> Wearable computer
  Web search engine -> Search engine
  Win32 -> Windows API
  Windows SDK -> Microsoft Windows SDK
  XCode -> Xcode
  Xamarin.Forms -> Xamarin

What I don't understand is why some of these differ only in case and were not picked up before. The change I made was to follow redirects when canonicalising a page title, so if the page is a redirect, the canonicalised title no longer matches the input title and the test fails. But I thought the whole point of canonicalisation was that, at the very least, the case was corrected, so there shouldn't have been any titles that differed only in case from their canonical form.
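To make the two steps concrete, here is a hedged sketch of canonicalisation that applies title normalisation and then follows redirects. It assumes the response shape of the MediaWiki `action=query` API when the `redirects` parameter is set, where `normalized` and `redirects` are lists of `from`/`to` pairs. The function name `canonical_title` and the sample response are hypothetical; the response is hand-written, not captured from Wikipedia. One possible explanation for the case-only entries, which would need verifying: normalisation may only adjust the first character and underscores, in which case a title such as "Data Mining" would be a separate redirect page rather than a normalised form of "Data mining".

```python
def canonical_title(title: str, response: dict) -> str:
    """Apply the `normalized` mappings, then the `redirects` mappings."""
    query = response.get("query", {})
    for step in ("normalized", "redirects"):
        mapping = {entry["from"]: entry["to"] for entry in query.get(step, [])}
        seen = set()
        # Follow the chain of from/to pairs, guarding against cycles.
        while title in mapping and title not in seen:
            seen.add(title)
            title = mapping[title]
    return title


# Hand-written sample response (an assumption, for illustration only).
sample = {
    "query": {
        "normalized": [{"from": "data_Mining", "to": "Data Mining"}],
        "redirects": [{"from": "Data Mining", "to": "Data mining"}],
    }
}

without_redirects = {"query": {"normalized": sample["query"]["normalized"]}}

print(canonical_title("data_Mining", without_redirects))  # normalisation only
print(canonical_title("data_Mining", sample))             # redirects followed
```

If that assumption holds, the first call leaves the inner capital "M" untouched and only the second reaches the target title, which would explain why these entries passed before redirects were followed.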

The entry for "Arm" is interesting: Wikipedia does of course have a page about arms, and there are only so many things we can check automatically!

Also, ".net" is a link to the top-level DNS domain, whereas ".Net" is a disambiguation page which includes .NET_Framework as one of the options.

So we probably need to go through each of these and fix them. It's also possible that the list contains wrong pages, such as "arm", which would have to be inspected manually to check that they are correct.
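To make the manual pass more tractable, the failures could be sorted into rough buckets first. This is only a sketch: the `classify` helper and its three categories are hypothetical, and the sample pairs are copied from the list above.

```python
from collections import defaultdict


def classify(original: str, canonical: str) -> str:
    """Roughly bucket a failure by how the two titles differ."""
    if canonical.endswith("(disambiguation)"):
        return "disambiguation"   # needs a human to pick the right page
    if original.casefold() == canonical.casefold():
        return "case-only"        # likely safe to rename mechanically
    return "different-title"      # check the target is the intended topic


# A few pairs taken from the failure list in this issue.
pairs = [
    (".Net", ".net (disambiguation)"),
    ("Data Mining", "Data mining"),
    ("Github", "GitHub"),
    ("Data cluster", "Disk sector"),
    ("VS Code", "Visual Studio Code"),
]

buckets = defaultdict(list)
for original, canonical in pairs:
    buckets[classify(original, canonical)].append(original)

for kind, titles in buckets.items():
    print(kind, titles)
```

Entries in the "different-title" bucket, such as "Data cluster", are the ones most likely to point at a page that was never the intended skill.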