Open ianhinder opened 1 month ago
The problem with adding this check is that it has generated 76 failures on the existing data:
Canonicalised titles:
.Net -> .net (disambiguation)
ARM -> Arm (disambiguation)
Air Quality -> Air pollution
Angular JS -> AngularJS
Applications Architecture -> Applications architecture
Artifical Intelligence -> Artificial intelligence
Artifical intelligence -> Artificial intelligence
Artificial neural network -> Neural network (machine learning)
Association Rule Learning -> Association rule learning
C++ (programming language) -> C++
CAD -> Computer-aided design
Cascading Style Sheets -> CSS
Climate Change -> Climate change
Code optimization -> Program optimization
Cyber-physical system -> Cyber–physical system
Data Mining -> Data mining
Data Science -> Data science
Data cluster -> Disk sector
Data transformation -> Data transformation (computing)
Data visualization -> Data and information visualization
Database Design -> Database design
Design patterns -> Design pattern
Desktop app -> Application software
Environmental Science -> Environmental science
Estimation Theory -> Estimation theory
GLSL -> OpenGL Shading Language
Geospatial data -> Geographic data and information
Github -> GitHub
Haskell (programming language) -> Haskell
HoloLens -> Microsoft HoloLens
Igor Pro -> IGOR Pro
Java Spring -> Spring Framework
Javascript -> JavaScript
Jquery -> JQuery
Jupyter Notebook -> Project Jupyter
Link Analysis -> Link analysis
Machine Learning -> Machine learning
Mathematical modelling -> Mathematical model
Microcontrollers -> Microcontroller
Mobile Development -> Mobile app development
Model View Controller -> Model–view–controller
Molecular dynamics simulation -> Molecular dynamics
Monte Carlo methods -> Monte Carlo method
NVIDIA CUDA -> CUDA
NVIDIA Quadro Plex -> Nvidia Quadro Plex
Neo4J -> Neo4j
Neural networks -> Neural network
Neuromorphic engineering -> Neuromorphic computing
Numerical methods -> Numerical analysis
Numerical software -> Numerical analysis
OpenMPI -> Open MPI
Optimisation -> Mathematical optimization
Optimization -> Mathematical optimization
Powershell -> PowerShell
Product lifecycle management -> Product lifecycle
Real-time networking -> Real-time computing
Recommender Systems -> Recommender system
Representational state transfer -> REST
SASS -> Sass
Scripting -> Script
Sensor networks -> Wireless sensor network
Sequential Pattern Mining -> Sequential pattern mining
Shaders -> Shader
Software Testing -> Software testing
Solar Energy -> Solar energy
Statistical physics -> Statistical mechanics
Unit Testing -> Unit testing
VS Code -> Visual Studio Code
Verification and Validation -> Verification and validation
Version Control -> Version control
Wearable computing -> Wearable computer
Web search engine -> Search engine
Win32 -> Windows API
Windows SDK -> Microsoft Windows SDK
XCode -> Xcode
Xamarin.Forms -> Xamarin
What I don't understand is why some of these differ only in case and were not picked up before. The change I made was to follow redirects when canonicalising a page title. Now, if the page is a redirect, the canonicalised title no longer matches the input title, and the test fails. But I thought the whole point of canonicalisation was that it at least corrected the case, so there shouldn't have been any titles that differed from their canonical form only in case.
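For reference, here is a minimal sketch of what following redirects could look like against the MediaWiki Action API. It is not the project's actual code. The names `fetch_query` and `canonical_title` and the User-Agent string are made up for illustration. It may also explain the case-only entries: as far as I know, MediaWiki's own title normalisation only capitalises the first character and converts underscores to spaces, so a title like "Data Mining" would only map to "Data mining" via a redirect, which the check did not follow before.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_query(title):
    """Query the MediaWiki API for one title, asking it to follow redirects."""
    params = urllib.parse.urlencode({
        "action": "query",
        "titles": title,
        "redirects": 1,        # resolve redirect pages to their targets
        "format": "json",
        "formatversion": 2,
    })
    request = urllib.request.Request(
        f"{API_URL}?{params}",
        # Hypothetical User-Agent; replace with something identifying your tool.
        headers={"User-Agent": "skills-title-check/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def canonical_title(title, response):
    """Apply the 'normalized' and then the 'redirects' mappings of a response.

    Both are lists of {"from": ..., "to": ...} entries; either may be absent.
    """
    query = response.get("query", {})
    for step in ("normalized", "redirects"):
        for mapping in query.get(step, []):
            if mapping["from"] == title:
                title = mapping["to"]
    return title
```

Usage would be something like `canonical_title(t, fetch_query(t))`. Keeping the parsing separate from the network call means the check can be unit-tested with canned responses.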
The entry for "Arm" is interesting: Wikipedia does of course have a page about arms, and there are only so many things we can check automatically!
Also, ".net" links to the page for the top-level DNS domain, whereas ".Net" is a disambiguation page which includes .NET_Framework as one of the options.
So we probably need to go through each of these and fix them. Some entries, such as "arm" in the list, may also point at the wrong page and would have to be inspected manually to check that they are correct.
If the skills "VS Code" and "Visual Studio Code" are both used in the list of skills, we get a node for each, even though "VS Code" is a Wikipedia redirect page.
We should prevent people from including skills which are redirects, and instead require them to use the target page title.
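A rough sketch of how such a check could be structured, assuming some `canonicalise` callable that maps a title to its canonical (redirect-resolved) form. The function names and the lookup table in the usage example are hypothetical, not taken from the existing code:

```python
def find_noncanonical(skills, canonicalise):
    """Return (input, canonical) pairs for every skill that is not canonical.

    `canonicalise` is any callable mapping a title to its canonical title,
    e.g. one that follows Wikipedia redirects.
    """
    failures = []
    for skill in sorted(set(skills)):
        canonical = canonicalise(skill)
        if canonical != skill:
            failures.append((skill, canonical))
    return failures


def format_failures(failures):
    """Render failures in the same 'input -> canonical' form as the list above."""
    return "\n".join(f"{source} -> {target}" for source, target in failures)


if __name__ == "__main__":
    # Stand-in lookup table instead of a live Wikipedia query (assumption).
    known_redirects = {"VS Code": "Visual Studio Code"}
    skills = ["VS Code", "Visual Studio Code"]
    failures = find_noncanonical(skills, lambda t: known_redirects.get(t, t))
    print(format_failures(failures))
```

A test could then simply assert that `find_noncanonical` returns an empty list for the skills data, and print the formatted failures otherwise.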