breck7 / pldb

PLDB: a Programming Language DataBase
https://pldb.io

Columns with multiple answers are problematic. We should consider using sets as a data structure instead of strings #349

Closed tif-calin closed 5 months ago

tif-calin commented 1 year ago

Note: see related issue #348.

This is the current frequency count for the compilesTo column:

  javascript: 32,
  ocaml: 1,
  "c cpp objective-c javascript": 1,
  html: 1,
  llvmir: 1,
  c: 8,
  lua: 2,
  "arm-templates": 1,
  python: 1,
  sql: 2,
  cpp: 1,
  wasm: 2,
  latex: 2,
  "javascript java php python r ruby scheme": 1,
  "json yaml toml xml": 1,
  php: 2,
  csa: 1,
  "s-expressions": 1,
  "praxis-lang": 1,
  "x86-64-isa arm": 1,
  fortran: 1

One major issue that comes up immediately is that "javascript" and "javascript java php python r ruby scheme" are counted as two separate answers.

Clearly this isn't a useful distinction to make. There are a few possible solutions. My least favorite would be to automatically generate a bunch of compilesTo{lang} boolean columns. I think a better solution would be to use Sets for fields like this: membership checks on a Set are O(1) on average, and we don't have to worry about order or duplicated values.
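As a rough illustration of the Set idea, here is a minimal sketch that parses each space-delimited cell into a Set and counts every value once per row. The helper names parseCell and countValues are hypothetical, not part of PLDB or JTree, and the sample rows are copied from the compilesTo list above.

```javascript
// Sample cells copied from the compilesTo frequency list above.
const rows = [
  "javascript",
  "c cpp objective-c javascript",
  "javascript java php python r ruby scheme",
  "c",
];

// Hypothetical helper: parse a space-delimited cell into a Set of ids.
// Order and duplicates inside a cell stop mattering once it is a Set.
function parseCell(cell) {
  return new Set(cell.trim().split(/\s+/).filter(Boolean));
}

// Count each id once per row, no matter where it appears in the cell.
function countValues(cells) {
  const counts = new Map();
  for (const cell of cells) {
    for (const id of parseCell(cell)) {
      counts.set(id, (counts.get(id) || 0) + 1);
    }
  }
  return counts;
}

const counts = countValues(rows);
console.log(counts.get("javascript")); // 3
console.log(counts.get("c")); // 2
console.log(parseCell("c cpp c").size); // 2
```

With this shape, "javascript" is tallied across all three rows that mention it instead of being split across three different string keys.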

Another column for which these criticisms hold is country, which currently has these frequency counts:

  "United States": 862,
  Netherlands: 10,
  Switzerland: 27,
  Canada: 59,
  Japan: 28,
  "United States and United Kingdom": 5,
  "United Kingdom": 99,
  Brazil: 8,
  France: 46,
  Sweden: 17,
  Italy: 20,
  "Argentina and Germany and Turkey": 1,
  Various: 68,
  "United States and France": 3,
  "United States and Spain and France": 2,
  Unknown: 85,
  Germany: 63,
  Norway: 13,
  "New Zealand": 12,
  "United States and United Kingdom and Belarus": 1,
  "Switzerland and Germany and United States": 1,
  Brasil: 1,
  "United Kingdom and Germany": 2,
  Ecuador: 1,
  Argentina: 4,
  "United States and Portugal": 1,
  "United Kingdom and United States and Switzerland": 1,
  Poland: 6,
  "Norway and United States": 1,
  Austria: 4,
  Russia: 11,
  Cananda: 1,
  Australia: 22,
  "Czech Republic and Germany": 1,
  Denmark: 8,
  "United States and New Zealand": 2,
  "Australia and Sweden and United States": 1,
  "Germany and United Kingdom": 2,
  "Scotland, United Kingdom": 2,
  China: 12,
  "The Netherlands": 17,
  Finland: 3,
  Georgia: 1,
  "United States and Germany": 1,
  "Australia and Sweden": 1,
  "China and Japan": 1,
  "Sweden and Japan and United States and France and Germany and Switzerland": 2,
  "Dominican Republic": 2,
  "United States and Ireland and Norway and India": 1,
  "North Cyprus": 1,
  England: 19,
  India: 4,
  "Austria and China": 1,
  "United States and China": 2,
  "Russia and Ukraine amd Lithuania and Serbia": 1,
  "United Kingdom and United States": 2,
  Scotland: 3,
  Slovenia: 5,
  "Denmark and New Zealand and United Kingdom": 1,
  "United Arab Emirates": 1,
  Nigeria: 3,
  Korea: 1,
  Iceland: 1,
  "United States and Germany  and France and Spain": 1,
  "Czech Republic": 4,
  "United States and Kazakhstan": 1,
  "United States and Switzerland": 1,
  "The Netherlands and United States and Turkey and France": 1,
  "European Union": 1,
  "United States and Denmark": 2,
  "Australia and Belgium and France and Sweden and Indonesia": 1,
  Israel: 9,
  Turkey: 2,
  "Czech Republic and New Zealand and United States": 1,
  "United Kingdom and Canada": 1,
  "South Korea": 3,
  "Switzerland and United Kingdom": 1,
  Paraguay: 1,
  "United Kingdom and Switzerland": 1,
  Portugal: 4,
  "United States and Canada": 1,
  "Various countries in Western Europe": 2,
  Taiwan: 1,
  Serbia: 1,
  Ireland: 4,
  "United States and Germany and Canada": 1,
  Belgium: 2,
  "Spain and United States": 1,
  "South Korea and United States": 1,
  " Canada": 1,
  Bulgaria: 1,
  "Japan or Korea": 1,
  Mexico: 1,
  Uruguay: 1,
  "Germany and Spain": 1,
  "Germany and Canada": 1,
  Uzbekistan: 1,
  "Japan and Canada": 1,
  "United States and Finland": 1,
  "Various\t": 1,
  "United States and Spain": 1,
  Thailand: 1,
  Malta: 1,
  "United States and United Kingdom and France": 1,
  "United State": 1,
  "Italy and The Netherlands": 1,
  "Canada and China": 1,
  Slovakia: 1,
  "Japan and Germany and United Kingdom and United States": 1,
  "Spain and Germany": 1,
  "France and Poland": 1,
  "England and Wales": 1,
  "The Netherlands and United Kingdom": 1,
  "United States and Germany and Norway": 1,
  " Czech Republic": 1,
  Hungary: 3,
  Singapore: 2,
  "The Czech Republic": 2,
  "United States and France and Germany and Japan": 1,
  "Canada and Portugal": 1,
  Italia: 1,
  "Romania and Canada": 1,
  Cyprus: 1,
  "Scotland and The Netherlands and United States": 1,
  "United States and Brazil": 1,
  "Spain and Italy": 1,
  "United States and Sweden": 1,
  "Portugal and England": 1,
  Bahrain: 1,
  "Canada and England": 1,
  Greece: 1,
  "Australia and Canada": 1,
  "United States and Israel": 1,
  Spain: 2,
  "Canada and Australia": 1,
  "France and United States": 1,
  "Italy and United Kingdom": 1,
  "Sweden and Netherlands and United Kingdom": 1,
  "Canada and South Africa": 1,
  "The Netherlands and United States": 1,
  "United States and South Korea": 1,
  "Canada and Germany": 2,
  "Taiwan or R.O.C": 1,
  "Australia and Germany": 1

I don't think using "and" xor "or" as separators is a sustainable long-term solution lol
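To make the separator problem concrete, here is a hedged sketch of what cleaning the country column into Sets might involve. The parseCountries and setKey helpers and the ALIASES table are assumptions for illustration only; the alias entries are taken from spellings visible in the list above, and nothing here reflects how PLDB actually stores or parses this column.

```javascript
// Hypothetical alias table built from variant spellings in the list above.
const ALIASES = new Map([
  ["Cananda", "Canada"],
  ["Brasil", "Brazil"],
  ["Italia", "Italy"],
  ["United State", "United States"],
  ["The Netherlands", "Netherlands"],
  ["The Czech Republic", "Czech Republic"],
]);

// Hypothetical parser: split on the word separators, trim stray
// whitespace (" Canada", "Various\t"), then apply the alias table.
function parseCountries(cell) {
  const names = cell
    .split(/\s+(?:and|amd|or)\s+/) // "amd" is a typo that occurs in the data
    .map((name) => name.trim())
    .filter(Boolean)
    .map((name) => ALIASES.get(name) || name);
  return new Set(names);
}

// Order-independent key, so two cells can be compared as sets.
function setKey(set) {
  return [...set].sort().join("|");
}

const a = parseCountries("United Kingdom and Germany");
const b = parseCountries("Germany and United Kingdom");
console.log(setKey(a) === setKey(b)); // true
console.log(parseCountries("Russia and Ukraine amd Lithuania and Serbia").size); // 4
console.log([...parseCountries(" Canada")]); // [ 'Canada' ]
```

Even this sketch has gaps: a cell like "Scotland, United Kingdom" stays as one value, "Taiwan or R.O.C" is split into two, and any country whose name itself contains "and" (for example Trinidad and Tobago) would be split incorrectly. That fragility is an argument for a dedicated delimiter or a real set type rather than word separators.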

breck7 commented 1 year ago

This is a great issue @tif-calin! We can do this upstream in JTree/Grammar so CancerDB.com et al. will benefit too.

ghost commented 1 year ago

npm run test already has tests that check whether each file's title has a valid pldbId. Maybe additional tests could be added to it?
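One shape such an additional check might take is sketched below. It assumes records with an id and a raw space-delimited compilesTo string, plus a Set of known ids; the findUnknownIds function, the record shape, and the sample data (including the deliberate misspelling) are all hypothetical and do not reflect how the existing test suite is organized.

```javascript
// Hypothetical check: report compilesTo values that are not known ids.
function findUnknownIds(records, knownIds) {
  const problems = [];
  for (const record of records) {
    const ids = record.compilesTo.trim().split(/\s+/).filter(Boolean);
    for (const id of ids) {
      if (!knownIds.has(id)) problems.push({ file: record.id, value: id });
    }
  }
  return problems;
}

// Made-up sample data; "javscript" is a deliberate misspelling.
const knownIds = new Set(["javascript", "c", "cpp"]);
const records = [
  { id: "example-a", compilesTo: "javascript" },
  { id: "example-b", compilesTo: "c cpp javscript" },
];

const problems = findUnknownIds(records, knownIds);
console.log(problems); // [ { file: 'example-b', value: 'javscript' } ]
```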

breck7 commented 5 months ago

We now have "computed measures", so it would be very easy to add a column like "numberOfLanguagesThatCompileToThis" and write a tiny JavaScript method to compute it.
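A minimal sketch of what such a method could look like follows. It assumes an in-memory array of records whose compilesTo field has already been parsed into a Set; that shape, the placeholder ids, and the function signature are assumptions for illustration, not PLDB's actual computed-measures API.

```javascript
// Hypothetical records; ids are placeholders, not real PLDB entries.
const languages = [
  { id: "lang-a", compilesTo: new Set(["javascript"]) },
  { id: "lang-b", compilesTo: new Set(["c", "cpp", "objective-c", "javascript"]) },
  { id: "lang-c", compilesTo: new Set(["c"]) },
  { id: "javascript", compilesTo: new Set() },
];

// Count how many records list targetId among their compile targets.
function numberOfLanguagesThatCompileToThis(targetId, allLanguages) {
  return allLanguages.filter((lang) => lang.compilesTo.has(targetId)).length;
}

console.log(numberOfLanguagesThatCompileToThis("javascript", languages)); // 2
console.log(numberOfLanguagesThatCompileToThis("c", languages)); // 2
console.log(numberOfLanguagesThatCompileToThis("wasm", languages)); // 0
```

The count only comes out right if multi-value cells are treated as sets of ids; against the raw strings, "lang-b" would not match "javascript" at all.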

I expect we will soon have a lot more data on relationships between languages, such as compilesTo, and then it might be more worthwhile to add those kinds of computed columns. Closing for now.