breck7 / pldb

PLDB: a Programming Language DataBase
https://pldb.io

Improve list data at read-time #390

Closed breck7 closed 1 year ago

breck7 commented 1 year ago

As @tif-calin and @SRS-WRKS have pointed out, there are a number of places where list columns (e.g. Origin Community, CompilesTo) are handled incorrectly at read time:

https://github.com/breck7/pldb/issues/348

Let's fix this site wide.

breck7 commented 1 year ago

I actually think the way to do it might just be one-offs. Thinking about how I do it in data science, it depends on the analysis I'm doing, but usually I would one-hot encode columns like this. The problem is that we'd then have a giant CSV with 10,000 columns. So perhaps we provide some simple scripts or an NPM/R/PyPI package with methods for quick access to ready-to-go data, depending on the analysis to be done.
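For illustration, here is a rough Python sketch of what a one-off one-hot encoding of a single list column could look like. The function name `one_hot_encode_list_column`, the `compilesTo` column name, the sample rows, and the space separator are all assumptions made for this example, not PLDB's actual schema or export format. Encoding only the column a given analysis needs is what keeps the output from growing to thousands of columns.

```python
def one_hot_encode_list_column(rows, column, sep=" "):
    """Expand one list-valued column into 0/1 indicator columns.

    Hypothetical helper: `rows` is a list of dicts (e.g. from csv.DictReader),
    and the list separator is assumed to be a single space.
    """
    # Collect every distinct non-empty value that appears in the list column.
    values = sorted(
        {v for row in rows for v in (row.get(column) or "").split(sep) if v}
    )
    encoded = []
    for row in rows:
        present = set((row.get(column) or "").split(sep))
        # Copy all other columns through unchanged and drop the list column.
        new_row = {k: v for k, v in row.items() if k != column}
        for value in values:
            new_row[f"{column}_{value}"] = 1 if value in present else 0
        encoded.append(new_row)
    return encoded


# Made-up sample rows, purely for demonstration.
rows = [
    {"id": "a", "compilesTo": "js wasm"},
    {"id": "b", "compilesTo": "c"},
    {"id": "c", "compilesTo": ""},
]
encoded = one_hot_encode_list_column(rows, "compilesTo")
```

A package along these lines could expose one such helper per common analysis, so users only pay for the indicator columns they actually ask for.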