fslaborg / datasets

A data source for example datasets for all kinds of data science
https://fslab.org/datasets
MIT License

Proposal / design sketch for providing easy-to-consume classic datasets #9

Open mathias-brandewinder opened 1 year ago

mathias-brandewinder commented 1 year ago

This recent discussion got me thinking about what a collection of datasets that are easy to use from F# could look like.

My thinking / design direction:

I took a stab at it here. This is a sketch, but hopefully conveys where I would tend to go.

I'd be interested in hearing thoughts on the approach, and assuming there is interest, what changes / adjustments would be needed to get that into this repository instead of mine :) // @kMutagene

One thing I did not do is expose datasets as data frames. I have mixed feelings on the question:

kMutagene commented 1 year ago

I absolutely agree that we can extend this repo to distribute the datasets, however we have to be careful regarding dataset and package licenses. Regarding your approach, it seems like the data is downloaded at least once from GitHub user content? I think we can just distribute it with a NuGet package, e.g. as EmbeddedResources. That could have the disadvantage of a large package, though, depending on how many datasets we include.
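For illustration, a minimal sketch of the embedded-resource idea (the resource name below is hypothetical):

open System.IO
open System.Reflection

let readEmbeddedCsv (resourceName: string) : string =
    // reads a file that was compiled into the package assembly as an <EmbeddedResource> item
    use stream = Assembly.GetExecutingAssembly().GetManifestResourceStream(resourceName)
    use reader = new StreamReader(stream)
    reader.ReadToEnd()

// e.g. readEmbeddedCsv "Datasets.iris.csv" (placeholder resource name)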

One thing I did not do is expose datasets as data frames. I have mixed feelings on the question:

A few thoughts on this coming from my everyday experience:

mathias-brandewinder commented 1 year ago

With some delay, a few reactions / thoughts:

it seems like the data is downloaded at least once from GitHub user content? I think we can just distribute it with a NuGet package, e.g. as EmbeddedResources

Yes, that's why I went the route of "download once and cache it". If the goal is to keep adding datasets over time, it could result in a massive package, and I imagine that you would use the package because you want to use a specific dataset, not all of them. Download-and-cache also means that if a new dataset is added, and a new package is released, whatever dataset you used in the past would still be available, without having to download that dataset again.
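As a rough sketch of that download-once-and-cache approach (the cache folder and the way the URL is handled are placeholders, not the actual implementation):

open System
open System.IO
open System.Net.Http

let cacheFolder =
    Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData), "fslab-datasets")

let downloadAndCache (fileName: string) (url: string) : string =
    Directory.CreateDirectory(cacheFolder) |> ignore
    let localPath = Path.Combine(cacheFolder, fileName)
    // only hit the network the first time; afterwards the local copy is used,
    // so previously used datasets stay available even after new releases
    if not (File.Exists localPath) then
        use client = new HttpClient()
        let contents = client.GetStringAsync(url).Result
        File.WriteAllText(localPath, contents)
    File.ReadAllText(localPath)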

we have to be careful regarding dataset and package license

Absolutely agree. I sketched it out as

type Dataset<'T> =
    abstract member Read: unit -> seq<'T>
    abstract member Source: string
    abstract member License: string

... but I think something like Dataset<'Data,'Metadata> could work, to provide a plug so every dataset can be sourced and properly attributed / licensed.
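For example, one possible shape (the metadata record here is just an illustration, not a settled design):

// hypothetical metadata record carrying attribution alongside the data
type DatasetInfo =
    { Source: string
      License: string
      Description: string }

type Dataset<'Data, 'Metadata> =
    abstract member Read: unit -> seq<'Data>
    abstract member Metadata: 'Metadata

// a concrete dataset would then pair its row type with the metadata type,
// e.g. Dataset<IrisRow, DatasetInfo>, so attribution travels with the data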

Working in the record world means you have to create new types every time.

Interesting, I would like to see an example. Typically I like to keep the record as skinny / small as possible, and I use either functions or methods on the record (via type extensions) for anything composite.
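To illustrate the style I mean, a small sketch with a made-up record:

type IrisRow =
    { SepalLength: float
      SepalWidth: float
      PetalLength: float
      PetalWidth: float
      Species: string }

// derived values live in functions...
let sepalArea (row: IrisRow) = row.SepalLength * row.SepalWidth

// ...or in type extensions, so the record itself stays skinny
type IrisRow with
    member this.PetalArea = this.PetalLength * this.PetalWidth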

I could see this as an issue if you wanted to, say, append a value to each record that is the average for a group - which is not something you'd want to dynamically recompute every time you need it. That being said, in that case I tend to not make a wider record, but rather create a map of the values by group, and access the group aggregates by record key.
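A sketch of that map-of-aggregates approach, reusing the made-up record above:

let averagePetalLengthBySpecies (rows: seq<IrisRow>) : Map<string, float> =
    rows
    |> Seq.groupBy (fun r -> r.Species)
    |> Seq.map (fun (species, group) ->
        species, group |> Seq.averageBy (fun r -> r.PetalLength))
    |> Map.ofSeq

// look up the aggregate by the record's own key instead of widening the record:
// let avg = averages.[row.Species]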

If you are working with datasets that can have missing values, everything has to be an Option, leading to very awkward records

Curious now, how do you go about handling/representing missing data, if not with an option?

kMutagene commented 1 year ago

I think that, in this discussion in general, I might be falling into the trap of thinking that my workflow for things like this is the standard. That might very well not be the case, so let me get a little more verbose before there are misunderstandings:

To understand where I am coming from here, it is necessary to get a little into the types of data I usually work with:

There are in general two large categories of data that I work with in computational biology:

  1. A table that contains measurements for biological entities (e.g. genes, proteins) across multiple measurements, where a single measurement for all observed entities is a column. These tables often have hundreds of columns, because each measurement is typically replicated and repeated several times. Also, due to the nature of the measurements, there is often missing data, e.g. because of the limited time resolution of the detector technology.

  2. Multiple metadata annotation sources for a measurement or biological entity, which I often need to aggregate for subsequent feature extraction. The individual datasets might have few columns, but they are usually very different, and the aggregated dataset has again dozens to hundreds of columns. Also, because available scientific knowledge is often limited, not every annotation source has entries for each target key, so we again have potential missing values in each column.

So you see that my datasets usually

  1. are quite large in both column and row dimensions (for reference, I currently work with a 15k x 45k measurement matrix for a large-scale meta-analysis), and
  2. always have missing values

And I

  1. Often add large numbers of columns to an existing dataset (which in the record type world is equivalent to adding fields)

So while records are an easy way of modelling small, well defined datasets, I often prefer using dataframes, because I do not have to do the type modelling beforehand.

Also, missing values are just less hassle to work with in dataframes, because the issue is deferred until you actually want to work with them. With record type modelling, you would again have to know each column that has missing values to correctly model and parse the dataset BEFORE actually taking a look at it and exploring it. Missing value handling only becomes an issue when you actually want to do something with the data, at which point you have functions for handling them, just the same as with Option<'T>.
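As an illustration of that workflow, a small sketch assuming Deedle for the dataframe side and a placeholder CSV with missing cells:

#r "nuget: Deedle"
open Deedle

// missing cells simply stay missing in the frame; no Option-heavy record is needed up front
let frame = Frame.ReadCsv "measurements.csv"

// missing values only have to be dealt with at the point of use
let intensities : Series<int, float> = frame?Intensity
let meanIntensity = intensities |> Series.dropMissing |> Stats.mean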

So let me try to condense my points to a few sentences:

smoothdeveloper commented 1 year ago

@mathias-brandewinder, how do you deal with datasets that have, say, several thousand columns? And how do you bring data from the "chaos emergent data soup verse" to the "tidy verse" to the "sleek verse" if you can't explore and process the data without being forced to have everything explicitly typed upfront?

I feel the kind of examples used for records work well for datasets that have just a few dozen properties, or when the gap between the data science workload and the domain code is small.

I've had a brief exposure to python/pandas/numpy/scikit-learn and overall I feel it is more flexible for data exploration, and also for assembling massive datasets with little churn, even though it doesn't lead to a model that feels like native F#.

I'd never do what I'm currently working on (many thousands of columns) by designing records and composing them upfront while everything about what the model will be fed with is still up for debate. The approach you suggest is good when the model and ML workloads are close to domain logic and other systems, but when it mostly feels like "funnelling data into a pile of linear algebra" (https://xkcd.com/1838/) until something looks right, it may not be productive to force upfront design and static typing; it mostly depends on the nature of the dataset and the processing.

I'm just in the early stages of practice, but this became apparent, and I'm someone with a strong lean towards static type checking, etc.

Regarding download-and-cache (only pay for what you use) versus a NuGet package, I see pros and cons; I'm geared towards:

This technique requires some infrastructure, but it allows people to easily modify the data locally (as they pick the folder), and it works out of the box just by referencing the NuGet package, even offline, once the package is cached.
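A minimal sketch of that override-folder idea, with the packaged resource as fallback (all names are placeholders):

open System.IO
open System.Reflection

let tryLoadDataset (overrideFolder: string) (fileName: string) (resourceName: string) : string =
    let overridePath = Path.Combine(overrideFolder, fileName)
    if File.Exists overridePath then
        // a locally modified copy in the user-chosen folder wins
        File.ReadAllText overridePath
    else
        // otherwise fall back to the file shipped inside the NuGet package as an embedded resource
        use stream = Assembly.GetExecutingAssembly().GetManifestResourceStream(resourceName)
        use reader = new StreamReader(stream)
        reader.ReadToEnd()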

I don't feel we want the datasets to turn into something large, and for such large datasets, which I'm not ruling out, we can also provide them in the way you propose.

Since each dataset will have a bit of F# types / data definitions around it, it makes sense for the NuGet package to be self-contained; NuGet will deal with packaging and caching, with no code on our side.

Effort can then be put into the override-folder infrastructure and, most importantly, into the metadata and QA infrastructure of the repository, to make it easy to add new datasets and to prevent situations where an embedded resource and the F# types / data definitions get out of sync.

I think having a metadata-driven infrastructure, with code generation, rather than writing Dataset<'Data,'Metadata> types ourselves, is also key for the long-term maintenance and evolution of the repository.
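As a rough illustration of what such metadata could look like (purely hypothetical field names):

// a spec like this could drive code generation and QA checks
type ColumnSpec =
    { Name: string
      DataType: string        // e.g. "float" or "string"
      AllowsMissing: bool }

type DatasetSpec =
    { Id: string
      Source: string
      License: string
      Columns: ColumnSpec list }

// a generator could emit a record type and a typed Read function per DatasetSpec,
// and a QA step could verify each embedded file against its spec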