daphne-eu / daphne

DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
Apache License 2.0

readFrame :: using string as column data type #834

Closed: auge closed this issue 1 month ago

auge commented 1 month ago

Is it currently possible to load a data frame with string as a column type, and if so, how?

According to https://daphne-eu.github.io/daphne/FileMetaDataFormat/, only numeric value types seem to be supported.

Could we have support for valueType: string?

data.csv

Algiers,3.4
St. John's,4.3
Dodoma,26.3
Toliara,17.0
Yellowknife,4.0
Batumi,24.5
Istanbul,31.9
Tampa,41.6
Gjoa Haven,-1.3
Paris,18.2

data.csv.meta

{
  "numRows": 10,
  "numCols": 2,
  "schema": [
    {
      "label": "city",
      "valueType": "string"
    },
    {
      "label": "temperature",
      "valueType": "f64"
    }
  ]
}

DAPHNE script:

path = "data.csv";
data = readFrame(path);
print(data);

pdamme commented 1 month ago

Hi @auge, it is currently not possible to read string data from files, but this feature is already WIP and will be added soon.

PR #797, which is about to be finalized and merged, will bring support for reading CSV files into matrices of string value type. These matrices (or individual columns) can then be processed with some basic string operations (e.g., concatenation, lower/upper case) or converted to numerical data (e.g., through number parsing, dictionary coding, or one-hot encoding).

As a follow-up, we're already working on support for frames with string columns and reading CSV files with string columns into a frame directly.

In the meantime, a workaround is to convert the string data to numbers with an external tool or script (e.g., through dictionary coding) and to read only the numerical data into DAPHNE.
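
A minimal sketch of the dictionary-coding workaround as an external Python script, assuming the string column is the first column of a headerless CSV file as in data.csv above. The function names `dictionary_encode` and `encode_csv_text` are hypothetical helpers written for this illustration and are not part of DAPHNE. Codes are assigned in order of first appearance.

```python
import csv
import io


def dictionary_encode(rows, col):
    """Replace the strings in column `col` with integer codes.

    Codes are assigned in order of first appearance. Returns the encoded
    rows and the string -> code dictionary (keep it to decode results later).
    """
    codes = {}
    encoded = []
    for row in rows:
        # len(codes) is evaluated before the insert, so a new string
        # receives the next unused integer code.
        code = codes.setdefault(row[col], len(codes))
        encoded.append(row[:col] + [code] + row[col + 1:])
    return encoded, codes


def encode_csv_text(text, col=0):
    """Dictionary-encode one column of headerless CSV text (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(text)))
    encoded, codes = dictionary_encode(rows, col)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(encoded)
    return out.getvalue(), codes


# First three rows of the data.csv from the question.
sample = "Algiers,3.4\nSt. John's,4.3\nDodoma,26.3\n"
numeric_csv, city_codes = encode_csv_text(sample)
print(numeric_csv)
print(city_codes)
```

The encoded text contains only numbers and can be written to a new CSV file. Its meta data file would then have to declare a numeric value type for the encoded column instead of a string type.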

pdamme commented 1 month ago

This issue is closed now. In the metadata file, one has to specify "valueType": "str" for string columns. The documentation will be updated shortly.
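
For illustration, here is a small Python sketch that generates the meta data for the example from the question with "str" as the value type of the city column. The keys (`numRows`, `numCols`, `schema`, `label`, `valueType`) are taken from the data.csv.meta shown above. The helper name `frame_meta` is hypothetical and not part of DAPHNE.

```python
import json


def frame_meta(num_rows, columns):
    """Build the JSON text of a frame meta data file.

    `columns` is a list of (label, valueType) pairs, one per column.
    """
    meta = {
        "numRows": num_rows,
        "numCols": len(columns),
        "schema": [
            {"label": label, "valueType": value_type}
            for label, value_type in columns
        ],
    }
    return json.dumps(meta, indent=2)


# Same schema as in the question, but with "str" instead of "string".
meta_text = frame_meta(10, [("city", "str"), ("temperature", "f64")])
print(meta_text)
```

Saving the printed text as data.csv.meta next to data.csv should give the meta data file described in the comment above.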