Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0

readDelimiter variant for Regex as delimiter #746

Open dave08 opened 1 week ago

dave08 commented 1 week ago

Since this function is specifically for reading delimited data, it might be useful to have an overload that takes a Regex as the delimiter. This could be used for command-line output tables that are usually space-separated but sometimes have a single space inside a column value, so I need to use "\s\s+" to read them correctly.

koperagen commented 1 week ago

Hi. The library we're using now only has String and Char options for the delimiter. Is your file a CSV/TSV, or just a plain txt with some special format you want to parse?

dave08 commented 1 week ago

Say I have (output from kubectl get namespaces):

NAME                     STATUS   AGE      LABELS
argo-events              Active   2y77d    app.kubernetes.io/instance=argo-events,kubernetes.io/metadata.name=argo-events
argo-workflows           Active   2y77d    app.kubernetes.io/instance=argo-workflows,kubernetes.io/metadata.name=argo-workflows
argocd                   Active   5y18d    kubernetes.io/metadata.name=argocd
beta                     Active   4y235d   kubernetes.io/metadata.name=beta

Then I have multiple spaces as delimiters...

In some command line outputs, I have two words in one column:

NAME                                                                     CLUSTER        CDS        LDS        EDS        RDS          ECDS         ISTIOD                             VERSION
foo-5fcd67944f-2t97k.dev                                           Kubernetes     SYNCED     SYNCED     SYNCED     SYNCED       NOT SENT     istiod-1-18-7-dbcdbb5f4-nth9n      1.18.7
foo-6f8bf4c9b9-qrwf9.prod                                          Kubernetes     SYNCED     SYNCED     SYNCED     SYNCED       NOT SENT     istiod-1-16-7-6d46d45875-gxtzw     1.16.7

Like that NOT SENT... that's where a regex can help here. It's not just tabs, it's a bunch of spaces.
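
A minimal sketch of the kind of splitting meant here, using only the Kotlin standard library. Nothing in it is an existing DataFrame API: `parseSpaceTable` is a hypothetical helper name, and the sample is a trimmed-down version of the table above. It treats two or more whitespace characters as the column separator, so a single space inside a value such as NOT SENT is kept:

```kotlin
// Hypothetical helper, not part of the DataFrame library.
// Splits every non-blank line on runs of two or more whitespace characters
// and pairs each cell with the corresponding header cell.
fun parseSpaceTable(
    text: String,
    delimiter: Regex = Regex("""\s{2,}"""),
): List<Map<String, String>> {
    val lines = text.lines().filter { it.isNotBlank() }
    if (lines.isEmpty()) return emptyList()
    val header = lines.first().trim().split(delimiter)
    return lines.drop(1).map { line ->
        // zip stops at the shorter list, so surplus cells are dropped
        header.zip(line.trim().split(delimiter)).toMap()
    }
}

fun main() {
    val output = """
        NAME                          CLUSTER        ECDS         VERSION
        foo-5fcd67944f-2t97k.dev      Kubernetes     NOT SENT     1.18.7
        foo-6f8bf4c9b9-qrwf9.prod     Kubernetes     NOT SENT     1.16.7
    """.trimIndent()

    val rows = parseSpaceTable(output)
    println(rows[0]["ECDS"])     // NOT SENT
    println(rows[1]["VERSION"])  // 1.16.7
}
```

This approach shifts columns if a cell is empty or if a value itself contains two consecutive spaces; for such output, slicing each line at the header's column offsets would probably be more robust.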

Also, how would you parse Markdown tables (or similar)...? Unless the library trims all those extra spaces... but I guess with Markdown there might be more complications than just a delimiter.

koperagen commented 1 week ago

Good questions indeed. I think such tables should be parsed by readDelimStr in the future. For now I can only suggest something like this for Markdown:

// Strip the outer pipes, split on the inner ones, and trim each cell.
fun String.markdownCells() = trim('|').split("|").map { it.trim() }

val s = """
| Month    | Savings |
| -------- | ------- |
| January  | $250    |
| February | $80     |
| March    | $420    |""".trimIndent()

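val lines = s.lineSequence()
// Skip the header and separator rows, then split the single `value` column into one column per header cell.
// The `value` accessor assumes generated extension properties (e.g. in a Kotlin Notebook).
lines.drop(2).toList().toDataFrame().split { value }.by { it.markdownCells() }.into(lines.first().markdownCells())

dave08 commented 1 week ago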

I think that's a bit of an advanced technique for most people with this kind of use case... and it involves parsing in two steps...

I wonder if some kind of readDSL would be better here... it could possibly work by line and give helpers for extracting the titles and values?

koperagen commented 1 week ago

Please share the desired API or usage examples that you have in mind. Maybe something like this could be added.