GetDKAN / dkan

DKAN Open Data Portal
https://dkan.readthedocs.io/en/latest/index.html
GNU General Public License v2.0

Support CSV dialects #3864

Open dafeder opened 1 year ago

dafeder commented 1 year ago

There are a lot of permutations of CSV out there, from TSV to semicolon-delimited files to different escaping methods. Both DKAN's native CSV parser and the MySQL LOAD DATA importer can be configured to support most of these permutations, but there is no easy way to do this in DKAN, on either a per-resource or a system-wide level.

The Frictionless Data project has a spec designed to address just this issue: CSV Dialect. We should explore ways to support different dialects in importers, and figure out the most efficient way to tell the importer which dialect to use on a per-resource basis.
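
For illustration, here is a minimal sketch of how a CSV Dialect descriptor could be translated into parser settings, using PHP's built-in str_getcsv() rather than DKAN's own parser. The helper name parseWithDialect() is hypothetical, and only the delimiter, quoteChar and escapeChar properties of the spec are handled; this is not how DKAN works today.

```php
<?php

/**
 * Parse CSV text according to a CSV Dialect descriptor.
 *
 * Hypothetical helper for illustration only; not part of DKAN.
 * Lines are split naively, so quoted fields containing line breaks
 * are not supported in this sketch.
 */
function parseWithDialect(string $csv, array $dialect): array {
  // Fall back to a comma delimiter and double-quote quoting
  // when the descriptor does not set them.
  $delimiter = $dialect['delimiter'] ?? ',';
  $quoteChar = $dialect['quoteChar'] ?? '"';
  // An empty string disables PHP's own escape handling (PHP 7.4+).
  $escapeChar = $dialect['escapeChar'] ?? '';

  $rows = [];
  foreach (preg_split('/\r\n|\n|\r/', trim($csv)) as $line) {
    $rows[] = str_getcsv($line, $delimiter, $quoteChar, $escapeChar);
  }
  return $rows;
}

// A descriptor for a semicolon-delimited file.
$dialect = json_decode('{"delimiter": ";", "quoteChar": "\""}', true);
$rows = parseWithDialect("name;price\n\"Brot; dunkel\";2,50\n", $dialect);
```

With this descriptor, the quoted semicolon and the decimal comma in the second row stay inside their fields instead of being treated as separators.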

stefan-korn commented 3 months ago

@dafeder : This is of interest to us. In Germany the delimiter is usually a semicolon, and even MS Excel and the like use the semicolon delimiter by default in German versions.

Am I right that currently the delimiter is hardcoded in this place: https://github.com/GetDKAN/dkan/blob/2.x/modules/datastore/src/Service/ImportService.php#L166

And right now there is no configuration option for this?

Regarding "figure out the most efficient way to communicate which dialect to use to the importer on a per-resource basis":

Do you already have something in mind? Would extending the distribution schema with an optional field that defines the CSV dialect be a viable option from your point of view?

dafeder commented 3 months ago

We are in a tricky spot because we are trying to stay as close to DCAT as possible, but this is kind of outside the scope of DCAT. I think as a stopgap we should figure out some relatively straightforward way to override that hardcoded value, but it may be that a better solution is to have a system outside of the metastore completely for storing file resources, perhaps as part of the datastore, and decouple that as much as possible from the metastore. This is sort of already the case but Resources are basically just a URL and a timestamp at the moment.

dafeder commented 3 months ago

Also, there is a way to do this, sort of, with event listeners. The ImportService::EVENT_CONFIGURE_PARSER event would allow you to change the delimiter character, but you would need to define all your conditional logic there. Will make a note to document this in a recipe, but something like:

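// Register the handler for the parser-configuration event.
$events[ImportService::EVENT_CONFIGURE_PARSER][] = ['set'];

[...]

// Switch the parser's delimiter to a semicolon for every import.
public function set(Event $event) {
    $parserConfiguration = $event->getData();
    $parserConfiguration['delimiter'] = ';';
    $event->setData($parserConfiguration);
}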

h/t @janette

stefan-korn commented 3 months ago

@dafeder : Thanks a lot for the hint. This works nicely to change the delimiter to a semicolon globally. I had missed that. I still often look only for the good old hooks and forget about the new Symfony events ... By the way, and off-topic: is there a documentation standard for events like there is for hooks? I searched a bit and could not find anything useful.

Regarding conditional logic: does the event get only the parser configuration as data? Then I have no information about which resource is being parsed. Or am I missing something again?

dafeder commented 3 months ago

I think you're right, it's basically all or nothing, sorry to lead you astray there. And yeah, documenting those events has been on our to-do list for a long time now, this is a good reminder.

stefan-korn commented 3 months ago

fyi #4176