datahq / dataflows

DataFlows is a simple, intuitive, lightweight framework for building data processing flows in Python.
https://dataflows.org
MIT License

Update schema with missing values #154

Open gperonato opened 3 years ago

gperonato commented 3 years ago

I have a source dataset whose missing values are encoded as `NULL`. In my flow, I use:

`update_schema(None, missingValues=["NULL"])`

The resulting `datapackage.json` has the `missingValues` field set as above, while the dumped files have empty fields (if I use CSV) or `null` (if I use JSON). As a result, I cannot parse the dumped file using the `datapackage.json`, since its schema corresponds to the original source file.

Is this the expected behavior? Or is there another way of dealing with missing values? I am sorry, this is probably a basic understanding question. Hope that someone can help.
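For context, a minimal sketch of the kind of flow being described (the file names are illustrative, not from the original report):

```python
from dataflows import Flow, load, update_schema, dump_to_path

Flow(
    # hypothetical source file whose missing values appear as the string "NULL"
    load('source.csv'),
    # mark "NULL" as a missing value on all resources, as in the report above
    update_schema(None, missingValues=['NULL']),
    # writes out/datapackage.json plus the dumped data file(s)
    dump_to_path('out'),
).process()
```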

akariv commented 3 years ago

You are right - I think the correct behaviour here should be to clear the `missingValues` field in the schema prior to writing the `datapackage.json` file. wdyt?
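For illustration, one possible user-side sketch in that spirit, using the standard dataflows package-processor pattern: let the rows be parsed with the custom `missingValues`, then rewrite the descriptor just before dumping. This is untested, the step name is illustrative, and it assumes the rows have already been cast by the time the dumper runs:

```python
from dataflows import Flow, load, update_schema, dump_to_path

def clear_missing_values(package):
    # Standard dataflows package processor: rewrite only the descriptor,
    # leaving the (already parsed) rows untouched.
    for resource in package.pkg.descriptor['resources']:
        resource['schema']['missingValues'] = ['']  # Table Schema default
    yield package.pkg
    yield from package

Flow(
    load('source.csv'),
    update_schema(None, missingValues=['NULL']),
    clear_missing_values,
    dump_to_path('out'),
).process()
```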

gperonato commented 3 years ago

I was considering leaving the `datapackage.json` unchanged (i.e., with the updated schema) and preserving the `missingValues` in the dumped files. This is because, if I update the schema in my flow, I'd like to see that change in the output `datapackage.json`. The disadvantage, however, would be non-standard `missingValues` in the dumped files. So your approach probably gives a cleaner result. Either way would be fine, as long as it produces a schema that allows parsing the dumped file.
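Whichever behaviour is adopted, a quick round-trip check could look like this (paths illustrative, continuing from the dump above): loading the dumped `datapackage.json` back should only parse cleanly once the written schema matches the written data.

```python
from dataflows import Flow, load, printer

Flow(
    # load the dumped package via its descriptor
    load('out/datapackage.json'),
    printer(),
).process()
```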