OSeMOSYS / otoole

OSeMOSYS Tools for Energy
https://otoole.readthedocs.io
MIT License
23 stars 17 forks source link

Whitespace removal on reading in of data #146

Closed trevorb1 closed 1 year ago

trevorb1 commented 1 year ago

THIS IS A DRAFT PR

Overview

Hi! I was chatting with @tniet the other day about a workflow that is breaking due to whitespace in a CSV file (see example below). We quickly discussed using otoole to strip out whitespace when writing to the datafile. However, thinking about this solution a little more, Im not sure this functionality should be included in otoole.

When whitespace is included in the input data, otoole still behaves as expected; it converts the data without raising an error. If we start using otoole to format input data (even just striping out whitespace), this seems outside otoole's scope of work - which my understanding is just data conversion, data visualizaion, and data validation through a config file.

With that said, I have pushed a quick (draft) fix of what a solution could look like to spur discussion around this topic πŸ™‚ My opinion is that we should close this PR, though. Instead we should direct the user to use the validation functionality of otoole and/or fix the whitspace in their own data processing? Do you have any thoughts on this @tniet, @willu47, or others?

If I am in the minority here though, I am happy to clean up the PR (update logic to fix failing tests, add new test, ect... ) and merge it! πŸ™‚

Example

Current Functionality

If a CSV is formatted as follows (notice the white space in front of the ETH)

REGION,FUEL,YEAR,VALUE
SIMPLICITY, ETH,2014,1.0
SIMPLICITY, ETH,2015,1.03
SIMPLICITY, ETH,2016,1.061
SIMPLICITY, ETH,2017,1.093
SIMPLICITY, ETH,2018,1.126

Running the command otoole convert csv datafile ..., the following datafile will be created

param default 0 : AccumulatedAnnualDemand :=
SIMPLICITY " ETH" 2014 1
SIMPLICITY " ETH" 2015 1.03
SIMPLICITY " ETH" 2016 1.061
SIMPLICITY " ETH" 2017 1.093
SIMPLICITY " ETH" 2018 1.126

If you try to build this file, you get the following error, since " ETH" is not defined.

osemosys_fast_1.0.1.txt:390: AccumulatedAnnualDemand[SIMPLICITY,' ETH',2014] out of domain
MathProg model processing error

Proposed Functionality

If a CSV is formatted as follows (notice the white space in front of the ETH)

REGION,FUEL,YEAR,VALUE
SIMPLICITY, ETH,2014,1.0
SIMPLICITY, ETH,2015,1.03
SIMPLICITY, ETH,2016,1.061
SIMPLICITY, ETH,2017,1.093
SIMPLICITY, ETH,2018,1.126

Running the command otoole convert csv datafile ..., the following datafile will be created

param default 0 : AccumulatedAnnualDemand :=
SIMPLICITY ETH 2014 1
SIMPLICITY ETH 2015 1.03
SIMPLICITY ETH 2016 1.061
SIMPLICITY ETH 2017 1.093
SIMPLICITY ETH 2018 1.126

This datafile will build fine with the glpsol command.

Thanks all!

tniet commented 1 year ago

Hi @trevorb1, @willu47 - Interesting. Looking up the CSV format on Wikipedia (https://en.wikipedia.org/wiki/Comma-separated_values) it's a bit ambiguous as to what the official CSV standard is, and if white space should be removed (excerpt below). However, in any case, otoole should NOT add quote marks to the field as it is doing. Without the quote marks things will work fine with or without the extra space...

From my perspective, OSeMOSYS won't ever allow fields with spaces nor with quotes so I believe that otoole either needs to throw an error or, if the formatting is not ambiguous, strip the white space (preferred). I note that otoole formats numbers correctly and imports " 2016" without the space so it is already formatting the input data and considering spaces in the csv, just not for strings.

From https://en.wikipedia.org/wiki/Comma-separated_values:

In some CSV implementations[which?], leading and trailing spaces and tabs are trimmed (ignored). Such trimming is forbidden by RFC 4180, which states "Spaces are considered part of a field and should not be ignored." 1997, Ford, E350 not same as 1997,Ford,E350 According to RFC 4180, spaces outside quotes in a field are not allowed; however, the RFC also says that "Spaces are considered part of a field and should not be ignored." and "Implementers should 'be conservative in what you do, be liberal in what you accept from others' (RFC 793, section 2.10) when processing CSV files." 1997, "Ford" ,E350 In CSV implementations that do trim leading or trailing spaces, fields with such spaces as meaningful data must be quoted. 1997,Ford,E350," Super luxurious truck "

trevorb1 commented 1 year ago

Thanks for the reply, @tniet! Looking at the the CSV guidelines is a good idea that I definitively hadn't thought of! But I agree, seems to be a little ambiguous haha!

To play devils advocate here a little bit πŸ˜…

At the end of the day, I agree with you that we should discourage the use of spaces! However, I don't know if I'm convinced yet that it is otoole's job to make that decision for the modeller; as not using spaces (leading/training or just in general) is a convention and not a requirement.

Maybe a good compromise here is passing in a flag that will strip leading/trailing whitespace from datafile? Something like below. Note that I think we should only remove leading/trailing spaces, not spaces within the set definitions (ie. a fuel called ELC DEM would be represented as "ELC DEM" in the datafile).

otoole convert csv datafile <csv_folder/> <data_file.txt> config.yaml --strip_whitespace

I am hesitant to do this the other way (ie. have the flag be --keep_whitespace) as unexpected white space can also be a indication to the researcher to investigate why their workflow is adding white spaces in the the first place. I do agree with you @tniet, that having a logger warning on this can be helpful!

What are your thoughts, @tniet, @willu47 and others?! There is no issue if I am outnumbered and we decide to just move forward with the PR as is ☺️. I might just be being difficult and overthinking this haha

tniet commented 1 year ago

@trevorb1 - I'm pretty open to either choice. That being said, I agree that removing white space WITHIN a string would be bad practice (ET H should become "ET H" for example). Removing white space before or after a string seems less controversial. So, I don't mind either way but definitely should have a warning if any strings have leading or trailing spaces. Option to remove them would be ideal, either as default (with option to keep) or as default to keep with option to remove (maybe something to specify in the config file?).

trevorb1 commented 1 year ago

Chatting quickly with @willu47 about this issue, he was leaning towards your solution, @tniet! So I think the proposed solution here is to implement a --keep_whitespace flag. If no flag is given the leading/trailing whitespace is automatically stripped. If the flag is provided, then any leading/trailing whitespace that exists will be kept. Thanks all!

tniet commented 1 year ago

Sounds great! And, in parallel, I've changed my code to not add the spaces as well, so fixed from both sides.