Closed trevorb1 closed 1 year ago
Hi @trevorb1, @willu47 - Interesting. Looking up the CSV format on Wikipedia (https://en.wikipedia.org/wiki/Comma-separated_values) it's a bit ambiguous as to what the official CSV standard is, and if white space should be removed (excerpt below). However, in any case, otoole should NOT add quote marks to the field as it is doing. Without the quote marks things will work fine with or without the extra space...
From my perspective, OSeMOSYS won't ever allow fields with spaces nor with quotes so I believe that otoole either needs to throw an error or, if the formatting is not ambiguous, strip the white space (preferred). I note that otoole formats numbers correctly and imports " 2016" without the space so it is already formatting the input data and considering spaces in the csv, just not for strings.
From https://en.wikipedia.org/wiki/Comma-separated_values:
In some CSV implementations[which?], leading and trailing spaces and tabs are trimmed (ignored). Such trimming is forbidden by RFC 4180, which states "Spaces are considered part of a field and should not be ignored." 1997, Ford, E350 not same as 1997,Ford,E350 According to RFC 4180, spaces outside quotes in a field are not allowed; however, the RFC also says that "Spaces are considered part of a field and should not be ignored." and "Implementers should 'be conservative in what you do, be liberal in what you accept from others' (RFC 793, section 2.10) when processing CSV files." 1997, "Ford" ,E350 In CSV implementations that do trim leading or trailing spaces, fields with such spaces as meaningful data must be quoted. 1997,Ford,E350," Super luxurious truck "
Thanks for the reply, @tniet! Looking at the the CSV guidelines is a good idea that I definitively hadn't thought of! But I agree, seems to be a little ambiguous haha!
To play devils advocate here a little bit π
ETH
is changed to " ETH"
), the model will still build and give results. Therefore, I actually do think otoole
should be adding the quotes if spaces are identified in a string. v1.0
release of otoole
, the functionality can easily be extended to work with any MathProg file, not just OSeMOSYS files. So while within the OSeMOSYS community it may be encouraged not to use spaces, it can potentially limit the further use of otoole
. (I do admit this is a bit of a long shot though, cause Im not sure how many people have workflows that use MathProg files haha π) " 2016"
) in your example is due to otoole
explicitly defining data types. Since int
s can't have spaces, pandas will strip out the white space (maybe @willu47 can confirm this though). In the v1.0
release, all data types are defined by the user in a config file. At the end of the day, I agree with you that we should discourage the use of spaces! However, I don't know if I'm convinced yet that it is otoole
's job to make that decision for the modeller; as not using spaces (leading/training or just in general) is a convention and not a requirement.
Maybe a good compromise here is passing in a flag that will strip leading/trailing whitespace from datafile? Something like below. Note that I think we should only remove leading/trailing spaces, not spaces within the set definitions (ie. a fuel called ELC DEM
would be represented as "ELC DEM"
in the datafile).
otoole convert csv datafile <csv_folder/> <data_file.txt> config.yaml --strip_whitespace
I am hesitant to do this the other way (ie. have the flag be --keep_whitespace
) as unexpected white space can also be a indication to the researcher to investigate why their workflow is adding white spaces in the the first place. I do agree with you @tniet, that having a logger warning on this can be helpful!
What are your thoughts, @tniet, @willu47 and others?! There is no issue if I am outnumbered and we decide to just move forward with the PR as is βΊοΈ. I might just be being difficult and overthinking this haha
@trevorb1 - I'm pretty open to either choice. That being said, I agree that removing white space WITHIN a string would be bad practice (ET H should become "ET H" for example). Removing white space before or after a string seems less controversial. So, I don't mind either way but definitely should have a warning if any strings have leading or trailing spaces. Option to remove them would be ideal, either as default (with option to keep) or as default to keep with option to remove (maybe something to specify in the config file?).
Chatting quickly with @willu47 about this issue, he was leaning towards your solution, @tniet! So I think the proposed solution here is to implement a --keep_whitespace
flag. If no flag is given the leading/trailing whitespace is automatically stripped. If the flag is provided, then any leading/trailing whitespace that exists will be kept. Thanks all!
Sounds great! And, in parallel, I've changed my code to not add the spaces as well, so fixed from both sides.
THIS IS A DRAFT PR
Overview
Hi! I was chatting with @tniet the other day about a workflow that is breaking due to whitespace in a CSV file (see example below). We quickly discussed using
otoole
to strip out whitespace when writing to the datafile. However, thinking about this solution a little more, Im not sure this functionality should be included inotoole
.When whitespace is included in the input data,
otoole
still behaves as expected; it converts the data without raising an error. If we start using otoole to format input data (even just striping out whitespace), this seems outsideotoole
's scope of work - which my understanding is just data conversion, data visualizaion, and data validation through a config file.With that said, I have pushed a quick (draft) fix of what a solution could look like to spur discussion around this topic π My opinion is that we should close this PR, though. Instead we should direct the user to use the validation functionality of
otoole
and/or fix the whitspace in their own data processing? Do you have any thoughts on this @tniet, @willu47, or others?If I am in the minority here though, I am happy to clean up the PR (update logic to fix failing tests, add new test, ect... ) and merge it! π
Example
Current Functionality
If a CSV is formatted as follows (notice the white space in front of the
ETH
)Running the command
otoole convert csv datafile ...
, the following datafile will be createdIf you try to build this file, you get the following error, since
" ETH"
is not defined.Proposed Functionality
If a CSV is formatted as follows (notice the white space in front of the
ETH
)Running the command
otoole convert csv datafile ...
, the following datafile will be createdThis datafile will build fine with the
glpsol
command.Thanks all!