eddelbuettel / rcpptoml

Rcpp Bindings to C++ parser for TOML files
GNU General Public License v2.0
36 stars 9 forks source link

Handling of special characters (line breaks etc..) #26

Closed vh-d closed 6 years ago

vh-d commented 6 years ago

I am trying to understand why parsing of TOML files with multi-line strings such as this

value = '''
Hellow
world!
'''

yields escaped special characters:

List of 1
 $ value: chr "Hellow\\nworld!\\n"

Can't we get this?

List of 1
 $ value: chr "Hellow\nworld!\n"

Am I missing something? Thanks

eddelbuettel commented 6 years ago

What does the TOML spec say?

eddelbuettel commented 6 years ago

You may have a point. When I use the upstream code as a standalone command-line parser, I get:

edd@rob:~/git/cpptoml/examples(master)$ ./parse_stdin_2018-10-22 < /tmp/multiline.toml
{"value":{"type":"string","value":"Hellow\nworld!\n"}}
edd@rob:~/git/cpptoml/examples(master)$

I'll look into it. I already added another missing element on the commute in this morning: plain 'time' input (ie HH:MM:SS) for R has no native support will now be returned (rather than dropped) as a string.

vh-d commented 6 years ago

To me, it looks like escaping special characters was supposed only for printing verbose output (for debugging) in printValue() but it somehow made its way to getValue() as well.

This easy change solves the issue for me. But it is a breaking change of course.

eddelbuettel commented 6 years ago

Yes, it is likely something like that. I have to think if there are other cases of escaped chars we would need to protect. And that protect might be better done at the R accessor ...

Can you take a peek at the TOML spec if it says something about special chars and escapes?

eddelbuettel commented 6 years ago

It seems it is all over cpptoml as well from where I probably took it.

Could you possibly deal with it at the R level?

> RcppTOML::parseTOML("/tmp/twolines.toml")
List of 1
 $ value: chr "hello\\nworld!\\n"
R>
R> txt <- RcppTOML::parseTOML("/tmp/twolines.toml")
R>
R> cat( gsub('\\\\n', '\n', txt[[1]]) )
hello
world!
R>
vh-d commented 6 years ago

Sure that's what I was doing until I stopped wondering why would anyone prefer the current way :-)

I was worried that I would have to handle various cases besides "\n" -> "\n". And when I failed with generic gsub('\\\\', '\\', value) which obviously cannot work I came here :-)

But it seems there are only 3 cases ('\n', '"' and '\') so its relatively small issue.

vh-d commented 6 years ago

Still it may be worth it to have some RcppToml::parseTOML2() without the escaping.

vh-d commented 6 years ago

Doing it on the R level basically means climbing the whole list tree and applying gsub() on any character value. Maybe I will come back if I come up with some elegant solution.

eddelbuettel commented 6 years ago

I think a better solution may be to pass an option down to the parse function which, if set, will skip the escaping.

vh-d commented 6 years ago

exactly... and it would be set to FALSE by default for backward compatibility

eddelbuettel commented 6 years ago

Yup. I see you forked already. Care to try a small and focused pull request? ;-)

eddelbuettel commented 6 years ago

Fixed in #28

tkschmidt commented 5 years ago

I think the issue still exists. I'm using version 0.1.5 and multiline strings will still break it.

e.g.

query = """  SELECT * FROM 
TABLE
WHERE 
property IS NOT NULL;

after loading

toml <- parseTOML(file.path(PATH,"config.toml"))
toml$query: SELECT *FROM \\n (...)

This will break the SQL parser and my dirty solution is dbGetQuery(con, gsub("\\\\n", "", toml$query))

Am I holding it wrongly?

eddelbuettel commented 5 years ago

I am not sure I read the TOML spec and examples the same way you do -- in short, I do not think that looks like valid TOML. Can you point to a TOML example or different TOML parser doing this?

vh-d commented 5 years ago

You need to explicitly set the parameter escape=FALSE. So in your example

toml <- parseTOML(file.path(PATH,"config.toml"), escape=FALSE)

The default behaviour escape=TRUE was kept due to backward compatibility.

tkschmidt commented 5 years ago

I just checked again and I forget to add an important line in the top example

[PT]
query = """  SELECT * FROM 
TABLE
WHERE 
property IS NOT NULL;
"""

This results in

> toml <- parseTOML(file.path("/tmp","config.toml"), escape=F)
> toml
List of 1
 $ PT:List of 1
  ..$ query: chr "  SELECT * FROM TABLE\\nWHERE \\nproperty IS NOT NULL;\\n"

If I now delete the section header

query = """  SELECT * FROM 
TABLE
WHERE 
property IS NOT NULL;
"""

it results in


> toml <- parseTOML(file.path("/tmp","config_no_section.toml"), escape=F)
> toml
List of 1
 $ query: chr "  SELECT * FROM TABLE\nWHERE \nproperty IS NOT NULL;\n"

The toml parser of python eats both versions, but I've never checked if it is correct behaviour.

vh-d commented 5 years ago

OK, that's strange. Are you up to date with master? I believe CRAN version may be outdated.

tkschmidt commented 5 years ago

I updated to master and now it's fine. I didn't check in which state CRAN was because the merge event was 4 days older than the CRAN version. I assumed that CRAN was fine ^^

Could we get git tags that resemble the version in the description file?

Thanks for all your fast replies

eddelbuettel commented 5 years ago

CRAN has versions.

Github has a commit stream. Master may at times be ahead of CRAN.

You have a money-back guarantee, but not much more. Remember we're volunteers here.

tkschmidt commented 5 years ago

Sry, wasn't meant to be that harsh ^^ It's weird to be at the other end of the issue/request stream ;)

salim-b commented 5 years ago

CRAN has versions.

Github has a commit stream. Master may at times be ahead of CRAN.

You have a money-back guarantee, but not much more. Remember we're volunteers here.

I don't wanna overstrain your volunteering here, but I think it would be really cool to update the CRAN version because it doesn't include @vh-d 's second PR #32 which fixes the escape switch in case of TOML tables and arrays.

eddelbuettel commented 5 years ago

Yes, looking at my git there where #32 and the lesser #35 (doc only). Looks like cpptoml has not changed so it should be quick. Just sent something else to CRAN so the timing may be right to do a quick release here too.

vh-d commented 5 years ago

I would have a small PR fixing unicode chars in arrays. I will try to send it ASAP so it would make it to this new CRAN version

eddelbuettel commented 5 years ago

Thanks everybody for the help, particularly @vh-d for the different PRs, @heavywatal for catching an issue with a stale statement in the README.md and @salim-b for the reminder.

I put some work into switching to the (awesome, trust me) and still new-ish tinytest package to have proper tests, also of the installed package (!!) which is really important (and under-appreciated) feature (that testthat lacks in the default settings).

So now with the version in the master branch:

image

As per f22c0aa I renamed it 0.1.5.1 as a release candidate. If you have a moment, give it a test and let me know (here) it is ok. It should be fine. I also added something to ChangeLog and NEWS to reflect the PRs. Thanks again for those.

vh-d commented 5 years ago

Nice. All tests pass on my Linux machine. There is one issue with encoding on Windows (which you have already spotted as well as I see in later commits). See my PR for a possible solution.

eddelbuettel commented 5 years ago

Yes -- the machines at rhub and win-builder are good for such tests.

And thanks for the PR which I just merged.

salim-b commented 5 years ago

Awesome! All tests performed by tinytest::test_package("RcppTOML") run fine here, too (Ubuntu 16.04; R 3.6.0).

eddelbuettel commented 5 years ago

Yes it should as I develop on Ubuntu (albeit "current", now 19.04) and CRAN etc test on it too. Thanks for your patience, last few months were busy but a release was clearly a good idea.

And it does of course make my day that you already groked that you too can now run tests locally! Paging @markvanderloo who is the magician behind the curtain for that one...