Newline token conversion between markdown and json formats

AMR-K commented 4 years ago

Rasa version: rasa==1.10.3

Rasa SDK version (if used & relevant): rasa-sdk==1.10.2

Rasa X version (if used & relevant):

Python version: Python 3.7.8

Operating system: Ubuntu 20

Issue: My team has training datasets with newline tokens \n as part of the text field in JSON files. We usually convert the data files to the Markdown format to inspect and edit them, then convert them back to JSON. However, converting a JSON file to Markdown and then back to JSON escapes the newline tokens (\n becomes \\n), which isn't desirable.

Error (including full traceback):

Command or request that led to error:

$ cat json_input.json
{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "intent": "foo",
        "text": "bar \n bar \n bar"
      }
    ]
  }
}

$ rasa data convert nlu --data json_input.json --out markdown.md -f md
$ cat markdown.md
## intent:foo
- bar \n bar \n bar

$ rasa data convert nlu --data markdown.md --out json_output.json -f json
$ cat json_output.json
{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "intent": "foo",
        "text": "bar \\n bar \\n bar"
      }
    ],
    "regex_features": [],
    "lookup_tables": [],
    "entity_synonyms": []
  }
}
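
The round trip above can be reproduced in isolation. The following is a minimal sketch, not Rasa's code: `to_markdown_line` and `from_markdown_line` are hypothetical helpers that assume the writer escapes real newlines so an example fits on one Markdown line, while the reader takes the line back verbatim.

```python
import json


def to_markdown_line(text: str) -> str:
    """Hypothetical writer: escape real newlines so the example stays on one line."""
    return "- " + text.replace("\n", "\\n")


def from_markdown_line(line: str) -> str:
    """Hypothetical reader: strip the list marker but do NOT undo the escaping."""
    return line[2:]


# JSON text {"text": "bar \n bar \n bar"} decodes to a string with real newlines.
original = json.loads(r'{"text": "bar \n bar \n bar"}')["text"]

md_line = to_markdown_line(original)        # backslash + 'n', two characters
round_tripped = from_markdown_line(md_line)  # still backslash + 'n'

# Serialising again escapes the backslash itself, giving \\n in the JSON file.
print(json.dumps(round_tripped))
```

Under these assumptions the asymmetry between writing and reading is enough to turn `\n` into `\\n` after one Markdown round trip.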

Code responsible for the issue: https://github.com/RasaHQ/rasa/blob/88ad06f3234ef68ecea9076e19747f3a07a097f4/rasa/nlu/training_data/formats/markdown.py#L51

https://github.com/RasaHQ/rasa/blob/88ad06f3234ef68ecea9076e19747f3a07a097f4/rasa/nlu/training_data/formats/markdown.py#L70

sara-tagger commented 4 years ago

Thanks for the issue, @tmbo will get back to you about it soon!

You may find help in the docs and the forum, too 🤗

AMR-KELEG commented 4 years ago

@tabergma sorry if the tag is spammy, but could you help me with this issue? The way \n characters are escaped distorts my training data files. Thanks :sweat_smile:

AMR-KELEG commented 4 years ago

@akelad Could you please check this issue? Is there a reason why \n tokens are escaped this way?

akelad commented 4 years ago

It's been added to one of our teams' inboxes - can I ask how come you're using JSON in the first place? I believe that format might be deprecated soon.

AMR-KELEG commented 4 years ago

> It's been added to one of our teams' inboxes - can I ask how come you're using JSON in the first place? I believe that format might be deprecated soon.

Well, I have just checked the Rasa blog post for version 2.0 and noticed that YAML will be the format for data files. JSON is the format that my team has been using for a while now, and it's convenient since it can be easily manipulated and read by different programming languages. However, I don't find JSON human-readable and I prefer the MD format, which is why I convert JSON files to MD, manipulate them, and then convert them back to JSON.

akelad commented 4 years ago

Yeah, that makes sense - would YAML be a good replacement for JSON for you once 2.0 is released? JSON will still be around for a while, but we will be encouraging users to switch to the new format.

Also, since you already found the area of the code that causes this issue, would you be up for submitting a PR to fix it?

AMR-KELEG commented 4 years ago

I have only used YAML for pipeline configurations, so I am not sure how it's used for NLU data (will give it a try soon). I have created a PR that un-escapes the \n tokens in a markdown file.
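
For illustration only, un-escaping on read could look roughly like the sketch below. This is not the code from the PR: the escape table and the names `ESCAPES`, `escape`, and `unescape` are assumptions, chosen so that reading a Markdown example inverts what was done when writing it.

```python
import re

# Assumed escape table (hypothetical; the real writer may cover other characters).
ESCAPES = {"\b": "\\b", "\f": "\\f", "\n": "\\n", "\r": "\\r", "\t": "\\t"}
UNESCAPES = {escaped: raw for raw, escaped in ESCAPES.items()}

_ESCAPE_RE = re.compile(r"[\b\f\n\r\t]")   # control characters to escape on write
_UNESCAPE_RE = re.compile(r"\\[bfnrt]")    # backslash sequences to restore on read


def escape(text: str) -> str:
    """Replace control characters with their two-character escape sequences."""
    return _ESCAPE_RE.sub(lambda m: ESCAPES[m.group(0)], text)


def unescape(text: str) -> str:
    """Inverse of escape(): turn escape sequences back into control characters."""
    return _UNESCAPE_RE.sub(lambda m: UNESCAPES[m.group(0)], text)
```

One caveat of this sketch: because a literal backslash is not escaped itself, text that genuinely contains a backslash followed by `n` (for example a Windows path) would also be turned into a newline by `unescape`.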

akelad commented 4 years ago

nice thanks!

AMR-KELEG commented 4 years ago

O/ Akela,

I am checking the live docs https://rasa.com/docs/rasa/nlu/training-data-format/#data-formats but it looks like the YAML format isn't part of them yet. Will the docs be updated soon? I find it easier and more convenient to check the online docs rather than build them from source.

Thanks :smile:

akelad commented 4 years ago

It's still a work in progress, sorry! You can take a peek here: https://github.com/RasaHQ/rasa/pull/6297/files

tmbo commented 4 years ago

@AMR-KELEG still working on the docs, but we'll have an update soon. Once we merge the PR, it will be available at https://rasa.com/docs/rasa/next

AMR-KELEG commented 4 years ago

> It's still a work in progress, sorry! You can take a peek here: https://github.com/RasaHQ/rasa/pull/6297/files

No worries :smile: Thanks for the pointer. I will check the rst file for now.

RasaHQ / rasa

Newline token conversion between markdown and json formats #6087