datalad / datalad-neuroimaging

DataLad extension for neuroimaging research
http://datalad.org

Extracting JSON encodable text data from reStructuredText documents #103

Open jsheunis opened 2 years ago

jsheunis commented 2 years ago

For the updated BIDS extractor I'm reading information intended for a generic description field from any README files in the datalad dataset. These would be, for example, README.md, README.rst, README.txt, or just README.

Currently I'm doing:

from datalad.utils import assure_unicode

with open(README_fname, 'rb') as f:
    desc = assure_unicode(f.read()).strip()

This gets a string from, e.g., the RST doc. Here's an example from the README.rst of a studyforrest subdataset:

"An Extension of studyforrest.org Dataset\n****************************************\n\n|license| |access| |doi|\n\nSimultaneous fMRI/eyetracking while movie watching, plus visual localizers\n==========================================================================\n\nThis is an extension of the studyforrest project, all participants previously\nvolunteered for the audio-only Forrest Gump study. The datset is structured in\nBIDS format, details of the files and metadata can be found at:\n\n     Ayan Sengupta, Falko R. Kaule, J. Swaroop Guntupalli, Michael B. Hoffmann,\n     Christian H\u00e4usler, J\u00f6rg Stadler, Michael Hanke. `An extension of the\n     studyforrest dataset for vision research\n     <http://biorxiv.org/content/early/2016/03/31/046573>`_. (submitted for\n     publication)\n\n     Michael Hanke, Nico Adelh\u00f6fer, Daniel Kottke, Vittorio Iacovella,\n     Ayan Sengupta, Falko R. Kaule, Roland Nigbur, Alexander Q. Waite,\n     Florian J. Baumgartner & J\u00f6rg Stadler. `Simultaneous fMRI and eye gaze\n     recordings during prolonged natural stimulation \u2013 a studyforrest extension\n     <http://biorxiv.org/content/early/2016/03/31/046581>`_. (submitted for\n     publication)\n\nFor more information about the project visit: http://studyforrest.org\n\n\nHow to obtain the dataset\n-------------------------\n\nThe dataset is available for download from `OpenFMRI (accession number\nds000113d) <https://www.openfmri.org/dataset/ds000113d>`_.\n\nAlternatively, the `studyforrest phase 2 repository on GitHub\n<https://github.com/psychoinformatics-de/studyforrest-data-phase2>`_ provides\naccess as a DataLad dataset.\n\nDataLad datasets and how to use them\n------------------------------------\n\nThis repository is a `DataLad <https://www.datalad.org/>`__ dataset. It provides\nfine-grained data access down to the level of individual files, and allows for\ntracking future updates up to the level of single files. In order to use\nthis repository for data retrieval, `DataLad <https://www.datalad.org>`_ is\nrequired. It is a free and open source command line tool, available for all\nmajor operating systems, and builds up on Git and `git-annex\n<https://git-annex.branchable.com>`__ to allow sharing, synchronizing, and\nversion controlling collections of large files. You can find information on\nhow to install DataLad at `handbook.datalad.org/en/latest/intro/installation.html\n<http://handbook.datalad.org/en/latest/intro/installation.html>`_.\n\nGet the dataset\n^^^^^^^^^^^^^^^\n\nA DataLad dataset can be ``cloned`` by running::\n\n   datalad clone <url>\n\nOnce a dataset is cloned, it is a light-weight directory on your local machine.\nAt this point, it contains only small metadata and information on the\nidentity of the files in the dataset, but not actual *content* of the\n(sometimes large) data files.\n\nRetrieve dataset content\n^^^^^^^^^^^^^^^^^^^^^^^^\n\nAfter cloning a dataset, you can retrieve file contents by running::\n\n   datalad get <path/to/directory/or/file>\n\nThis command will trigger a download of the files, directories, or\nsubdatasets you have specified.\n\nDataLad datasets can contain other datasets, so called *subdatasets*. If you\nclone the top-level dataset, subdatasets do not yet contain metadata and\ninformation on the identity of files, but appear to be empty directories. 
In\norder to retrieve file availability metadata in subdatasets, run::\n\n   datalad get -n <path/to/subdataset>\n\nAfterwards, you can browse the retrieved metadata to find out about\nsubdataset contents, and retrieve individual files with ``datalad get``. If you\nuse ``datalad get <path/to/subdataset>``, all contents of the subdataset will\nbe downloaded at once.\n\nStay up-to-date\n^^^^^^^^^^^^^^^\n\nDataLad datasets can be updated. The command ``datalad update`` will *fetch*\nupdates and store them on a different branch (by default\n``remotes/origin/master``). Running::\n\n   datalad update --merge\n\nwill *pull* available updates and integrate them in one go.\n\nMore information\n^^^^^^^^^^^^^^^^\n\nMore information on DataLad and how to use it can be found in the DataLad Handbook at\n`handbook.datalad.org <http://handbook.datalad.org/en/latest/index.html>`_. The\nchapter \"DataLad datasets\" can help you to familiarize yourself with the\nconcept of a dataset.\n\n\n.. _Git: http://www.git-scm.com\n\n.. _git-annex: http://git-annex.branchable.com/\n\n.. |license|\n   image:: https://img.shields.io/badge/license-PDDL-blue.svg\n    :target: http://opendatacommons.org/licenses/pddl/summary\n    :alt: PDDL-licensed\n\n.. |access|\n   image:: https://img.shields.io/badge/data_access-unrestricted-green.svg\n    :alt: No registration or authentication required\n\n.. |doi|\n   image:: https://zenodo.org/badge/14167/psychoinformatics-de/studyforrest-data-phase2.svg\n    :target: https://zenodo.org/badge/latestdoi/14167/psychoinformatics-de/studyforrest-data-phase2\n    :alt: DOI"

However, when I process this field as part of a larger JSON object with jq, I get an error:

parse error: Invalid string: control characters from U+0000 through U+001F must be escaped at line 148, column 14

It looks like the assure_unicode function did not succeed in properly escaping the unicode expressions?
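For reference, a minimal check (the path is just a placeholder) suggests that assure_unicode only decodes the bytes to a str and does no JSON escaping at all, so the decoded text still carries literal newlines, which is what jq rejects once the text ends up inside a JSON document verbatim:

import json
from datalad.utils import assure_unicode

with open('README.rst', 'rb') as f:   # placeholder path
    desc = assure_unicode(f.read()).strip()

# The decoded text still contains raw control characters (the newlines) ...
print(any(ord(c) < 0x20 for c in desc))     # -> True for this README

# ... while json.dumps() escapes them (and the quotes) into the form a JSON
# document needs.
print(json.dumps(desc)[:80])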

If I first convert the rst doc to md using pandoc:

pandoc ../Data/studyforrest-data/original/phase2/README.rst -f rst -t markdown -o READMEPHAS2.md

and then read it in the same way as before, I get:

"'# An Extension of studyforrest.org Dataset\n\n## Simultaneous fMRI/eyetracking while movie watching, plus visual localizers\n\nThis is an extension of the studyforrest project, all participants\npreviously volunteered for the audio-only Forrest Gump study. The datset\nis structured in BIDS format, details of the files and metadata can be\nfound at:\n\n> Ayan Sengupta, Falko R. Kaule, J. Swaroop Guntupalli, Michael B.\n> Hoffmann, Christian Häusler, Jörg Stadler, Michael Hanke. [An\n> extension of the studyforrest dataset for vision\n> research](http://biorxiv.org/content/early/2016/03/31/046573).\n> (submitted for publication)\n>\n> Michael Hanke, Nico Adelhöfer, Daniel Kottke, Vittorio Iacovella, Ayan\n> Sengupta, Falko R. Kaule, Roland Nigbur, Alexander Q. Waite, Florian\n> J. Baumgartner & Jörg Stadler. [Simultaneous fMRI and eye gaze\n> recordings during prolonged natural stimulation -- a studyforrest\n> extension](http://biorxiv.org/content/early/2016/03/31/046581).\n> (submitted for publication)\n\nFor more information about the project visit: <http://studyforrest.org>\n\n### How to obtain the dataset\n\nThe dataset is available for download from [OpenFMRI (accession number\nds000113d)](https://www.openfmri.org/dataset/ds000113d).\n\nAlternatively, the [studyforrest phase 2 repository on\nGitHub](https://github.com/psychoinformatics-de/studyforrest-data-phase2)\nprovides access as a DataLad dataset.\n\n### DataLad datasets and how to use them\n\nThis repository is a [DataLad](https://www.datalad.org/) dataset. It\nprovides fine-grained data access down to the level of individual files,\nand allows for tracking future updates up to the level of single files.\nIn order to use this repository for data retrieval,\n[DataLad](https://www.datalad.org) is required. It is a free and open\nsource command line tool, available for all major operating systems, and\nbuilds up on Git and [git-annex](https://git-annex.branchable.com) to\nallow sharing, synchronizing, and version controlling collections of\nlarge files. You can find information on how to install DataLad at\n[handbook.datalad.org/en/latest/intro/installation.html](http://handbook.datalad.org/en/latest/intro/installation.html).\n\n#### Get the dataset\n\nA DataLad dataset can be `cloned` by running:\n\n    datalad clone <url>\n\nOnce a dataset is cloned, it is a light-weight directory on your local\nmachine. At this point, it contains only small metadata and information\non the identity of the files in the dataset, but not actual *content* of\nthe (sometimes large) data files.\n\n#### Retrieve dataset content\n\nAfter cloning a dataset, you can retrieve file contents by running:\n\n    datalad get <path/to/directory/or/file>\n\nThis command will trigger a download of the files, directories, or\nsubdatasets you have specified.\n\nDataLad datasets can contain other datasets, so called *subdatasets*. If\nyou clone the top-level dataset, subdatasets do not yet contain metadata\nand information on the identity of files, but appear to be empty\ndirectories. In order to retrieve file availability metadata in\nsubdatasets, run:\n\n    datalad get -n <path/to/subdataset>\n\nAfterwards, you can browse the retrieved metadata to find out about\nsubdataset contents, and retrieve individual files with `datalad get`.\nIf you use `datalad get <path/to/subdataset>`, all contents of the\nsubdataset will be downloaded at once.\n\n#### Stay up-to-date\n\nDataLad datasets can be updated. 
The command `datalad update` will\n*fetch* updates and store them on a different branch (by default\n`remotes/origin/master`). Running:\n\n    datalad update --merge\n\nwill *pull* available updates and integrate them in one go.\n\n#### More information\n\nMore information on DataLad and how to use it can be found in the\nDataLad Handbook at\n[handbook.datalad.org](http://handbook.datalad.org/en/latest/index.html).\nThe chapter \\"DataLad datasets\\" can help you to familiarize yourself\nwith the concept of a dataset.'"

It looks like the unicode characters now render correctly.
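(As an aside, if the pandoc route works out, the conversion could presumably be done in-process rather than via a manual command; a rough sketch, assuming pandoc is on the PATH and using a made-up helper name:)

import subprocess

def rst_to_md(rst_fname):
    # Run pandoc and capture the markdown from stdout instead of writing an
    # intermediate file.
    result = subprocess.run(
        ['pandoc', rst_fname, '-f', 'rst', '-t', 'markdown'],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()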

Then, when I process this field as part of a larger JSON object with jq, I get a different error:

parse error: Invalid numeric literal at line 36, column 3942

which points to this part of the string: \\"DataLad datasets\\".

I'm not sure what would be the best way of handling this text extraction such that it can be encoded/decoded in JSON without errors. Any thoughts?
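One direction I'm considering (just a sketch; the dict layout and file name are made up) is to never hand-assemble any JSON and to let json.dumps do all of the escaping when the full metadata object is serialized:

import json
from datalad.utils import assure_unicode

def read_description(readme_fname):
    with open(readme_fname, 'rb') as f:
        return assure_unicode(f.read()).strip()

# Build the metadata as a plain dict and serialize it in one step; json.dumps
# escapes newlines, double quotes and backslashes itself, so the result is
# valid JSON regardless of what the README contains.
metadata = {'description': read_description('README.rst')}
print(json.dumps(metadata))

Piping that output into jq . should then parse cleanly, since there is no hand-assembled quoting anywhere.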

jsheunis commented 2 years ago

Note: Using json.dump(json.load()) on the string after it is read from file results in the same problem.
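(My guess at why this doesn't help, assuming "the string" means the already-assembled JSON document on disk: json.load() applies the same validation as jq, so it fails on the very same unescaped characters and never reaches the dump step. A sketch with made-up file names:)

import json

# If metadata_in.json were valid JSON, this round-trip would re-escape
# everything and produce output jq is happy with ...
with open('metadata_in.json') as f_in, open('metadata_out.json', 'w') as f_out:
    json.dump(json.load(f_in), f_out)

# ... but when the file was assembled by pasting raw README text into a JSON
# template, json.load() raises json.JSONDecodeError ("Invalid control
# character at ...") at the same spot jq complains about, before dump() is
# ever reached.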

jsheunis commented 2 years ago

If I have an object with minimal fields saved in a text file (test.json):

{
    "desc_md": "'# An Extension of studyforrest.org Dataset\n\n## Simultaneous fMRI/eyetracking while movie watching, plus visual localizers\n\nThis is an extension of the studyforrest project, all participants\npreviously volunteered for the audio-only Forrest Gump study. The datset\nis structured in BIDS format, details of the files and metadata can be\nfound at:\n\n> Ayan Sengupta, Falko R. Kaule, J. Swaroop Guntupalli, Michael B.\n> Hoffmann, Christian Häusler, Jörg Stadler, Michael Hanke. [An\n> extension of the studyforrest dataset for vision\n> research](http://biorxiv.org/content/early/2016/03/31/046573).\n> (submitted for publication)\n>\n> Michael Hanke, Nico Adelhöfer, Daniel Kottke, Vittorio Iacovella, Ayan\n> Sengupta, Falko R. Kaule, Roland Nigbur, Alexander Q. Waite, Florian\n> J. Baumgartner & Jörg Stadler. [Simultaneous fMRI and eye gaze\n> recordings during prolonged natural stimulation -- a studyforrest\n> extension](http://biorxiv.org/content/early/2016/03/31/046581).\n> (submitted for publication)\n\nFor more information about the project visit: <http://studyforrest.org>\n\n### How to obtain the dataset\n\nThe dataset is available for download from [OpenFMRI (accession number\nds000113d)](https://www.openfmri.org/dataset/ds000113d).\n\nAlternatively, the [studyforrest phase 2 repository on\nGitHub](https://github.com/psychoinformatics-de/studyforrest-data-phase2)\nprovides access as a DataLad dataset.\n\n### DataLad datasets and how to use them\n\nThis repository is a [DataLad](https://www.datalad.org/) dataset. It\nprovides fine-grained data access down to the level of individual files,\nand allows for tracking future updates up to the level of single files.\nIn order to use this repository for data retrieval,\n[DataLad](https://www.datalad.org) is required. It is a free and open\nsource command line tool, available for all major operating systems, and\nbuilds up on Git and [git-annex](https://git-annex.branchable.com) to\nallow sharing, synchronizing, and version controlling collections of\nlarge files. You can find information on how to install DataLad at\n[handbook.datalad.org/en/latest/intro/installation.html](http://handbook.datalad.org/en/latest/intro/installation.html).\n\n#### Get the dataset\n\nA DataLad dataset can be `cloned` by running:\n\n    datalad clone <url>\n\nOnce a dataset is cloned, it is a light-weight directory on your local\nmachine. At this point, it contains only small metadata and information\non the identity of the files in the dataset, but not actual *content* of\nthe (sometimes large) data files.\n\n#### Retrieve dataset content\n\nAfter cloning a dataset, you can retrieve file contents by running:\n\n    datalad get <path/to/directory/or/file>\n\nThis command will trigger a download of the files, directories, or\nsubdatasets you have specified.\n\nDataLad datasets can contain other datasets, so called *subdatasets*. If\nyou clone the top-level dataset, subdatasets do not yet contain metadata\nand information on the identity of files, but appear to be empty\ndirectories. In order to retrieve file availability metadata in\nsubdatasets, run:\n\n    datalad get -n <path/to/subdataset>\n\nAfterwards, you can browse the retrieved metadata to find out about\nsubdataset contents, and retrieve individual files with `datalad get`.\nIf you use `datalad get <path/to/subdataset>`, all contents of the\nsubdataset will be downloaded at once.\n\n#### Stay up-to-date\n\nDataLad datasets can be updated. 
The command `datalad update` will\n*fetch* updates and store them on a different branch (by default\n`remotes/origin/master`). Running:\n\n    datalad update --merge\n\nwill *pull* available updates and integrate them in one go.\n\n#### More information\n\nMore information on DataLad and how to use it can be found in the\nDataLad Handbook at\n[handbook.datalad.org](http://handbook.datalad.org/en/latest/index.html).\nThe chapter \\"DataLad datasets\\" can help you to familiarize yourself\nwith the concept of a dataset.'",
    "desc_rst": "An Extension of studyforrest.org Dataset\n****************************************\n\n|license| |access| |doi|\n\nSimultaneous fMRI/eyetracking while movie watching, plus visual localizers\n==========================================================================\n\nThis is an extension of the studyforrest project, all participants previously\nvolunteered for the audio-only Forrest Gump study. The datset is structured in\nBIDS format, details of the files and metadata can be found at:\n\n     Ayan Sengupta, Falko R. Kaule, J. Swaroop Guntupalli, Michael B. Hoffmann,\n     Christian H\u00e4usler, J\u00f6rg Stadler, Michael Hanke. `An extension of the\n     studyforrest dataset for vision research\n     <http://biorxiv.org/content/early/2016/03/31/046573>`_. (submitted for\n     publication)\n\n     Michael Hanke, Nico Adelh\u00f6fer, Daniel Kottke, Vittorio Iacovella,\n     Ayan Sengupta, Falko R. Kaule, Roland Nigbur, Alexander Q. Waite,\n     Florian J. Baumgartner & J\u00f6rg Stadler. `Simultaneous fMRI and eye gaze\n     recordings during prolonged natural stimulation \u2013 a studyforrest extension\n     <http://biorxiv.org/content/early/2016/03/31/046581>`_. (submitted for\n     publication)\n\nFor more information about the project visit: http://studyforrest.org\n\n\nHow to obtain the dataset\n-------------------------\n\nThe dataset is available for download from `OpenFMRI (accession number\nds000113d) <https://www.openfmri.org/dataset/ds000113d>`_.\n\nAlternatively, the `studyforrest phase 2 repository on GitHub\n<https://github.com/psychoinformatics-de/studyforrest-data-phase2>`_ provides\naccess as a DataLad dataset.\n\nDataLad datasets and how to use them\n------------------------------------\n\nThis repository is a `DataLad <https://www.datalad.org/>`__ dataset. It provides\nfine-grained data access down to the level of individual files, and allows for\ntracking future updates up to the level of single files. In order to use\nthis repository for data retrieval, `DataLad <https://www.datalad.org>`_ is\nrequired. It is a free and open source command line tool, available for all\nmajor operating systems, and builds up on Git and `git-annex\n<https://git-annex.branchable.com>`__ to allow sharing, synchronizing, and\nversion controlling collections of large files. You can find information on\nhow to install DataLad at `handbook.datalad.org/en/latest/intro/installation.html\n<http://handbook.datalad.org/en/latest/intro/installation.html>`_.\n\nGet the dataset\n^^^^^^^^^^^^^^^\n\nA DataLad dataset can be ``cloned`` by running::\n\n   datalad clone <url>\n\nOnce a dataset is cloned, it is a light-weight directory on your local machine.\nAt this point, it contains only small metadata and information on the\nidentity of the files in the dataset, but not actual *content* of the\n(sometimes large) data files.\n\nRetrieve dataset content\n^^^^^^^^^^^^^^^^^^^^^^^^\n\nAfter cloning a dataset, you can retrieve file contents by running::\n\n   datalad get <path/to/directory/or/file>\n\nThis command will trigger a download of the files, directories, or\nsubdatasets you have specified.\n\nDataLad datasets can contain other datasets, so called *subdatasets*. If you\nclone the top-level dataset, subdatasets do not yet contain metadata and\ninformation on the identity of files, but appear to be empty directories. 
In\norder to retrieve file availability metadata in subdatasets, run::\n\n   datalad get -n <path/to/subdataset>\n\nAfterwards, you can browse the retrieved metadata to find out about\nsubdataset contents, and retrieve individual files with ``datalad get``. If you\nuse ``datalad get <path/to/subdataset>``, all contents of the subdataset will\nbe downloaded at once.\n\nStay up-to-date\n^^^^^^^^^^^^^^^\n\nDataLad datasets can be updated. The command ``datalad update`` will *fetch*\nupdates and store them on a different branch (by default\n``remotes/origin/master``). Running::\n\n   datalad update --merge\n\nwill *pull* available updates and integrate them in one go.\n\nMore information\n^^^^^^^^^^^^^^^^\n\nMore information on DataLad and how to use it can be found in the DataLad Handbook at\n`handbook.datalad.org <http://handbook.datalad.org/en/latest/index.html>`_. The\nchapter \"DataLad datasets\" can help you to familiarize yourself with the\nconcept of a dataset.\n\n\n.. _Git: http://www.git-scm.com\n\n.. _git-annex: http://git-annex.branchable.com/\n\n.. |license|\n   image:: https://img.shields.io/badge/license-PDDL-blue.svg\n    :target: http://opendatacommons.org/licenses/pddl/summary\n    :alt: PDDL-licensed\n\n.. |access|\n   image:: https://img.shields.io/badge/data_access-unrestricted-green.svg\n    :alt: No registration or authentication required\n\n.. |doi|\n   image:: https://zenodo.org/badge/14167/psychoinformatics-de/studyforrest-data-phase2.svg\n    :target: https://zenodo.org/badge/latestdoi/14167/psychoinformatics-de/studyforrest-data-phase2\n    :alt: DOI"
}

Then the following command gives the error:

cat test.json | jq .
parse error: Invalid numeric literal at line 2, column 3949

Looks like the extra escaped double quotes in \\"DataLad datasets\\" are the problem. The error goes away if that content is replaced by 'DataLad datasets'. And the rst string does not give an error as above... (could be that the original error was not caused by the rst string after all...)
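A sketch of how test.json could be generated programmatically instead of pasting the Python repr (the string content is abbreviated), so that backslashes and inner quotes get escaped exactly once:

import json

# Raw markdown roughly as produced by pandoc; the full text is abbreviated.
desc_md = 'The chapter "DataLad datasets" can help you to familiarize ...'

# json.dump() escapes the inner double quotes (and any backslashes) exactly
# once, so the string value never terminates early and jq can parse the file.
with open('test.json', 'w') as f:
    json.dump({'desc_md': desc_md}, f, indent=4)

After writing the file this way, cat test.json | jq . should succeed.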

jsheunis commented 2 years ago

More generally, my challenge is to format any string that is supposed to end up JSON-serialized correctly (and automatically), so that I don't run into problems with jq.
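To make that requirement concrete, this is roughly the end-to-end check I have in mind (a sketch; it assumes jq is installed and uses a placeholder path), with json.dumps as the only escaping step and jq as the validator:

import json
import subprocess

text = open('README.rst', encoding='utf-8').read().strip()   # placeholder path
payload = json.dumps({'description': text})

# jq acts as the external validator: exit code 0 means the escaping is fine.
result = subprocess.run(['jq', '.'], input=payload, text=True,
                        capture_output=True)
print(result.returncode)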