GateNLP / gateplugin-LearningFramework

A plugin for the GATE language technology framework for training and using machine learning models. Currently supports Mallet (MaxEnt, NaiveBayes, CRF and others), LibSVM, Scikit-Learn, Weka, and DNNs through Pytorch and Keras.
https://gatenlp.github.io/gateplugin-LearningFramework/
GNU Lesser General Public License v2.1

Dense JSON Corpus Representation: double check escaping of new line characters #57

Closed johann-petrak closed 6 years ago

johann-petrak commented 6 years ago

Since the JSON stored in the file is line-oriented, with one instance/sequence per line, any literal new line character would mess that up.

For this reason, no path that leads to the serialization of a feature value into JSON should ever allow an unescaped new line character to appear. We either remove them, replace them with a space, or escape them.

The best approach is probably to replace them with a space.

ianroberts commented 6 years ago

JSON string literals aren’t allowed to contain unescaped newlines, so any competent JSON serialiser library will turn them into \n by default. However, there can be non-significant whitespace (including newlines) between syntactic tokens (after a comma, either side of a colon, etc.), so a pretty-printing serialiser can still spread one object over several lines.
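A quick way to check this from the Python side is a minimal sketch using the standard `json` module. The `instance` dict below is a made-up example, not the plugin's actual dense JSON format:

```python
import json

# Hypothetical instance whose feature value contains a literal newline.
instance = {"features": ["first line\nsecond line"], "target": "A"}

# With the default settings (no indent) the output is a single line,
# and the newline inside the string is written as backslash + 'n'.
line = json.dumps(instance)

# Reading the line back restores the original newline in the value.
restored = json.loads(line)
```

Passing `indent=...` to `json.dumps` would reintroduce newlines between tokens, which is the whitespace case mentioned above.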

Is this one-line-per-instance format a restriction imposed by a third party library, or could your reading code use the same technique as GCP’s JSONStreamingInputHandler, which can cope with any concatenated stream of JSON objects even when they are pretty printed with new lines between properties?
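On the Python reading side, a concatenated stream of JSON values can be handled with `json.JSONDecoder.raw_decode` from the standard library. This is only a sketch of the general idea, not how GCP's JSONStreamingInputHandler is implemented; the helper name `iter_json_objects` is hypothetical, and it assumes the whole stream fits in memory as one string:

```python
import json

def iter_json_objects(text):
    """Yield each JSON value from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            return
        # Returns the decoded value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj
```

Because the decoder tracks where each value ends, it does not matter whether the objects are written one per line or pretty printed across several lines.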

johann-petrak commented 6 years ago

Thank you - my mistake, I was not sure if the spec requires this and filed this issue to check later :smile:

The one-line-per-instance format is imposed by me for interchanging the data between the Java code (writing) and the Python code of the backend (reading). It also allows for very simple file splitting to create a validation set, and possibly, in the future, for better handling of parallel writing of data.
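To illustrate why the line-oriented format makes splitting simple, here is a minimal sketch that never parses the JSON at all. The function name `split_lines`, the default fraction and the seed are assumptions for illustration, not what the backend actually does:

```python
import random

def split_lines(lines, validation_fraction=0.1, seed=0):
    """Shuffle the lines and split them into (train, validation) lists.

    Each line is treated as one opaque instance or sequence.
    """
    lines = list(lines)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(lines)
    n_val = int(len(lines) * validation_fraction)
    return lines[n_val:], lines[:n_val]
```

With a data file on disk this could be used as `train, val = split_lines(open(path, encoding="utf-8"))`, where `path` is a placeholder for the corpus file.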