clulab / reach

Reach Biomedical Information Extraction

Large files break serial-json output #790

Open kwalcock opened 1 year ago

kwalcock commented 1 year ago

Large files seem to fail in SerialJsonOutput.scala at

MentionsOps(mentions).json(pretty = true)

in which the entire output is converted into a single string. That string may be over 2 GB in length. The exceptions thrown complain about negative numbers, which are probably the result of integer overflow.

If the conversion itself succeeds, the failure comes during the next f.writeString, which can't encode the large string to get it into the file. The error is "java.lang.OutOfMemoryError: Requested array size exceeds VM limit".
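Both symptoms are consistent with JVM strings and arrays being sized by 32-bit signed integers. As a rough illustration of the arithmetic only (the object name and the sizes are made up for this sketch and are not taken from Reach or its dependencies), a length just over 2 GiB wraps to a negative number when narrowed to an Int, and so does a buffer capacity that is doubled once too often:

```scala
object IntOverflowSketch {
  // Hypothetical output size: a little over 2 GiB of characters.
  val intendedLength: Long = 2L * 1024 * 1024 * 1024 + 1024

  // Narrowing to Int keeps only the low 32 bits, so the value turns negative.
  val wrapped: Int = intendedLength.toInt // -2147482624

  // A buffer that grows by doubling an Int capacity runs into the same limit.
  val capacity: Int = 1 << 30
  val doubled: Int = capacity * 2 // -2147483648, which is Int.MinValue

  def main(args: Array[String]): Unit = {
    println(s"intended length:  $intendedLength")
    println(s"narrowed to Int:  $wrapped")
    println(s"doubled capacity: $doubled")
    println(s"Int.MaxValue:     ${Int.MaxValue}")
  }
}
```

Whether the actual exceptions come from exactly this kind of narrowing or doubling is a guess; the stack traces would have to confirm it.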

There may be a way to send the formatted output directly to a file, piece by piece, without building the intermediate string. That should fix the problem. The input file is only 560 KB, and there are some as large as 3.5 MB that need to be processed.
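A minimal sketch of the piecewise idea, using only the Java standard library. It assumes the output can be treated as a sequence of chunks that are each small enough to serialize on their own. The object name, the writeJsonArray helper, and the flat array layout are all hypothetical and do not reproduce the real serial-json format or any existing Reach API:

```scala
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.Files

object StreamingJsonSketch {
  // Writes already-serialized JSON values as one JSON array, one element at a
  // time, so the complete document never exists as a single String in memory.
  // `elements` stands in for per-mention serialization (an assumption).
  def writeJsonArray(file: File, elements: Iterator[String]): Unit = {
    val writer = Files.newBufferedWriter(file.toPath, StandardCharsets.UTF_8)
    try {
      writer.write("[")
      var first = true
      elements.foreach { element =>
        if (!first) writer.write(",")
        writer.write("\n  ")
        writer.write(element)
        first = false
      }
      // Put the closing bracket on its own line unless the array is empty.
      if (!first) writer.write("\n")
      writer.write("]")
    } finally {
      writer.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val file = File.createTempFile("mentions", ".json")
    // The iterator is consumed lazily, so only one element is held at a time.
    val elements = Iterator.tabulate(3)(i => s"""{"id":$i}""")
    writeJsonArray(file, elements)
    println(new String(Files.readAllBytes(file.toPath), StandardCharsets.UTF_8))
    file.delete()
  }
}
```

With this shape, memory use is bounded by the largest single element rather than by the whole output. If the JSON library in use can write a value straight to a java.io.Writer, that may be the simpler route, but it would still need to avoid holding the full formatted text at once.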