cbarrick closed this issue 6 years ago
Streaming is not the solution we were looking for. I'm repurposing the issue for general memory performance.
@zachdj made great progress in #15, at the cost of the API.
So far it's our best option.
It appears that the UDFs were a big part of the problem. The built-in RegexTokenizer
seems to handle whole text files without a problem.
I'll see if I can run the small dataset locally.
The current preprocessor uses an extreme amount of memory to load the data.
This is a tracking issue for how we might get memory usage down.