ClimbsRocks / data-formatter

Takes raw csv input and formats it to be ready for neural networks

long-term nlp column cleanup #76

Open ClimbsRocks opened 8 years ago

ClimbsRocks commented 8 years ago

it's kind of icky manual work, but:

we'd have to do this right at the start: after reading in the dataDescription rows (that's how we find out we have an nlp column), but before we do anything else.

we could go through the whole raw document.

for each row, skip past the expected number of commas before the nlp column, then count back the expected number of commas from the end of the row to cover the columns after it.

then concat everything left in between into a single field, and remove all stray quotes, newline characters, etc. from it.
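a rough sketch of the comma-counting idea in Python, just to make it concrete. `extract_nlp_field`, `nlp_index`, and `num_columns` are hypothetical names, not anything in the repo. it assumes the nlp column is the only one with stray commas, and that each row has already been isolated as a single string:

```python
def extract_nlp_field(row, nlp_index, num_columns):
    """Split one raw row, treating everything between the fixed
    outer columns as the nlp text (hypothetical helper)."""
    row = row.rstrip("\r\n")
    cols_after = num_columns - nlp_index - 1

    # Columns before the nlp column: split from the left on the
    # first nlp_index commas only.
    parts = row.split(",", nlp_index)
    before, rest = parts[:nlp_index], parts[nlp_index]

    # Columns after the nlp column: split from the right on the
    # last cols_after commas only.
    if cols_after:
        tail = rest.rsplit(",", cols_after)
        nlp_text, after = tail[0], tail[1:]
    else:
        nlp_text, after = rest, []

    # Whatever is left in the middle is the nlp text: drop quotes
    # and newline characters.
    nlp_text = nlp_text.replace('"', "")
    nlp_text = nlp_text.replace("\r", " ").replace("\n", " ").strip()
    return before + [nlp_text] + after


row = '7,I said "hi, there", then left,positive\n'
print(extract_nlp_field(row, nlp_index=1, num_columns=3))
```

this breaks as soon as any other column contains a comma, or when a newline inside the nlp text splits one row across several lines, which is part of why it's icky.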

or, we could just find a proper csv parser that can handle things like unbalanced quotes with commas, etc.
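for comparison, a minimal sketch of the parser route using Python's built-in `csv` module. `read_rows` and `nlp_index` are hypothetical names. this only shows the well-formed case (quoted fields with embedded commas and newlines); rows with genuinely unbalanced quotes may still come out wrong and would need separate handling:

```python
import csv
import io


def read_rows(raw_text, nlp_index):
    """Parse raw csv text and collapse whitespace (including
    newlines) inside the nlp column (hypothetical helper)."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        if not fields:
            continue  # skip blank lines
        # split() with no arguments splits on any whitespace run,
        # so embedded newlines become single spaces.
        fields[nlp_index] = " ".join(fields[nlp_index].split())
        rows.append(fields)
    return rows


raw = (
    'id,review,label\n'
    '1,"good, but pricey",pos\n'
    '2,"line one\nline two",neg\n'
)
for parsed in read_rows(raw, nlp_index=1):
    print(parsed)
```

the parser keeps the quoted comma and the quoted newline inside a single field, so the nlp cleanup can run per field rather than per raw line.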