Smrtovrisk / google-refine

Automatically exported from code.google.com/p/google-refine

Implement or remove the line separator option #477

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
Many of the importer options panels have a field to specify a line terminator 
other than the default \n, but, as far as I can tell, no importer actually 
implements that option.

We should either remove this option from the UI or get it implemented in the 
importer architecture.  I'm not sure, but I think we'd probably need to 
implement a filter which replaces the desired line terminator character with \n 
before passing the character stream along.
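
A minimal sketch of what such a filter might look like, assuming a single-character terminator. The class name `LineTerminatorFilterReader` and the helper `splitLines` are hypothetical and not part of the existing importer code. Note that any `\n` already present in the data would still be treated as a line break.

```java
import java.io.BufferedReader;
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical filter: maps one custom line-terminator character to '\n'. */
class LineTerminatorFilterReader extends FilterReader {
    private final char terminator;

    LineTerminatorFilterReader(Reader in, char terminator) {
        super(in);
        this.terminator = terminator;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c == terminator ? '\n' : c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // When n is -1 (end of stream) the loop body never runs.
        for (int i = off; i < off + n; i++) {
            if (buf[i] == terminator) {
                buf[i] = '\n';
            }
        }
        return n;
    }

    /** Illustration only: wraps the filter in a BufferedReader and collects the lines. */
    static List<String> splitLines(String text, char terminator) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new LineTerminatorFilterReader(new StringReader(text), terminator))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

Downstream code that already reads line by line would not need to change, since it would only ever see `\n`.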

Original issue reported on code.google.com by tfmorris on 4 Nov 2011 at 7:20

GoogleCodeExporter commented 9 years ago

Original comment by tfmorris on 5 Nov 2011 at 3:59

GoogleCodeExporter commented 9 years ago
I added that option just because it was parallel to the column separator 
option. But I don't know if it's really needed in practice. Maybe someone who 
has dealt with more data can chime in.

If it turns out that we need it, then I can just implement something similar to 
LineNumberReader.

Original comment by dfhu...@gmail.com on 6 Nov 2011 at 5:17

GoogleCodeExporter commented 9 years ago
It looks like old-skool ASCII files (binary MARC21 is one) which use the ASCII 
control characters might need this, but I'm not sure how common it would be for 
folks to need to process this type of file.

You're right that overriding BufferedReader.readLine() (which is what its 
subclass LineNumberReader uses) would be more efficient than implementing a 
FilterReader.
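
A rough sketch of the readLine() override, assuming a single-character separator. `DelimitedLineReader` and `readAll` are hypothetical names, and this version reads one character at a time from the buffer rather than scanning it in bulk. The usage note relies on MARC21's record terminator being the ASCII control character 0x1D.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader whose readLine() splits on a custom separator instead of \n or \r. */
class DelimitedLineReader extends BufferedReader {
    private final char separator;

    DelimitedLineReader(Reader in, char separator) {
        super(in);
        this.separator = separator;
    }

    @Override
    public String readLine() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = read()) != -1) {
            if (c == separator) {
                return sb.toString();   // separator is consumed, not returned
            }
            sb.append((char) c);
        }
        // End of stream: return trailing text if any, otherwise signal EOF.
        return sb.length() > 0 ? sb.toString() : null;
    }

    /** Illustration only: collects every record from an in-memory string. */
    static List<String> readAll(String text, char separator) {
        List<String> records = new ArrayList<>();
        try (DelimitedLineReader reader =
                new DelimitedLineReader(new StringReader(text), separator)) {
            String record;
            while ((record = reader.readLine()) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

For example, `DelimitedLineReader.readAll("rec1\u001Drec2\u001D", '\u001D')` would yield the two records `rec1` and `rec2`.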

Original comment by tfmorris on 6 Nov 2011 at 2:24

GoogleCodeExporter commented 9 years ago
http://ronaldduncan.wordpress.com/category/software/file-formats/
http://www.loc.gov/marc/specifications/specrecstruc.html

Original comment by tfmorris on 6 Nov 2011 at 2:25

GoogleCodeExporter commented 9 years ago
Agreed with Tom. This typically comes up only in old data sets like MARC, from 
the early 1980s and earlier, around the time that tape storage (linear, using 
control characters to give position info) died out and hard disks (non-linear) 
became cheaper.  I have not had to deal with control character hysteria within 
ANY data set during any of my enterprise data migrations in the last 15 years.  
Safe to remove, I think.

Original comment by thadguidry on 6 Nov 2011 at 3:53

GoogleCodeExporter commented 9 years ago
I'll remove those options then.

Original comment by dfhu...@gmail.com on 6 Nov 2011 at 6:28

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r2364.

Original comment by dfhu...@google.com on 6 Nov 2011 at 8:13