djvanderlaan / LaF

An R package for fast access to large ASCII files

Handle fixed-width (ish) files that end the final field prematurely? #6

Open mikedolanfliss opened 9 years ago

mikedolanfliss commented 9 years ago

Big fan of the package.

SQL Server seems to dump fixed-width files with a premature \r\n line ending in the last column of an otherwise fixed-width file. Any way to handle that with LaF in the future? As is, LaF reads past the end of the line into the next record, and all is buggered. :)

djvanderlaan commented 9 years ago

Thanks for reporting.

Could you add a small (constructed) example? It is not completely clear what your file looks like.

I'll have to see how, if and when I'm able to add this to LaF. For fast random access, LaF relies on every record in a fixed-width file having the same length. Reading line 10,000 can be done by skipping to byte (10,000 - 1) * (sum(widths of all columns) + width of the newline). This would no longer work for your file.

So, this would require a new reader (besides laf_open_csv and laf_open_fwf). On the upside, such a reader would also be usable for UTF-8 'fixed width' files, where a character can take more than one byte.
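The offset arithmetic described in the comment above can be illustrated with a few lines of R. This is only a sketch of the calculation, not LaF's actual implementation: `fwf_record_offset` is a hypothetical helper, and the column widths are made up for the example.

```r
# Hypothetical helper (not part of LaF): zero-based byte offset of the first
# character of a record in a strictly fixed-width file.
fwf_record_offset <- function(record, column_widths, newline_width = 1L) {
  # Every record occupies the same number of bytes: all columns plus the newline.
  record_length <- sum(column_widths) + newline_width
  (record - 1) * record_length
}

# Assumed layout: three columns of 4, 5 and 6 bytes, "\r\n" line endings (2 bytes).
widths <- c(4, 5, 6)
fwf_record_offset(10000, widths, newline_width = 2)  # 169983
```

If even one record is shorter than `sum(widths)`, every later offset computed this way points into the wrong place, which is the failure described in this issue.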

mikedolanfliss commented 9 years ago

Thanks for following up! It's definitely more an issue of the data file than LaF, and for fixed-width LaF obviously requires that structure for access.

Creating a new reader might be a possibility, or a function that reformats the file into a true fixed-width file by padding the last field to its full length. That would deal with these sorts of "fixed-width" records, which sometimes come out of SQL Server and other tools that export poorly formatted fwfs. With underscore instead of space:

Pseudo-fwf: 1234 12345 123456 12345_ 123__4

Needs to be formatted as 123_4_ 12345_ 123456 12345_ 123_4

Then laf_open_fwf could handle it (which I'd prefer to use). If laf_open_fwf could throw a warning when something seems to be a poorly formatted fwf (maybe a test_fwf=T parameter), and there were a function to attempt an fwf formatting of a dataset...?

mike
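For readers who run into the same problem, the padding idea suggested above could be sketched in base R roughly as follows. Everything here is an assumption for illustration: `pad_fwf_lines` is a hypothetical helper, not a LaF function, the sample records are invented, and the approach reads all lines into memory, so it only suits files that fit in RAM.

```r
# Hypothetical helper (not part of LaF): right-pad every record with spaces
# so that all records have the same width.
pad_fwf_lines <- function(lines, total_width) {
  # Drop a trailing carriage return if one survived reading.
  lines <- sub("\r$", "", lines)
  n <- nchar(lines)
  # Records longer than expected suggest the widths are wrong; warn, do not truncate.
  if (any(n > total_width)) {
    warning(sum(n > total_width), " line(s) exceed the expected record width")
  }
  # Append as many spaces as each line is short of the full width.
  paste0(lines, strrep(" ", pmax(0, total_width - n)))
}

# Invented sample: columns of width 4, 5 and 6 separated by single spaces
# (17 characters in total); the last field is cut short in records 2 and 3.
raw <- c("1234 12345 123456",
         "1234 12345 1234",
         "1234 12345 1")
fixed <- pad_fwf_lines(raw, total_width = 17)
nchar(fixed)  # all records now have 17 characters
```

The padded lines could then be written back out with `writeLines()` and opened with `laf_open_fwf` as an ordinary fixed-width file. Note that `nchar()` counts characters rather than bytes, so this sketch assumes single-byte text.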