dehesa / CodableCSV

Read and write CSV files row-by-row or through Swift's Codable interface.
MIT License

Is encoding Double or Float supported? #20

Closed leisurehound closed 4 years ago

leisurehound commented 4 years ago

When I set the floatStrategy to .convert, I get a fatal error in this code:

            case .throw: throw CSVEncoder.Error._invalidFloatingPoint(value, codingPath: self.codingPath)
            case .convert(let positiveInfinity, let negativeInfinity, let nan):
                if value.isNaN {
                    return nan
                } else if value.isInfinite {
                    return (value < 0) ? negativeInfinity : positiveInfinity
                } else { fatalError() }
            }

So with either strategy, attempting to encode a valid Double results in either the thrown error or a fatal error. Is this expected, or is there another configuration I'm missing?

Similarly, when I try to decode a Double, an error is thrown; but when I decode the field as a String and convert the string to a Double in my struct's init(from:), the field is processed correctly.
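
For reference, a minimal sketch of that workaround, assuming an unkeyed container and a hypothetical duration property (not the library author's code):

init(from decoder: Decoder) throws {
  var container = try decoder.unkeyedContainer()
  // Workaround: read the field as a String, then convert it to a Double manually.
  let raw = try container.decode(String.self)
  guard let value = Double(raw) else {
    throw DecodingError.dataCorruptedError(in: container, debugDescription: "Invalid Double: \(raw)")
  }
  self.duration = value
}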

dehesa commented 4 years ago

Hi @leisurehound

It seems I introduced a bug in encoding/decoding Doubles and Floats in one of the latest releases; my tests were only covering integer numbers (not floating-point ones).

It is an easy fix. I will have a patch ready in a couple of hours. Thank you for reporting.

dehesa commented 4 years ago

I have fixed the floating-point problem and added some tests to check correct behavior. The fixes can be found in the develop branch.

Please give it a try, and if it solves your problems I will promote the fix to master and release a new patch version.

dehesa commented 4 years ago

Oh, and I forgot to mention that I have renamed floatStrategy to nonConformingFloatStrategy. This makes it clearer that the encoding/decoding strategy only offers the option to modify non-conforming floating-point values (i.e. infinity, NaN).
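
For illustration, a minimal sketch of how that option might be set, assuming the configuration-closure style shown further down in this thread and associated-value labels that mirror Foundation's JSONEncoder strategy (the placeholder strings are arbitrary):

let encoder = CSVEncoder {
  // Only non-conforming values (infinity, NaN) are affected by this strategy;
  // ordinary Doubles and Floats are encoded normally.
  $0.nonConformingFloatStrategy = .convert(positiveInfinity: "+inf", negativeInfinity: "-inf", nan: "NaN")
}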

leisurehound commented 4 years ago

Thanks @dehesa.

The following statement now works:

duration = try container.decode(Double.self)

but the following does not:

duration = try container.decodeIfPresent(Double.self)

dehesa commented 4 years ago

Hmm, OK. That is interesting. Could you provide a small example of the behavior you are expecting, such as the ones I have in the tests? It would help me with debugging.

leisurehound commented 4 years ago

Sorry, I'm not sure my Codable experience allows me to replicate this outside of the container I'm using within init(from:).

I'm trying to decode an enum property in the parent struct (see below) by decoding each field individually. I'd really like to be able to decode the Codable enum the way I can with JSON, but the nested container's index seems to get mixed up and decoding always fails.

I'm essentially serializing an enum that has one case with an associated value and one without (it's essentially a simplified Result type). Here's how I'd expect it to work (but it doesn't):

struct myStruct : Codable {
  let dateSent: Date
  let response: Response

  init(from decoder: Decoder) throws {
    var container = try decoder.unkeyedContainer()
    // ... decode the date, which works fine...
    dateSent = try container.decode(Date.self)
    response = try container.decode(Response.self)
  }
}

enum Response : Codable {
  case time(duration: Double)
  case failure

  init(from decoder: Decoder) throws {
    var container = try decoder.unkeyedContainer()
    let duration = try container.decodeIfPresent(Double.self)
    let failure = try container.decodeIfPresent(String.self)
    switch (duration, failure) {
      case (.some(let duration), .none): self = .time(duration: duration)
      case (.none, .some): self = .failure
      case (.some, .some): throw DecodingError.dataCorrupted(DecodingError.Context(codingPath: [], debugDescription: "Both Duration & Failure populated"))
      case (.none, .none): throw DecodingError.dataCorrupted(DecodingError.Context(codingPath: [], debugDescription: "Both Duration & Failure empty"))
    }
  }
  // encode(to:) omitted for brevity
}

Here's what I've had to resort to, because the container index becomes problematic when moving to the embedded enum; but it also doesn't work, because decodeIfPresent(Double.self) always throws.

struct myStruct : Codable {
  let dateSent: Date
  let response: Response

  init(from decoder: Decoder) throws {
    var container = try decoder.unkeyedContainer()
    // ... decode the date, which works fine...
    dateSent = try container.decode(Date.self)
    // response = try container.decode(Response.self)
    let duration = try container.decodeIfPresent(Double.self)
    let failure = try container.decodeIfPresent(String.self)
    switch (duration, failure) {
      case (.some(let duration), .none): response = .time(duration: duration)
      case (.none, .some): response = .failure
      case (.some, .some): throw DecodingError.dataCorrupted(DecodingError.Context(codingPath: [], debugDescription: "Both Duration & Failure populated"))
      case (.none, .none): throw DecodingError.dataCorrupted(DecodingError.Context(codingPath: [], debugDescription: "Both Duration & Failure empty"))
    }
  }
}

enum Response {
  case time(duration: Double)
  case failure
}

I'm expecting (and encoding to) records like <datestring>,3.23423, or <datestring>,,failure, but am struggling to decode the same.

Sorry if this is fundamentally the wrong approach here; it's just what I'm used to from my experience encoding/decoding JSON with Codable.

dehesa commented 4 years ago

Alright, let me take a look at it. I have a further question, though: is the CSV schema closed? I mean, is that row definition final (<datestring>,<float>, and <datestring>,<empty>,failure), or would you be open to defining it another way? e.g. <datestring>,<optionalFloat>

leisurehound commented 4 years ago

For this simple problem, yes, that's doable. I'm basically working on a dummy project to learn some aspects of SwiftUI plotting, Combine, and CSV coding.

The real project I plan to use all this on, however, manages an assortment of time series sampled at various frequencies, so column positions are important for determining whether there is a datum at any particular sample time. Time-series CSV is generally columnar data, and in the exports I'll have to process I won't have control over the collapsing of columns in particular rows; they'll simply be empty.

So for the dummy project, yes, it's easy to adapt. But the dummy project has a pattern I've chosen specifically because it's the construct I'll have to support once I get past learning/prototyping. (I'm actually a believer in a different second-system effect: the first time, I don't know what I'm doing and it ends up ugly, unmaintainable, and often inefficient; the second time around, I have the knowledge to do it right and am not left with refactoring artifacts I wish weren't there. That's why I'm constraining this work a bit more than necessary.)

dehesa commented 4 years ago

I fixed a small behavior in decodeNil that is specified in the documentation but that I hadn't accounted for: if nil is encountered, the index must be advanced by one; if not, it must stay put. With this fix, decodeIfPresent should now work as intended.

For your specific problem you actually have several solutions (be sure to download the latest from the develop branch).
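
For illustration only, a minimal sketch of how the whole file could be decoded once the fix is in place; the csvData constant and the default decoder configuration are assumptions, not a prescribed solution:

// Decode every CSV row into the struct above; each row's init(from:)
// walks the unkeyed container with decodeIfPresent as sketched earlier.
let decoder = CSVDecoder()
let records = try decoder.decode([myStruct].self, from: csvData)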

dehesa commented 4 years ago

Let me know if that works for you and then I will upstream the patch to master. Again, thank you for reporting.

leisurehound commented 4 years ago

Yes, I can decodeIfPresent() successfully now.

One last question: I encodeRow() one result at a time (because they're a live time series), thus am using a lazy encoder. After each encoding I call endEncoding() and then write the data with outputStream.writeBytes(). I've learned that I have to create a new encoder for the next datum after I've called endEncoding() on the previous one. Is there a way to reuse an encoder, or is creating a new encoder cheap enough to do for every live datum collected?

dehesa commented 4 years ago

That is not ideal. Creating a new encoder is not "very expensive", but you are creating and deallocating OutputStreams for each row (which is not the intended usage).

Why don't you just store the lazy encoder somewhere (it is a reference type) and reuse it when a new row arrives?

// 1. Create lazy encoder and store it somewhere.
self.cacheEncoder = CSVEncoder {
    $0.bufferingStrategy = .sequential
}.lazy(into: /* Whatever you need */)

// 2. Then, every time a new row arrives.
try self.cacheEncoder.encodeRow(customStructure)

// 3. Once you stop the live subscription, call
try self.cacheEncoder.endEncoding()
// There is also a deinit defined, which will call `endEncoding()` for you
// (if you haven't called it already) when the reference is deallocated.

leisurehound commented 4 years ago

I did try to store the encoder as a class property, but I call endEncoding() after each datum so the file is tail-able, for example, and that caused the next datum's encoding to fail. For a long-running process with a live feed, data persistence is paramount, so I keep my output stream tied to an open file and continuously append to it, then close the output stream on deinit, which could be never (or at server reboot, or when files roll). It's somewhat akin to using CSV encoding for a logger, I guess, which may be a bit of an abuse. But data loggers do that by nature, and CSV is a natural export format for columnar time-series data.

Would an option to flush the encoder, instead of ending it, be viable here? The other option is for the client code to do so on a Timer: let the buffer grow, end encoding, then create a new encoder again.

I think this may simply be a case where encoding is not worth the effort beyond:

let data = "\(dateStr),\(durationStr),\(failureStr)".data(using: .utf8)

dehesa commented 4 years ago

If I understood everything correctly, I believe you have three options to satisfy your requirements.

  1. Create a lazy encoder with .sequential buffering strategy.

    Store the lazy encoder and call encodeRow(customType). The sequential buffering strategy only keeps in memory the last row to be written; therefore, when a new row arrives, it writes the previous row. On its own that wouldn't do what you want, but if the type being encoded uses an unkeyedContainer, the fields are written sequentially and the encoder actually writes the fields to the file as they arrive (without keeping them in memory).

    So to sum up: use a lazy encoder, call encodeRow every time a new row arrives (do not call endEncoding afterwards), and ensure that you are using unkeyedContainers and encoding all fields in the row (including empty fields). Then it will have the behavior you want.

  2. Use the imperative CSVWriter.

    This writer does exactly what you tell it to do. Thus, if you call the writer's write(row:), it actually writes that row to the file, including the row delimiter (see the sketch after this list).

  3. Just append data to the file as you were suggesting.

    The upside of using an encoder or writer is that they have an easier-to-understand API, they handle edge cases (such as CSV escaping), and they free you from muddling with file handles. Still, sometimes using low-level APIs might be the right solution if your data format never changes.
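
A minimal sketch of option 2 follows; the initializer parameters are an assumption (check the README for the exact signature), but write(row:) is the call mentioned above:

// Imperative writing: each write(row:) persists a complete row to the file,
// including the row delimiter, as soon as it is called.
let writer = try CSVWriter(fileURL: fileURL, append: true)
try writer.write(row: [dateString, "3.23423", ""])     // <datestring>,3.23423,
try writer.write(row: [dateString, "", "failure"])     // <datestring>,,failure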

leisurehound commented 4 years ago

OK, thanks for the details and the help here. Those details were not gleaned from your README page when I first attempted to integrate the library. I may send you a PR for the README to expand on usage so it's easier for first-time users who may be new to CSV coding; what's there now seems to carry some unstated assumptions/knowledge that I certainly didn't have. I'll admit I had to trap errors to figure out what I needed to do or was doing wrong, which is not efficient, and even then I wasn't really using it correctly without what you've shared here.

dehesa commented 4 years ago

I agree there might be some assumptions in the README about how to use the library. I have spent so much time on it that some knowledge is second nature to me. I would be delighted if you send PRs my way; having a different view on how to express things is always valuable.

I will close this issue now, but feel free to open new ones if you have any more questions or you would like to improve something. If you decide to get involved in the project, the following things help: