18F / open-data-maker

make it easy to turn a lot of potentially large csv files into easily accessible open data
Other
200 stars 135 forks source link

Stats queries return integers as numbers, floats as strings #278

Open yozlet opened 8 years ago

yozlet commented 8 years ago

Example:

{
  "aggregations": {
    "school.tuition_revenue_per_fte": {
      "count": 7362,
      "min": 0,
      "max": 207213,
      "avg": "0.9954935343656614E4",
      "sum": 73288234,
      "sum_of_squares": 1249896256472,
      "variance": "0.7067598825743325E8",
      "std_deviation": "0.8406901228005076E4",
      "std_deviation_bounds": {
        "upper": "0.26768737799666764E5",
        "lower": "-0.6858867112353537E4"
      }
    }
  }
}

... whereas what we get when querying ES directly is:

{
  "aggregations": {
    "school.tuition_revenue_per_fte": {
      "count": 7362,
      "min": 0,
      "max": 207213,
      "avg": 9954.935343656614,
      "sum": 7.3288234E7,
      "sum_of_squares": 1.249896256472E12,
      "variance": 7.067598825743325E7,
      "std_deviation": 8406.901228005076,
      "std_deviation_bounds": {
        "upper": 26768.737799666764,
        "lower": -6858.867112353537
      }
    }
  }
}

I initially thought this was something to do with Ruby mis-parsing scientific notation, but now I suspect it's something in our type-detection output code.

ultrasaurus commented 8 years ago

I just discovered the same thing. For quick reference, here's a sample query:

http://localhost:3000/v1/schools/stats?school.degrees_awarded.predominant=3&_fields=school.instructional_expenditure_per_fte&_metrics=avg,sum,std_deviation,std_deviation_bounds

ultrasaurus commented 8 years ago

@siruguri would you have time to take a look at what's going on here?

siruguri commented 8 years ago

I have to re-install elastic search because I got a new laptop, and remind myself how to load up some dummy data... I can take a whirl at that this afternoon, and hopefully at least point out what the deal is... not sure if I can get around to fixing/testing it in the next few days...

On Wed, Feb 17, 2016 at 2:19 PM, Sarah Allen notifications@github.com wrote:

@siruguri https://github.com/siruguri would you have time to take a look at what's going on here?

— Reply to this email directly or view it on GitHub https://github.com/18F/open-data-maker/issues/278#issuecomment-185432915 .

siruguri commented 8 years ago

in controllers.rb, line 75, the .to_json calls .to_s on each item in the hash data. Floats are BigDecimals; so to_s makes them strings which I think is the behavior of BigDecimal, whereas the integers are Fixnums (maybe?).

So the solution would be either to intercept the to_s if possible in JSON, or post-process the output.

I am not too sure when I can get to this, maybe next week? I know I also have a failing test request on another issue.

On Wed, Feb 17, 2016 at 2:27 PM, Sameer Siruguri siruguri@gmail.com wrote:

I have to re-install elastic search because I got a new laptop, and remind myself how to load up some dummy data... I can take a whirl at that this afternoon, and hopefully at least point out what the deal is... not sure if I can get around to fixing/testing it in the next few days...

On Wed, Feb 17, 2016 at 2:19 PM, Sarah Allen notifications@github.com wrote:

@siruguri https://github.com/siruguri would you have time to take a look at what's going on here?

— Reply to this email directly or view it on GitHub https://github.com/18F/open-data-maker/issues/278#issuecomment-185432915 .

siruguri commented 8 years ago

So the writer of BigDecimal actually wanted the default implementation to be that floating points are rendered as strings by the to_json method, to reduce the possibility of (small) errors during de/re-serialization of data. As far as I could figure, it seems like monkey-patching BigDecimal is the only solution, with the caveat that it will produce this corner-case error.

Should I do that? Should we instead let the numbers be strings so that the client system can decide how to re-cast to their version of the floating-point type?

yozlet commented 8 years ago

Sorry to be so slow, @siruguri! Ugh, this is a nasty one, and I'm really torn. Providing floats as strings feels like giving a worse user/developer experience in return for a level of accuracy that - in all the existing use cases, anyway - really isn't needed. On the other hand, Open-data-maker is meant to be domain-agnostic, and I can't speak for the needs of future data sets and their users. (On the other other hand, I don't know what ES uses for its floats internally; the accuracy may already have been compromised before the stats were emitted.)

My inclination is to monkey-patch in favour of a better user experience. What does everyone else think?

siruguri commented 8 years ago

I'm open to either solution though I think I'd rather un-string the value in the open-data-maker code than monkey-patch BigDecimal. I'll work on this today.

On Fri, Mar 4, 2016 at 12:18 PM, Yoz Grahame notifications@github.com wrote:

Sorry to be so slow, @siruguri https://github.com/siruguri! Ugh, this is a nasty one, and I'm really torn. Providing floats as strings feels like giving a worse user/developer experience in return for a level of accuracy that - in all the existing use cases, anyway - really isn't needed. On the other hand, Open-data-maker is meant to be domain-agnostic, and I can't speak for the needs of future data sets and their users. (On the other other hand, I don't know what ES uses for its floats internally; the accuracy may already have been compromised before the stats were emitted.)

My inclination is to monkey-patch in favour of a better user experience. What does everyone else think?

— Reply to this email directly or view it on GitHub https://github.com/18F/open-data-maker/issues/278#issuecomment-192448370 .

siruguri commented 8 years ago

In fact, now that I think about it,

Providing floats as strings feels like giving a worse user/developer experience in return for a level of accuracy that - in all the existing use cases, anyway - really isn't needed.

sounds equiv to, "let's cast the BigDecimal object via .to_f" Should I do that instead? Thoughts? Then .to_json will continue to produce non-stringified numbers.

yozlet commented 8 years ago

@siruguri Yes, that sounds like the right option to me. Thank you!

siruguri commented 8 years ago

303