datasets-io / male-first-names-us-frequency

Historical frequency of male first names in the population of U.S. births.
MIT License

Format #1

Open kgryte opened 9 years ago

kgryte commented 9 years ago

A few comments:

Planeshifter commented 9 years ago

I thought about using an array of arrays, but because a lot of names have missing data (i.e., fewer than 5 people were given that name in the respective year), many arrays would contain a lot of null values, and the resulting JSON file is much larger (roughly 4x). This might be less of a problem once we split the data into individual files. I think lazy loading is a good idea, but I am wondering how to implement it. The easiest way would be to require the data for a given name once the respective property is accessed, which could easily be implemented with a getter function. But since require is a synchronous, blocking call, this might not be the best route. Any ideas?
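For illustration, the getter idea described above could look roughly like the sketch below. The one-file-per-name layout, the createLazyDataset helper, and the values written to the temporary files are all assumptions made up for this example; they are not the module's actual structure or data.

```javascript
'use strict';

const fs = require('fs');
const os = require('os');
const path = require('path');

// Assumed layout: one JSON file per name. To keep the sketch runnable,
// write two tiny files with placeholder (not real) values to a temp directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'names-'));
fs.writeFileSync(path.join(dir, 'John.json'), JSON.stringify({ values: [1, 2, 3] }));
fs.writeFileSync(path.join(dir, 'Athan.json'), JSON.stringify({ values: [4, 5] }));

// Records which names have actually been loaded from disk.
const loaded = [];

// Hypothetical helper: returns an object whose properties load on first access.
function createLazyDataset(names) {
  const out = {};
  names.forEach(function (name) {
    Object.defineProperty(out, name, {
      enumerable: true,
      configurable: true,
      get: function () {
        // require() is synchronous: it blocks while reading and parsing the file.
        const data = require(path.join(dir, name + '.json'));
        loaded.push(name);
        // Swap the getter for a plain value so later accesses skip the load.
        Object.defineProperty(out, name, { value: data, enumerable: true, writable: false });
        return data;
      }
    });
  });
  return out;
}

const dataset = createLazyDataset(['John', 'Athan']);
console.log(loaded.length);        // nothing has been read yet
console.log(dataset.John.values);  // first access triggers the load
console.log(loaded);               // only John has been loaded
```

The blocking concern raised above applies to the first access of each property; after that, the getter has been replaced by the loaded value.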

kgryte commented 9 years ago

Yeah, probably best not to implement the lazy loading here. Another module could sit on top of this one and provide getters which lazy load. That module could use fs rather than require and load the data either synchronously or asynchronously.
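A loader built on fs instead of require might be sketched as follows. The loadSync and load function names, the per-name file layout, and the placeholder values are hypothetical; only fs.readFileSync and fs.readFile are actual Node.js APIs.

```javascript
'use strict';

const fs = require('fs');
const os = require('os');
const path = require('path');

// Assumed layout: one JSON file per name. A temp file with placeholder
// (not real) values keeps the sketch runnable.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'names-'));
fs.writeFileSync(path.join(dir, 'John.json'), JSON.stringify([10, 20, 30]));

function filePath(name) {
  return path.join(dir, name + '.json');
}

// Synchronous variant: blocks until the file has been read and parsed.
function loadSync(name) {
  return JSON.parse(fs.readFileSync(filePath(name), 'utf8'));
}

// Asynchronous variant: Node-style callback, does not block the event loop
// while the file is being read.
function load(name, callback) {
  fs.readFile(filePath(name), 'utf8', function (err, text) {
    if (err) {
      return callback(err);
    }
    let data;
    try {
      data = JSON.parse(text);
    } catch (parseErr) {
      return callback(parseErr);
    }
    callback(null, data);
  });
}

console.log(loadSync('John'));
load('John', function (err, data) {
  if (err) {
    throw err;
  }
  console.log(data);
});
```

Unlike require, neither variant caches its result, so a wrapping module would need to keep its own cache if repeated reads are a concern.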

Re: nulls. The inner arrays need not all be the same length; e.g., if John has been recorded every year since 1880 and Athan only since 1984, then John will have 135 values and Athan only 31.
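One way to picture the variable-length encoding is sketched below. It assumes each name stores its first recorded year alongside a contiguous run of yearly values; the start field, the helper names, and the values themselves (just array indices) are placeholders for illustration, not real frequencies or the format the module uses.

```javascript
'use strict';

// Placeholder values only: each entry is its own index, not a real frequency.
function placeholder(length) {
  return Array.from({ length: length }, function (_, i) {
    return i;
  });
}

// Assumed encoding: first recorded year plus one value per year from then on.
const names = {
  John: { start: 1880, values: placeholder(135) },
  Athan: { start: 1984, values: placeholder(31) }
};

// Last year covered by an entry.
function lastYear(entry) {
  return entry.start + entry.values.length - 1;
}

// Value for a given year, or null when the year falls outside the recorded range.
function valueAt(entry, year) {
  const i = year - entry.start;
  return (i >= 0 && i < entry.values.length) ? entry.values[i] : null;
}

console.log(lastYear(names.John), lastYear(names.Athan)); // 2014 2014
console.log(valueAt(names.Athan, 1983));                  // null
```

With this layout no nulls are stored for the years before a name first appears. A gap in the middle of a name's range would still need a filler value.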

In terms of matrices, I was thinking of one matrix per name. To clarify, a single matrix across the entire population would not be reasonable, given the need to fill in the missing values.

kgryte commented 9 years ago

Lazy loading is not a critical feature; more food for thought in terms of how to structure and organize the data.

Planeshifter commented 9 years ago

Okay, thanks for the clarification re: array of arrays encoding. Sounds like a plan, will make the necessary changes.

kgryte commented 9 years ago

+1.