Format - Githubissues

kgryte commented 9 years ago

A few comments:

To make the data more amenable to matrices, I propose two changes:
- convert the years to numeric values
- make each name's value an array of arrays
```
...
"john": [[1881,3432],[1882,23243],...],
...
```
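
A minimal sketch of that conversion, assuming the current format maps each name to an object keyed by year strings (the thread does not show the existing layout, so that input shape and the `toPairs` helper name are hypothetical):

```javascript
// Hypothetical helper: convert { '1881': 3432, ... } into [[1881,3432],...].
// The input shape is an assumption; the thread does not show the current format.
function toPairs( counts ) {
  var years = Object.keys( counts ),
    out = [],
    i;
  for ( i = 0; i < years.length; i++ ) {
    // Skip years with no recorded value instead of storing null.
    if ( counts[ years[ i ] ] !== null ) {
      out.push( [ parseInt( years[ i ], 10 ), counts[ years[ i ] ] ] );
    }
  }
  // Sort chronologically by numeric year.
  out.sort( function ( a, b ) {
    return a[ 0 ] - b[ 0 ];
  });
  return out;
}

var pairs = toPairs( { '1882': 23243, '1881': 3432, '1883': null } );
console.log( JSON.stringify( pairs ) );
// prints [[1881,3432],[1882,23243]]
```
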
How can we break down the dataset to be more "requireable"? Suppose I only want the "j" names. Because the data is a single JSON file, I have to load the entire dataset into memory and then filter. What if each name were stored as a separate JSON file in the lib directory? The index.js would then assemble all the names into a single object for export.
```
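var fs = require( 'fs' ),
  path = require( 'path' );

var files = fs.readdirSync( __dirname ),
  len = files.length,
  dataset = {},
  name,
  ext,
  f, i;

// Assemble every per-name JSON file in this directory into a single object.
for ( i = 0; i < len; i++ ) {
  f = files[ i ];
  ext = f.substring( f.length-5 );
  if ( ext === '.json' ) {
    // Strip the extension to recover the name used as the key.
    name = f.substring( 0, f.length-5 );
    dataset[ name ] = require( path.join( __dirname, f ) );
  }
}

module.exports = dataset;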
```
Then, for those so inclined, I could bypass the main export and require only what I wanted.
```
var john = require( 'datasets-male-first-names-us-frequency/lib/john.json' );
```
Alternatively, maybe the dataset is not the main export? Should we allow lazy loading? I suppose that there could be a separate module which layers a loading API on top of this module.

Planeshifter commented 9 years ago

I thought about using an array of arrays, but because a lot of names have missing data (i.e., fewer than 5 people were given that name in the respective year), many arrays would contain a lot of null values and the resulting JSON file would be much larger (roughly 4x). This might be less problematic once we split the data into individual files.

I think lazy loading is a good idea, but I am wondering how to implement it. The easiest way would be to require the data for a given name once the respective property is accessed, which one could easily implement with a getter function. But since require is a synchronous, blocking call, this might not be the best route. Any ideas?
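
For illustration, the getter approach might look roughly like the sketch below. The `lazyDataset` helper, its `dir` and `names` parameters, and the one-file-per-name layout are all assumptions, not existing code in this module. Each first access still blocks on `require`, which is exactly the concern raised above.

```javascript
var path = require( 'path' );

// Hypothetical helper: define one lazy property per name on a dataset object.
// `dir` is assumed to hold one `<name>.json` file per name.
function lazyDataset( dir, names ) {
  var dataset = {};
  names.forEach( function ( name ) {
    Object.defineProperty( dataset, name, {
      enumerable: true,
      configurable: true,
      get: function () {
        // Synchronous, blocking load on first access.
        var value = require( path.join( dir, name + '.json' ) );
        // Swap the getter for the loaded value so later reads are plain lookups.
        Object.defineProperty( dataset, name, {
          enumerable: true,
          value: value
        });
        return value;
      }
    });
  });
  return dataset;
}
```

Usage would be along the lines of `lazyDataset( __dirname, [ 'john' ] ).john`, with nothing read from disk until `.john` is touched.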

kgryte commented 9 years ago

Yeah, probably best not to implement the lazy loading here. Another module could sit on top of this one and provide getters which lazy load. That module could use fs rather than require and load either sync or async.
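
A rough sketch of what such a layered module could expose, using `fs` instead of `require`. The `loadSync` and `load` names and the `dir` argument are hypothetical; nothing like this exists in the module yet.

```javascript
var fs = require( 'fs' ),
  path = require( 'path' );

// Hypothetical sync loader: `dir` is wherever the per-name JSON files live.
function loadSync( dir, name ) {
  var file = path.join( dir, name + '.json' );
  return JSON.parse( fs.readFileSync( file, 'utf8' ) );
}

// Hypothetical async loader with a Node-style callback: clbk( error, data ).
function load( dir, name, clbk ) {
  var file = path.join( dir, name + '.json' );
  fs.readFile( file, 'utf8', function onRead( error, data ) {
    var parsed;
    if ( error ) {
      return clbk( error );
    }
    try {
      parsed = JSON.parse( data );
    } catch ( err ) {
      // Report malformed JSON through the callback rather than throwing.
      return clbk( err );
    }
    clbk( null, parsed );
  });
}
```

Unlike `require`, neither function caches its result, so a caller that wants caching would need to add it.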

Re: nulls. The array of arrays need not be the same length; e.g., if John has been recorded every year since 1880 and Athan only since 1984, then John will have 135 values and Athan only 31 values.

In terms of matrices, I was thinking a matrix on a per name basis. To clarify, a matrix across the entire population would not be reasonable given the need to fill in the missing values.
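
As a small illustration of the per-name view, each name's array of arrays can be treated as an N-by-2 matrix whose columns are years and counts. The `column` helper below is hypothetical, and the sample rows reuse the values from the format example earlier in the thread.

```javascript
// Hypothetical helper: pull column `j` out of an N-by-2 array of arrays.
function column( pairs, j ) {
  return pairs.map( function ( row ) {
    return row[ j ];
  });
}

var john = [ [ 1881, 3432 ], [ 1882, 23243 ] ];
var years = column( john, 0 );  // [ 1881, 1882 ]
var counts = column( john, 1 ); // [ 3432, 23243 ]
```

Because N differs per name, nothing here needs padding with nulls.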

kgryte commented 9 years ago

Lazy loading is not a critical feature; more food for thought in terms of how to structure and organize the data.

Planeshifter commented 9 years ago

Okay, thanks for the clarification re: array of arrays encoding. Sounds like a plan, will make the necessary changes.

kgryte commented 9 years ago

+1.

datasets-io / male-first-names-us-frequency

Format #1