djsutherland / pummeler

Utilities to analyze ACS PUMS files, especially for distribution regression / ecological inference
MIT License

warnings/errors from featurize #20

Closed flaxter closed 7 years ago

flaxter commented 7 years ago

When I run:


# ./pummel featurize --seed 17 --subsets "DEAR == 1, DEYE == 1, DOUT == 1, DRAT == 2 | DRAT == 3 | DRAT == 4 | DRAT == 5, DREM == 1, ENG == 2, ENG == 3 | ENG == 4, FER == 1, GCL == 1, GCR == 1, HINS2 == 1, HINS3 == 1, HINS4 == 1, HINS5 == 1, HINS6 == 1, HINS7 == 1" --do-my-additive --skip-rbf regions regions/socio.npz 
  0% (      0 of 9222637) |                                                                                              | Elapsed Time: 0:00:00 ETA:  --:--:--
/home/ubuntu/pummeler/pummeler/featurize.py:534: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df['NAICSP'] = df.NAICSP.map(naics_cat, na_action='ignore')
/home/ubuntu/pummeler/pummeler/featurize.py:538: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df['OCCP'] = df.OCCP.astype(float).map(occ_cat, na_action='ignore')
/home/ubuntu/pummeler/pummeler/featurize.py:544: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df['FOD1P'] = df.FOD1P.map(fod_cats, na_action='ignore')
/home/ubuntu/pummeler/pummeler/featurize.py:546: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df['FOD2P'] = df.FOD2P.map(fod_cats, na_action='ignore')
/home/ubuntu/pummeler/pummeler/featurize.py:549: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df['ANYHISP'] = (df.HISP > 1).astype(int)
/home/ubuntu/pummeler/pummeler/featurize.py:551: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df['HASDEGREE'] = (df.SCHL >= 20).astype(int)
/home/ubuntu/pummeler/pummeler/featurize.py:556: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  'hispanic').map(_ethnicity_map)
 14% (1360078 of 9222637) |##############                                                                                  | Elapsed Time: 0:02:50 ETA: 0:17:56

Also, one of my featurization runs crashed at the end, maybe because I was out of tmp space? Do you know how to tell Python to use a different tmp directory? Or maybe I'll just disable the savez_compressed and not put these files into Dropbox.

Traceback (most recent call last):
  File "./pummel", line 5, in <module>
    main()
  File "/home/ubuntu/pummeler/pummeler/cli.py", line 153, in main
    args.func(args, parser)
  File "/home/ubuntu/pummeler/pummeler/cli.py", line 185, in do_featurize
    np.savez_compressed(args.outfile, **res)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 600, in savez_compressed
    _savez(file, args, kwds, True)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 639, in _savez
    pickle_kwargs=pickle_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/format.py", line 584, in write_array

djsutherland commented 7 years ago

Warnings fixed in 78ac3bba3e43fe3ed571a5cbff68be33375f35b7; any embeddings based on those variables that were computed before that fix are going to be wrong. The problem only shows up when your --subsets actually remove rows, which my test case last night didn't do.
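For anyone hitting the same warning elsewhere, here is a minimal sketch of the general pattern. It is not the code from that commit: the toy frame and the DEAR == 1 filter are made up for illustration, and only the column names echo the lines quoted in the warnings. The idea is to take an explicit copy of the filtered rows before assigning derived columns, so the assignments unambiguously land on the frame that is used afterwards.

```python
import pandas as pd

# Toy stand-in for a chunk of PUMS person records (values are made up).
df = pd.DataFrame({
    "DEAR": [1, 2, 1, 2],
    "HISP": [1, 2, 24, 1],
    "SCHL": [16.0, 21.0, 20.0, 12.0],
})

# Filtering with a boolean mask drops rows, much as a --subsets
# expression would. Without .copy(), pandas cannot tell whether later
# assignments are meant for the original frame or the filtered one,
# which is what SettingWithCopyWarning complains about.
subset = df[df.DEAR == 1].copy()

# Derived columns are now set on an independent frame.
subset["ANYHISP"] = (subset.HISP > 1).astype(int)
subset["HASDEGREE"] = (subset.SCHL >= 20).astype(int)
```

The original df is left untouched, and subset carries the new columns for the two rows that passed the filter.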

The second problem is that numpy saves the arrays to $TMPDIR before compressing them. 2c0026081b1ea0080ed9d4bbb4b6c83491a8e3e3 makes compression optional and off by default, with a note about TMPDIR; set TMPDIR to a directory with enough free space before running with --save-compressed.
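On the question of pointing Python at a different tmp directory: the standard tempfile module consults the TMPDIR environment variable (then TEMP and TMP) when it picks its default directory. The sketch below only demonstrates that mechanism. The scratch directory is a stand-in created on the fly, and whether a given numpy version routes its intermediate files through tempfile is an assumption based on the behaviour described above.

```python
import os
import tempfile

# Stand-in for a directory on a disk with plenty of free space;
# in practice this would be an existing path chosen by the user.
scratch_dir = tempfile.mkdtemp(prefix="pummeler-scratch-")

# tempfile reads TMPDIR when it first works out the default directory
# and then caches the answer, so reset the cache after changing it.
os.environ["TMPDIR"] = scratch_dir
tempfile.tempdir = None

current_tmp = tempfile.gettempdir()

# Any temporary file created without an explicit dir= now lands there.
with tempfile.NamedTemporaryFile(suffix="-numpy.npy") as f:
    temp_file_dir = os.path.dirname(f.name)
```

Setting the variable in the shell before launching the process has the same effect and needs no cache reset, for example by prefixing the ./pummel featurize command with TMPDIR=/some/big/disk (a hypothetical path).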