comses / miracle

Repeatable data analysis workflows for computational models

metadata extraction: set DataColumn.data_type to appropriate DataType #46

Closed: alee closed this issue 8 years ago

alee commented 8 years ago

DataColumn.data_type is currently being assigned Real for any real number by analyzer.py. The data model tries to distinguish between integer and floating point numbers; can you try to set those properly?

http://stackoverflow.com/questions/4541155/check-if-a-number-is-int-or-float may be useful
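
For illustration, here is a minimal sketch of telling integers from floats when the raw value arrives as text, as it would from a CSV cell. The helper name `parse_number` is hypothetical and is not taken from analyzer.py; it simply tries `int()` before falling back to `float()`.

```python
def parse_number(token):
    """Return an int, a float, or None for a raw text token.

    Hypothetical helper for illustration only; not part of analyzer.py.
    """
    text = token.strip()
    try:
        # int() rejects text containing a decimal point or an exponent
        return int(text)
    except ValueError:
        pass
    try:
        # float() accepts forms such as "3.14" and "1e3"
        return float(text)
    except ValueError:
        # Not numeric at all
        return None


for raw in ["42", "-7", "3.14", "1e3", "abc"]:
    value = parse_number(raw)
    print(raw, type(value).__name__)
```

With this approach a value written as `3.0` stays a float. If whole-valued floats should count as integers, `float.is_integer()` could be checked as an extra step.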

Currently the list of DataColumn DataTypes is:

- bigint (integer)
- boolean
- decimal (float)
- text (string)

We can adjust this list as needed. Thoughts?

cpritcha commented 8 years ago

Sure. I'll change the data types.

cpritcha commented 8 years ago

Should be fixed now

alee commented 8 years ago

Things like agent_id in the luxedemo example should be classified as an integer, as should many of the variables in the runLog data group (runID, random, runNumber, msgOutLevel, worldx, worldy, etc.). It looks like they all get thrown into the decimal / float bucket though.
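
To make the expected behavior concrete, a column-level guess might look like the sketch below. The function names `guess_column_type` and `_parses` are hypothetical, and the rules are assumptions for illustration rather than what analyzer.py actually does; the returned labels follow the DataType list earlier in this thread.

```python
def _parses(converter, token):
    """Return True if converter(token) succeeds (hypothetical helper)."""
    try:
        converter(token)
        return True
    except ValueError:
        return False


def guess_column_type(values):
    """Guess 'boolean', 'bigint', 'decimal' or 'text' for a column of strings.

    Illustrative sketch only; the real analyzer may apply different rules.
    """
    # Ignore empty cells so missing values do not force a 'text' result
    tokens = [v.strip() for v in values if v.strip() != ""]
    if not tokens:
        return "text"
    if all(t.lower() in ("true", "false") for t in tokens):
        return "boolean"
    # Every value must parse as an int for the column to be 'bigint'
    if all(_parses(int, t) for t in tokens):
        return "bigint"
    if all(_parses(float, t) for t in tokens):
        return "decimal"
    return "text"


print(guess_column_type(["1", "2", "3"]))    # bigint
print(guess_column_type(["1", "2.5", "3"]))  # decimal
print(guess_column_type(["1", "2.0", "3"]))  # decimal
```

Under these rules a single value written as `2.0` is enough to move an otherwise integer column into the decimal bucket, so how the source data formats its numbers matters.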

cpritcha commented 8 years ago

When I look at the luxe example I imported, agent_id, runID, msgOutLevel, worldx and worldy are all classified as bigints. Could you run the unit tests and see if the _test_guesstype test succeeds?

cpritcha commented 8 years ago

I've put up some changes that definitely improve the parsing of decimal and bigint types, but I couldn't replicate the misclassification of the columns you mention (runID, random, runNumber, etc.). If these changes do not help, maybe you could take a look at the locale settings? Mine are:

(.miracle)vagrant@webserver:/vagrant/django$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
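
To compare numeric locale settings from inside Python rather than the shell, something like the sketch below could be used. It relies only on the standard `locale` module; the helper name `numeric_conventions` is made up for this example. Note that the built-in `float()` and `int()` do not consult the locale, so a locale difference would only matter if the parsing code uses locale-aware functions such as `locale.atof`, which is an assumption here.

```python
import locale


def numeric_conventions():
    """Return (decimal_point, thousands_sep) for the active LC_NUMERIC locale."""
    conv = locale.localeconv()
    return conv["decimal_point"], conv["thousands_sep"]


# The "C" locale is always available, so it serves as a known baseline here.
locale.setlocale(locale.LC_NUMERIC, "C")
print(numeric_conventions())

print(locale.atof("1.5"))  # locale-aware parse, follows LC_NUMERIC
print(float("1.5"))        # built-in parse, ignores the locale
```

Running the same helper on both machines, without the `setlocale` call to "C", would show whether they disagree on the decimal separator.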

alee commented 8 years ago

I thought I'd commented on this already, but I think this is only an issue on my Arch dev box, which runs gdal 2.0.x with Python 3 by default. I'll test this out on production very soon and close if it's working there.