cardillo / joinery

Data frames for Java
https://joinery.sh
GNU General Public License v3.0

Coercion methods #49

Open wbuchanan opened 8 years ago

wbuchanan commented 8 years ago

This is more of a question or potential enhancement request than anything else. What would it take to add methods that coerce existing objects into a DataFrame? I imagine 2D arrays would be fairly easy to handle, although I could be completely wrong. I am wrapping up some other work on readers/parsers for Stata-formatted files (with other formats to follow), and my hope is to build those classes and methods around coercing the data into a DataFrame. That would bring the advantage of joins and unions across files from different statistical software platforms. I haven't looked closely at the documentation yet, but it would also help if there were a way to retain metadata with the file, e.g. variable labels (distinct from column names) and value labels (analogous to the descriptions in a lookup table in a SQL database).

cardillo commented 8 years ago

There are currently methods to read and write CSV and Excel files, and these generally provide the interoperability I need. That said, I realize they are rather low fidelity (i.e. they preserve column names but not much else). There are also methods to convert to 2D arrays, but not from them. I think this would be a useful addition, as would reading and writing other formats. I can take a look at adding these features, or I will gladly merge a pull request.
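
To make the missing "from a 2D array" direction concrete, here is a minimal sketch in plain Java. It is not joinery code: `ArrayCoercion` and `toColumns` are hypothetical names, and the result is an ordinary map of named columns rather than a `DataFrame`. It only shows the row-major to column-major step that any such coercion method would need, assuming the column names are supplied separately from the data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of joinery: turns a row-major 2D array
// into named columns, the shape a column-oriented data frame works with.
final class ArrayCoercion {

    private ArrayCoercion() {}

    static Map<String, List<Object>> toColumns(String[] names, Object[][] rows) {
        // LinkedHashMap keeps the columns in the order the names were given.
        Map<String, List<Object>> columns = new LinkedHashMap<>();
        for (String name : names) {
            if (columns.put(name, new ArrayList<>(rows.length)) != null) {
                throw new IllegalArgumentException("duplicate column name: " + name);
            }
        }
        for (int r = 0; r < rows.length; r++) {
            // Reject ragged input instead of silently padding or truncating.
            if (rows[r].length != names.length) {
                throw new IllegalArgumentException(
                    "row " + r + " has " + rows[r].length
                        + " values, expected " + names.length);
            }
            int c = 0;
            for (List<Object> column : columns.values()) {
                column.add(rows[r][c++]);
            }
        }
        return columns;
    }
}
```

Calling `ArrayCoercion.toColumns(new String[] {"id", "name"}, new Object[][] {{1, "a"}, {2, "b"}})` gives an `id` column holding 1 and 2 and a `name` column holding "a" and "b". How those columns would then be handed to a `DataFrame` is left open here, since that depends on the constructors and methods joinery actually exposes.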

Variable labels might be a little more difficult: Joinery doesn't currently store any additional information about the individual data points. While this certainly could be added, it isn't as high a priority for me personally. But again, pull requests are welcome.
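
Until something like that exists in the library, one possible workaround is to keep the labels in a side-car object keyed by column name. The sketch below is only an illustration of that idea: `ColumnMetadata` and all of its methods are hypothetical names, nothing in it comes from joinery, and it assumes labels are looked up by the same column names the data frame uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical side-car for Stata-style metadata; not a joinery class.
final class ColumnMetadata {

    // Column name -> descriptive variable label.
    private final Map<String, String> variableLabels = new LinkedHashMap<>();
    // Column name -> (stored value -> value label), like a small lookup table.
    private final Map<String, Map<Object, String>> valueLabels = new LinkedHashMap<>();

    void setVariableLabel(String column, String label) {
        variableLabels.put(column, label);
    }

    void setValueLabel(String column, Object value, String label) {
        valueLabels.computeIfAbsent(column, k -> new LinkedHashMap<>()).put(value, label);
    }

    // Falls back to the column name when no variable label was recorded.
    String variableLabel(String column) {
        return variableLabels.getOrDefault(column, column);
    }

    // Falls back to the raw value's string form when no value label was recorded.
    String describe(String column, Object value) {
        Map<Object, String> labels = valueLabels.get(column);
        String label = labels == null ? null : labels.get(value);
        return label != null ? label : String.valueOf(value);
    }
}
```

The trade-off of keeping this outside the frame is that the labels do not follow the data automatically: renaming, dropping, or joining columns would leave the side-car out of date unless the caller updates it too.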

wbuchanan commented 8 years ago

The only working example I have at the moment is some work I did on serializing in-memory data to a JSON object using Stata's Java API: https://github.com/wbuchanan/StataJSON. I've broken some of that work into more generic classes here, and I'm trying to test coercing some of the data to a DataFrame. There is a C library that could be helpful for parsing files from statistical packages (https://github.com/WizardMac/ReadStat), but I'm not terribly familiar with JNI or with how the C library works. I think once I figure out how to get the data into a DataFrame object, I could probably figure out how to get it into an object suitable for Stata.