SciRuby / daru

Data Analysis in RUby
BSD 2-Clause "Simplified" License
1.04k stars 139 forks source link

Open SQL databases by specifying file name #298

Closed v0dro closed 7 years ago

v0dro commented 7 years ago

Currently the .from_sql method requires a DBI::DatabaseHandle object for loading data from an SQL database. In case you have a database.db file in your system, it becomes cumbersome to load this data.

There should be support for loading a database from such a file with the Daru::DataFrame.from_sql method.

Example:

Daru::DataFrame.from_sql '/path/to/db/database.db', 'some SQL query string`
zverok commented 7 years ago

Probably related: https://github.com/SciRuby/daru/issues/138

athityakumar commented 7 years ago

@v0dro @zverok - Can we use ActiveRecord gem here to work with SQL database and collect query reposonses from the table as an object / list? After receiving an object of a table, we can probably then tweak the from_sql function of Daru::DataFrame module.

zverok commented 7 years ago

@athityakumar Well, I am definitely against adding ActiveRecord as a dependency for a Daru core. In attempt to support any DB possible, with direct dependency on some abstraction layer/ORM, probably companion gem like daru-db could be created. @v0dro, WDYT?..

(And, BTW, if to decide on what to depend for database access, I'd prefer much more lightweight Sequel library.)

v0dro commented 7 years ago

@zverok is correct. We should not add extra dependencies.

For this particular issue I think you can get away with just checking if the extension of the file name is .db. If that is the case, you can a library for loading the database from the file.

Daru currently uses dbi and sqlite3 as development dependencies. dbi can use different database adapters for loading data. If you can use dbi for loading from file that would be great so that the user only needs to install the appropriate adapter.

dshvimer commented 7 years ago

I am going to see what can be done with DBI. If that does not work, I think between sqlite3 and this dbf gem we can have a large portion of database file types covered.

In regards to a companion gem, how big would daru-db have to be? Would it need to support relational, no sql, and file based db's?

EDIT: Pull Request

zverok commented 7 years ago

As I've already wrote in PR,

you should first clarify its scope (by investigating "which formats of binary DB files are available on popular scientfic/governmental/business datasources), and only then the coding attempt.

E.g., it is a necessary step to check:

We'd be really happy if you could spend some (not much, yet decent) effort to answer and dicsuss those questions.

dshvimer commented 7 years ago

@zverok, I just wanted to make sure the PR was referenced here.

Other database file formats exist, namely: DBF, however they do not support SQL in the traditional sense. If we wanted to support them it does not make sense to add the functionality from within DataFrame.from_sql. There are utilities that allow querying of these formats with SQL like syntax, however it is just syntactic sugar. I took a look at how pandas handles these formats, and it does not. Instead there are utilities for converting these formats into CSV files.

If we want to add support for the DBF format for instance, I think it would make more sense to do so from a new class, not SqlDataSource.

A second item to note regarding pandas. It does not support loading a SQLite file by path. It's support is similar to what Daru already has implemented.

zverok commented 7 years ago

Now, @v0dro, it is probably a question to you: what use cases you've had in mind when creating this GitHub issue?

v0dro commented 7 years ago

Basically, I had a SQLite file database.db which I found hard to read into a DataFrame without a lot work and reading documentation.

An SQL database file is still just a database, and opening it in a dataframe should not be much of a hassle.

v0dro commented 7 years ago

@dshvimer I saw the pandas example of opening an SQLite database, and found it to be quite satisfactory. However, it is common practice to share test databases (mostly SQLite) on web in the form of .db files and opening at least SQLite files should be a trivial thing to do in daru, hence this issue.

@zverok I think we can get away by implementing read-from-file only for simple SQL databases, given the complexity of implementing a one-size-fits-all solution.

zverok commented 7 years ago

Basically, I had a SQLite file database.db which I found hard to read into a DataFrame without a lot work and reading documentation.

But was it SQLite? Or something else? That's what I am trying to understand -- is there case for data distributed in format of "sqlite db"?

v0dro commented 7 years ago

Yes. It was SQLite.

v0dro commented 7 years ago

This one specifically: https://github.com/pratapvardhan/pycon-2015/blob/master/data/aapdonation.db

zverok commented 7 years ago

OK, let's say that SQLite is our choice. I'll re-review everything from this point of view tomorrow.

dshvimer commented 7 years ago

If we need to load this directly from the web, I may need to make some changes.

May need to consider this issue as well

v0dro commented 7 years ago

No you don't need to load from the web. Its sufficient to expect the user to have downloaded the file onto their machine.

ananyo2012 commented 7 years ago

Can someone explain how the sql data source is tested? I don't see the sql database in the test suite but the db_name is used in the spec.

zverok commented 7 years ago

@ananyo2012 SQLite database is created dynamically on the fly: https://github.com/SciRuby/daru/blob/master/spec/support/database_helper.rb#L11

zverok commented 7 years ago

BTW, this issue should be considered closed :)