gdifazio / google-refine

Automatically exported from code.google.com/p/google-refine
0 stars 0 forks source link

MySQL Support Would be Amazing #12

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Try to connect to a MySQL database

As a DB administrator, all of my data is locked up in a MySQL database. I
COULD export it to CSV, then import it into Gridworks, then do some
cleanup, then export it to CSV, then import it to MySQL again, overwriting
the old data, but that seems very complicated.

Since this works in the browser anyway, it would be amazing if it could be
connected to MySQL databases, and if data manipulation could happen from there.

This would unlock a TON of new data, though I'm unsure how the program
would scale to the quantities of information that would be pulled in.

Original issue reported on code.google.com by mjlissner on 10 May 2010 at 5:39

GoogleCodeExporter commented 9 years ago
That sounds more like an enhancement request than a defect report.

To generalize things a bit, support for a standard data access API would allow 
people 
to plug in multiple DB backends.

Original comment by tfmorris on 10 May 2010 at 7:26

GoogleCodeExporter commented 9 years ago
One of the major challenges here is how to support undo/redo when changes can 
get into 
the back-end database without going through Gridworks. Another major challenge 
is where 
to store metadata (such as reconciliation records) that is specific to 
Gridworks and 
not native to any existing back-end database.

Original comment by dfhu...@gmail.com on 10 May 2010 at 7:32

GoogleCodeExporter commented 9 years ago
Hmm...I presume a new project would have to be created when doing this, and that
could hold the meta information.

As for undo/redo, maybe adding a commit button would make that easier. So 
changes can
be made to a snapshot of the data, but then no changes are made to the DB itself
until commit is pressed?

Original comment by mjlissner on 10 May 2010 at 7:36

GoogleCodeExporter commented 9 years ago
You can either keep it synchronous with the database (effectively using the 
database 
as the backend), but lose undo/redo support and reconciliation with Freebase 
(unless 
you add suitable tables to the database).  You're then using Gridworks for just 
the 
facets really.

The other way, as you suggest, is to make it similar to a disconnected session 
with a 
commit transaction.  The issue is then ensuring consistency between the remote 
database and the snapshot held in Gridworks.  Merging the two back in would be 
an 
issue.  You'd also need to hold keys from the remote database in Gridworks for 
updating records.

Frameworks such as Hibernate or Spring would be worth considering for their 
database 
abstraction layers.

Original comment by iainsproat on 10 May 2010 at 7:46

GoogleCodeExporter commented 9 years ago
I have read that good ORMs such as Hibernate now support ordered lists and 
other 
features now incorporated into JPA 2.0 as of Dec 2009.

Original comment by thadguidry on 10 May 2010 at 8:05

GoogleCodeExporter commented 9 years ago

Original comment by iainsproat on 25 May 2010 at 7:59

GoogleCodeExporter commented 9 years ago

Original comment by iainsproat on 14 Oct 2010 at 9:26

GoogleCodeExporter commented 9 years ago
Initially, what I believe is most important/useful, is simply having the 
ability to direct-connect to MySQL/PostgreSQL/etc. data sources from the 
get-go(create a new project).  Also initially, being able to set/save multiple 
data sources, and multiple dBs within those sources.  When the user creates a 
new project, this would in effect create a disconnected session(as mentioned 
above), wherein the data is treated as is now the data.  Adding 'commit' 
features can come next, followed by more advanced connectivity options, until 
such a point synchronous functionality is in place to one degree or another.  
But for now, it would certainly be nice to add data sources and pull data from 
those sources!

Eric Jarvies

Original comment by 7...@ericjarvies.com on 11 Nov 2010 at 7:08

GoogleCodeExporter commented 9 years ago
Eric,  A cleanup tool using industry best practices is best used offline within 
a process.  There are existing ETL tools that easily consume from 
MySQL/PostgreSQL etc and offer excellent flow control, exporting, and 
connectivity to produce delimited files with relative ease.  Talend is one such 
product that I use along with Google Refine.  Talend (Open Source Community 
edition) does my scheduled daily gathering from 3 databases (MySQL and Oracle) 
and then dumps a customized TSV file that I open with Google Refine for further 
analysis and sometimes clean up.  There are other tools that provide ETL 
(Extract, Transform, Load) like Talend.  I'm not 100% sure if the team really 
feels the need to copy that and flesh out a full ETL platform, since Talend and 
other tools fill that need very nicely.  Incidentally, using Google Refine and 
a bit of clustering, I was able to find a few loop holes in our data storage 
processing that we fixed with a few stored procedures within Oracle.  Google 
Refine was instrumental as a discovery tool for that.  Talend does have an MDM 
component but does not have the interactivity of a discovery tool like Google 
Refine does.  If you do NOT need a daily process, but only one time cleanup, 
just dumping with MySQL or PostgreSQL would offer about the same and depending 
on the size of database takes only secs to minutes.  Dumping can also avoid 
potential live database locks, that if Refine supported might have to tip-toe 
around, depending on the teams' chosen implementation of database connectivity. 
 If you have large database size needs, give Talend or another ETL tool a try 
with Google Refine, and you'll soon see the powerful left-right combination.  
I'm not sure how far the team ultimately decides to absorb direct connectivity 
support within Google Refine.  I'd like to hear other opinions as well on this 
Issue-12.

Original comment by thadguidry on 11 Nov 2010 at 2:36

GoogleCodeExporter commented 9 years ago
I agree with thaguidry. Let the Refine team focus on bringing data quality 
issues to light. Let Talend focus on data Quality (they do have an data 
profiling tool that can identify some of this stuff) Talend is what we use for 
basic ETL. You could write some SQL to get the data out of MySQL anyway. If you 
analyze directly connected to db server for data quality against an entire 
large table your dba might become angry too.

C

Original comment by Chris.Go...@gmail.com on 14 Apr 2011 at 2:23