GoogleCodeExporter opened this issue 9 years ago
Hi there,
That is quite a large PSV, and it's possible that alone could be causing
problems - CSV import should batch queries and attempt to keep memory use low,
but that's quite a lot of rows :)
I believe the pipe character should work, and a few simple imports I've just
tested have worked fine.
At what stage do you get the "unexpected error"? What should happen after you
hit the import button on the file selection dialog is that it should read and
parse the file until it has a hundred rows, and then show you the field
mapping/table creation dialog (which is very powerful if not immediately
obvious!). It sounds like it's not even getting that far for you - is that
correct?
It sounds like it might be choking on the file itself. We only support
transparent import of raw text, GZIP-compressed, or BZIP2-compressed files. The
automatic encoding detection could also be running into problems and tripping
up the rest.
What's the exact error message? I can't find anything about an "unexpected
error", I'm afraid.
Original comment by rowanb@gmail.com
on 14 Feb 2013 at 12:50
Hi and thanks for your response.
I realize it's huge and may not work locally, but I don't have access to a
large enough server to handle the request. Please let me know if you think I'm
going about this the wrong way.
Your hunch is correct that the error is occurring immediately. I decompressed
a .tar.gz file which gave me the PSV. Based on your comment of only supporting
transparent import for GZIP, do you think that's worth a shot?
The message did indeed read that processing stopped due to an "unexpected
error".
Original comment by j.capu...@closedloopadvisors.com
on 14 Feb 2013 at 2:35
There shouldn't be any difference between importing the .gz and the
uncompressed file, so if you already have the uncompressed file I don't think
that would improve matters.
I can't find any mention of "Unexpected error" in our codebase, nor in the
encoding detection code. Is that the title or the detail part of the error
message, or is that the whole message? If so, it sounds almost like a
system-level error...
How large is the file, in terms of megs/gigs?
The other thing to try is to grab just the first hundred lines or so (perhaps
using `head -n 100 path/to/file.psv > testfile.psv`) to simplify some of the
issues here :)
Original comment by rowanb@gmail.com
on 14 Feb 2013 at 11:42
I've attached a snapshot of the error I receive, which is a pop-up box.
The file is 5.8GB. I know it's pretty large, but a friend had recommended
attempting to run this on my machine before trying to find server space
somewhere.
I tried to grab the first hundred lines of code as you advised and must've done
something wrong as I received this message: You have an error in your SQL
syntax; check the manual that corresponds to your MySQL server version for the
right syntax to use near 'head -n 100
Macintosh_HD/Users/jdcap427/CLA/Offline_work/etsy_transaction/etsy_t' at line 1
The 100 was highlighted in red font in the query space above.
Thanks!
Original comment by j.capu...@closedloopadvisors.com
on 14 Feb 2013 at 1:43
Attachments:
Ohh... an exception! Sorry, yes, that's an actual hard bug you're running
into, not just an import failure. Could you just add "1621" to that problem
report description box and submit one?
I wonder if some of the file routines we're using are <2GB only - we can
address that.
The `head -n 100 path/to/file.psv > testfile.psv` is a command to use in
Terminal (or another command line util) to generate a small file from the start
of your large one :)
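To make that concrete, here is a minimal Terminal session. The sample file and its contents are made up purely so the commands are runnable end to end; in practice you would point `head` at your real 5.8GB PSV instead:

```shell
# Build a tiny sample pipe-separated file (stand-in for the real PSV).
printf 'id|item|price\n1|brooch|12.50\n2|scarf|8.00\n' > file.psv

# Copy just the first 100 lines (here, the whole sample) into a new file.
head -n 100 file.psv > testfile.psv

# testfile.psv is now a small file you can try importing quickly.
```

Note this runs in the shell itself, not inside a MySQL prompt, which is why pasting it into Sequel Pro's query editor produced an SQL syntax error.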
Original comment by rowanb@gmail.com
on 14 Feb 2013 at 1:56
Ok. Just sent the error report with 1621 in the description box.
I listed the altered command because it didn't work when I ran it and perhaps I
made an error when adding the path or something.
Thanks again for your help on this. Let me know what to do after you receive
the report.
Original comment by j.capu...@closedloopadvisors.com
on 14 Feb 2013 at 2:03
From the error you gave it sounded like you were trying to run the altered
command in a MySQL prompt/custom query, whereas I meant it to be run in a
standard unix/OS X shell, sorry.
It looks like the error occurred because it's trying to automatically generate
CREATE TABLE syntax as a result of the file import - because there's no current
table to import into - and choking on some of the data. I'm about to have
lunch, so I might take a look at this then.
Original comment by rowanb@gmail.com
on 14 Feb 2013 at 2:20
Hi there,
I've committed a fix to that particular issue in r4021. I'm attaching a build
which is the 1.0.1 release but with that change added if you'd like to give it
a go.
When you run the import now, you should see the field mapping sheet pop up
after you've selected the file - as it needs to create a table to import the
data into, note the field at the top right where it's prompting you for a table
name. After that you can start the import - I'm intrigued to know how it'll do
for such a large file...
Original comment by rowanb@gmail.com
on 18 Feb 2013 at 1:13
Attachments:
I'll download the new release and run the job overnight.
Part of this is user error due to my unfamiliarity with your program. I added
a table to my DB and attempted an import beginning Friday at 10:30 pm EST.
It's now Sunday at 10 pm and I just cancelled the job after it churned for 2
days. No errors and the index reflected one record showing the primary key.
Original comment by j.capu...@closedloopadvisors.com
on 18 Feb 2013 at 3:10
What should happen is that the rows are pulled out of the CSV file and sent to
the server in batches, so there's hopefully already quite a lot of rows in the
table? Re-importing the file will generate an error for all the rows that
already exist, so the import will be much slower the second time around unless
you truncate the table first (each key error will make it redo the batch
line-by-line to get the full error, so it'll be ~100x slower for the old rows!).
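For reference, clearing the table before a re-import is a single statement. The table name below is hypothetical; use whatever name you gave the table in the field mapping sheet:

```sql
-- Deletes ALL rows in the table, so a re-import starts fresh
-- instead of tripping over duplicate-key errors on old rows.
TRUNCATE TABLE etsy_transactions;
```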
Keep an eye on the memory use of Sequel Pro as it imports - if it balloons, the
speed will drop precipitously as it goes on, which could explain the two day
import. If so we'll need to do more work this side... the alternative is just
that the server is taking a while to save all the data!
(For a file this large, without any complex per-cell data mapping or changes,
you may also want to investigate the LOAD DATA INFILE command on the server -
http://dev.mysql.com/doc/refman/5.1/en/load-data.html - it's less configurable,
but it avoids network latency and query overhead...)
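As a rough sketch of that command: the table name, path, and header assumption below are all made up, the table must already exist with columns in the same order as the file, and depending on your setup you may need the LOCAL keyword plus local_infile enabled on the server:

```sql
-- '|' is the PSV field separator; adjust the line terminator
-- if the file uses Windows line endings ('\r\n').
LOAD DATA LOCAL INFILE '/path/to/file.psv'
INTO TABLE etsy_transactions
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- skip the header row, if the file has one
```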
Original comment by rowanb@gmail.com
on 18 Feb 2013 at 11:59
As noted in my first response, I'm attempting to run this locally (4 GB of
RAM), based on a friend's suggestion, before renting server space. I'm now
convinced that was bad advice. He seemed to justify the approach... Anyway, it
looks like I need to find server space.
As you might guess, Activity Monitor showed 85% CPU use when I force quit
Sequel Pro last night. Hitting the "cancel" button had no effect even after
more than an hour of waiting.
Once I figure out a server setup, I'll let you know how this works out. Thanks
for the tips on truncating the table and the server command.
Original comment by j.capu...@closedloopadvisors.com
on 18 Feb 2013 at 1:47
It really depends on how much work you'll be doing with the data - will it be
analysis of the entire data set at once, or will you be querying parts of the
data to work on subsets at a time? If you're only working with bits of the
data, and can generate indexes on the appropriate columns you'll be querying
by, 4GB should be fine as it'll only really need to cache the indexes in RAM.
If you're working with the entire data set at once though... :)
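To illustrate the index point with hypothetical column names: if the queries filter on, say, a date column, an index on that column lets the server answer from the (RAM-cached) index rather than scanning the full 5.8GB table:

```sql
-- Hypothetical table/column names. With this index in place,
-- the query below can locate matching rows via the index
-- instead of a full table scan, so 4GB of RAM may be workable.
CREATE INDEX idx_transaction_date
    ON etsy_transactions (transaction_date);

SELECT COUNT(*)
FROM etsy_transactions
WHERE transaction_date BETWEEN '2012-01-01' AND '2012-03-31';
```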
I'll do some more tests with a massive CSV file locally - the 85% could be fine
(running queries!), or it could be bad (out of memory, all churning against
disk).
Original comment by rowanb@gmail.com
on 18 Feb 2013 at 1:59
This is all on the import attempt! I do have access to a machine with 8 GB of
RAM if you think that would work to import this table.
Work involved will include joining columns from another table, followed by
if-then statements that will define new variables, then I'll need to run
summary stats (the latter two steps in R).
Original comment by j.capu...@closedloopadvisors.com
on 18 Feb 2013 at 2:47
Original comment by stuart02
on 25 Oct 2013 at 11:49
Original issue reported on code.google.com by
j.capu...@closedloopadvisors.com
on 13 Feb 2013 at 9:52