Closed ghostcow closed 7 years ago
if I interrupt it, I get:
^Cinterrupted!
stack traceback:
[C]: in function 'match'
...ruzan/torch/install/share/lua/5.1/Dataframe/argcheck.lua:23: in function 'istype'
[string "argcheck"]:42: in function 'assert_is_index'
...ll/share/lua/5.1/Dataframe/dataseries/sngl_elmnt_ops.lua:86: in function 'set'
.../install/share/lua/5.1/Dataframe/dataframe/load_data.lua:95: in function 'load_csv'
[string "_RESULT={o:load_csv('../data/train.csv')}"]:1: in main chunk
[C]: in function 'xpcall'
/home/lioruzan/torch/install/share/lua/5.1/trepl/init.lua:652: in function 'repl'
...uzan/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:199: in main chunk
[C]: at 0x00406670
I recreated the problem with Torch compiled against Lua 5.2 as well.
What version are you using? 1.6 shouldn't have any issues with memory. I don't think verbose helps that much as of 1.6, since csvigo returns an object that is parsed later; the done is emitted here: https://github.com/clementfarabet/lua---csv/blob/master/init.lua#L84
It's a little hard to know the cause of this without having access to your dataset. I've used it without issues on largish datasets (don't have the exact size available). I suggest you start with some basic debugging:
Add some print statements to Extensions\load_data.lua in order to see where the error is.
It's just some publicly available data from here: https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-01.csv
about 11M rows.
I managed to load the data easily using csvigo's mode='large' flag, and I'm now trying to parse it myself. So far no luck.
I'm trying to find a quick and dirty way to turn the data into a Tensor, because I'm on a tight schedule.
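For reference, the quick-and-dirty route could look roughly like this (a sketch, not the actual script: the file name, the all-numeric assumption, and the NaN fill-in for bad cells are all illustrative):

```lua
-- Sketch: load a CSV with csvigo in 'large' mode and copy it into a Tensor,
-- assuming the first row is a header and every column is numeric.
local csvigo = require 'csvigo'
local torch = require 'torch'

local m = csvigo.load{path = 'train.csv', mode = 'large'}  -- rows indexed as m[i]
local nrows, ncols = #m - 1, #m[1]                         -- minus the header row
local data = torch.Tensor(nrows, ncols)
for i = 1, nrows do
  local row = m[i + 1]  -- skip header
  for j = 1, ncols do
    data[i][j] = tonumber(row[j]) or 0/0  -- NaN for non-numeric cells
  end
end
```

The single-threaded loop is still O(rows × cols) in Lua, so it mainly avoids per-row argcheck overhead rather than parsing cost.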
What torch-dataframe version are you using? I'm not sure I have the time to properly look into this if you're on a tight schedule right now but it's great that the data is publicly available.
the latest one. 1.6 freshly installed.
I actually hadn't thought of downgrading; perhaps I'll do that first.
I don't think downgrading will help. It must be something strange in the dataset that's triggering this. I've uploaded to the dev-branch a version with true verbose printing. You should be able to pinpoint it better now.
Well, looks like it's working properly. Thanks for the help.
I'm getting about 10,000 rows per 3-5 seconds, so it'll be done in roughly an hour and a half. That's too slow for me right now :\
I'll probably write something hacky with threads just for now, but if it works maybe we can implement it here.
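A hacky threaded split along those lines might look roughly like this (a sketch assuming the torch threads package; this is not the gist itself, and the file name, thread count, and chunking are illustrative):

```lua
-- Sketch: each worker re-parses the CSV with csvigo in 'large' mode and keeps
-- only its own slice of rows; wasteful on I/O but trivially parallel.
local threads = require 'threads'
local nthreads = 4

local pool = threads.Threads(nthreads, function()
  csvigo = require 'csvigo'  -- loaded once per worker
end)

local results = {}
for t = 1, nthreads do
  pool:addjob(
    function()
      local m = csvigo.load{path = 'train.csv', mode = 'large'}
      local n = #m - 1                       -- data rows, minus header
      local chunk = math.ceil(n / nthreads)
      local first = (t - 1) * chunk + 2      -- +1 for header, +1 for 1-indexing
      local last = math.min(first + chunk - 1, n + 1)
      local out = {}
      for i = first, last do out[#out + 1] = m[i] end
      return t, out
    end,
    function(id, rows) results[id] = rows end  -- runs on the main thread
  )
end
pool:synchronize()
pool:terminate()
```

Converting each slice into a Tensor inside the worker, instead of returning raw rows, would keep the serialization cost down.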
Ok, good to know. A threaded solution is certainly welcome - although note that you only need to load the data once since you will save the dataframe into t7-format and then load it from there which should be faster.
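The load-once pattern is just Torch's standard serialization (file names illustrative):

```lua
-- Pay the CSV parsing cost once, then reload the serialized dataframe,
-- which is much faster than re-parsing the CSV.
local torch = require 'torch'
require 'Dataframe'

local df = Dataframe()
df:load_csv{path = 'train.csv'}   -- slow, done once
torch.save('train.t7', df)        -- serialize to Torch's native t7 format

-- In later sessions:
local df2 = torch.load('train.t7')
```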
woohoo, got it down from 1.5 hours to 32 seconds (45 with lua5.2)!
here's the script, very naive but works: https://gist.github.com/ghostcow/caa6c76f574079d610dc2a385ab08343
hope it helps!
Wow, we need to have a look at that implementation. I'm re-opening the issue with a new title.
Just pushed a first try at loading a CSV file with threads. I don't have a powerful enough machine right now. Could you please test it on your machines and tell me if it looks good?
I'm getting an error, not quite sure what's wrong
th> o:load_csv{'trnnov15.csv',verbose=true,nthreads=12}
[string "argcheck"]:6645:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Dataframe.load_csv(self, path[, header][, schema][, separator][, skip][,
verbose][, nthreads])
Loads a CSV file into Dataframe using csvigo as backend
({
self = Dataframe --
path = string -- path to file
[header = boolean] -- if has header on first line [default=true]
[schema = Df_Dict] -- The column schema types with column names as keys
[separator = string] -- separator (one character) [default=,]
[skip = number] -- skip this many lines at start of file [default=0]
[verbose = boolean] -- verbose load [default=false]
[nthreads = number] -- Number of threads to use to read the csv file [default=1]
})
Return value: self
Got: Dataframe, table={ [number]=?, nthreads=number, verbose=boolean }
invalid arguments!
Try this :
o:load_csv{path='trnnov15.csv',verbose=true,nthreads=12}
(I added path=)
I ran this piece of code:
do
tic = torch.tic()
o:load_csv{path='trnnov15.csv',verbose=true,nthreads=12}
print(string.format('loading csv took %.03f hours with %d threads',torch.toc(tic)/3600,12))
end
And got this output:
[INFO] Loading CSV
<csv> parsing file: trnnov15.csv
<csv> parsing done
[INFO] End loading CSV
Estimation number of rows : 6778500
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Start of thread n°3
[INFO] Start of thread n°2
[INFO] Start of thread n°4
[INFO] Start of thread n°1
[INFO] Start of thread n°5
[INFO] Start of thread n°6
[INFO] Start of thread n°7
[INFO] Start of thread n°8
[INFO] Start of thread n°9
[INFO] Start of thread n°10
[INFO] Start of thread n°12
[INFO] Start of thread n°11
<csv> parsing file: trnnov15.csv
<csv> parsing file: trnnov15.csv
<csv> parsing file: trnnov15.csv
<csv> parsing file: trnnov15.csv
<csv> parsing file: trnnov15.csv
<csv> parsing file: trnnov15.csv
<csv> parsing file: trnnov15.csv
<csv> parsing file: trnnov15.csv
<csv> parsing file: trnnov15.csv
<csv> parsing file: trnnov15.csv
<csv> parsing file: trnnov15.csv
<csv> parsing file: trnnov15.csv
<csv> parsing done
[INFO] Start of processing in thread n°1 (size :564875)
<csv> parsing done
[INFO] Start of processing in thread n°3 (size :564875)
<csv> parsing done
[INFO] Start of processing in thread n°8 (size :564875)
<csv> parsing done
[INFO] Start of processing in thread n°5 (size :564875)
<csv> parsing done
[INFO] Start of processing in thread n°9 (size :564875)
<csv> parsing done
[INFO] Start of processing in thread n°6 (size :564875)
<csv> parsing done
[INFO] Start of processing in thread n°4 (size :564875)
<csv> parsing done
[INFO] Start of processing in thread n°10 (size :564875)
<csv> parsing done
[INFO] Start of processing in thread n°12 (size :564875)
<csv> parsing done
[INFO] Start of processing in thread n°2 (size :564875)
<csv> parsing done
[INFO] Start of processing in thread n°11 (size :564875)
<csv> parsing done
[INFO] Start of processing in thread n°7 (size :564875)
done
3645.3790929317
loading csv took 1.013 hours with 12 threads
This is still kind of slow. I don't really understand the structure of the Dataframe; are you using tds vectors/hashes for the data? Could that be the issue?
The data is a list containing either a tds.Vec or a tensor, depending on the data type, combined with a hash indicating where the missing data is. Each column is a Dataseries, and that's where the storage happens.
Erf, so slow indeed :/
Thanks @gforge for the explanation. Here is a print of the dataset var to illustrate what @gforge said:
{
Col C :
{
data : DoubleTensor - size: 4
_variable_type : "double"
missing : tds.Hash[1]{
2 : true
}
}
Col B :
{
data : DoubleTensor - size: 4
_variable_type : "double"
missing : tds.Hash[0]{
}
}
Col D :
{
data : tds.Vec[4]{
1 : nil
2 : B
3 : D
4 : A
}
_variable_type : "string"
missing : tds.Hash[1]{
1 : true
}
}
Col A :
{
data : IntTensor - size: 4
_variable_type : "integer"
missing : tds.Hash[0]{
}
}
}
OK, a comparison of the per-row processing time between your gist and the current implementation gives a speed factor of 100:
Gist script: 0.00003
Dataframe: 0.001
With the bulk_load_csv it took around 5 seconds to load 20 MB of your dataset, so extrapolating, it should take around 7 minutes for the entire dataset (with 4 threads). That's only a supposition, but it looks good.
At the moment, columns with characters (tds.Vec) aren't handled, so it isn't fully functional.
df:bulk_load_csv{path="./yellow_tripdata_2016-01-part.csv",nthreads=4,verbose=true}
Sorry I couldn't help more, I am quite busy these days. I loaded a subset of the whole file, containing 6M rows (1.1 GB), with 12 threads; it took about 43 seconds. Great work, thanks!
No worries, I didn't do much in the past months ^^"
Just added tds.Vec support, didn't see any significant differences. Here are my comparison functions:
function test1()
  df = Dataframe()
  local tic = torch.tic()
  df:load_csv{path="./specs/data/yellow_tripdata_2016-01-part.csv",verbose=true}
  print("Took : "..torch.toc(tic))
end

function test2()
  df = Dataframe()
  local tic = torch.tic()
  df:bulk_load_csv{path="./specs/data/yellow_tripdata_2016-01-part.csv",nthreads=4,verbose=true}
  print("Took : "..torch.toc(tic))
end
Hello, I'm trying to load a 1.5 GB CSV file called 'train.csv'.
Here's what I type:
Then it outputs
and it hangs there.
Help please?