AlexMili / torch-dataframe

Utility class to manipulate datasets from CSV files
MIT License
67 stars · 8 forks

Dataframe slow when loading large csv file #28

Closed ghostcow closed 7 years ago

ghostcow commented 8 years ago

Hello, I'm trying to load a 1.5GB csv file called 'train.csv'.

Here's what I type:

o = Dataframe()
o:load_csv{'../data/train.csv',verbose=true}

Then it outputs

<csv>   parsing file: ../data/train.csv 
<csv>   parsing done    

and it hangs there.

Help please?

ghostcow commented 8 years ago

if I interrupt it, I get:

^Cinterrupted!
stack traceback:
    [C]: in function 'match'
    ...ruzan/torch/install/share/lua/5.1/Dataframe/argcheck.lua:23: in function 'istype'
    [string "argcheck"]:42: in function 'assert_is_index'
    ...ll/share/lua/5.1/Dataframe/dataseries/sngl_elmnt_ops.lua:86: in function 'set'
    .../install/share/lua/5.1/Dataframe/dataframe/load_data.lua:95: in function 'load_csv'
    [string "_RESULT={o:load_csv('../data/train.csv')}"]:1: in main chunk
    [C]: in function 'xpcall'
    /home/lioruzan/torch/install/share/lua/5.1/trepl/init.lua:652: in function 'repl'
    ...uzan/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:199: in main chunk
    [C]: at 0x00406670  
ghostcow commented 8 years ago

I recreated the problem with torch compiled against Lua 5.2 as well.

gforge commented 8 years ago

What version are you using? 1.6 shouldn't have any issues with memory. I don't think verbose helps much as of 1.6, since csvigo returns an object that is parsed later; the "parsing done" message is emitted here: https://github.com/clementfarabet/lua---csv/blob/master/init.lua#L84

It's a little hard to know the cause of this without access to your dataset. I've used it without issues on largish datasets (I don't have the exact sizes available). I suggest you start with some basic debugging.

ghostcow commented 8 years ago

It's just some publicly available data from here: https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-01.csv

About 11M rows.

I managed to load the data easily using csvigo's mode='large' flag, and am now trying to parse it myself. So far no luck.

I'm trying to find a quick and dirty way to turn the data into a Tensor, because I'm on a tight schedule.
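For reference, a rough sketch of that quick-and-dirty route. This assumes csvigo's mode='large' returns the parsed rows (header first) as indexable tables; the column indices below are hypothetical and would depend on the actual file layout:

```lua
local csvigo = require 'csvigo'
local torch = require 'torch'

-- mode='large' skips csvigo's column-wise post-processing and
-- returns a flat list of rows, which is much cheaper for big files
local rows = csvigo.load{path = '../data/train.csv', mode = 'large'}

-- rows[1] is the header; copy selected numeric columns into a Tensor
local cols = {4, 5}  -- hypothetical: indices of the numeric columns we want
local n = #rows - 1
local data = torch.DoubleTensor(n, #cols)
for i = 2, #rows do
  for j, c in ipairs(cols) do
    -- crude missing-value handling: non-numeric cells become 0
    data[i - 1][j] = tonumber(rows[i][c]) or 0
  end
end
```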

gforge commented 8 years ago

What torch-dataframe version are you using? I'm not sure I have the time to properly look into this if you're on a tight schedule right now but it's great that the data is publicly available.

ghostcow commented 8 years ago

The latest one, 1.6, freshly installed.

I actually hadn't thought of downgrading; perhaps I'll try that first.

gforge commented 8 years ago

I don't think downgrading will help. It must be something strange in the dataset that's triggering this. I've uploaded to the dev-branch a version with true verbose printing. You should be able to pinpoint it better now.

ghostcow commented 8 years ago

Well, looks like it's working properly. Thanks for the help.

I'm getting about 10,000 rows per 3-5 seconds, so it'll be done in roughly an hour and a half. That's too slow for me right now :\

I'll probably write something hacky with threads just for now, but if it works maybe we can implement it here.
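For what it's worth, a rough sketch of what such a hacky threaded loader could look like, assuming the torch threads package and csvigo. Everything here is illustrative (file path, thread count, the decision to have each worker re-parse the file and keep only its slice); it is not the actual gist:

```lua
local threads = require 'threads'

local nthreads = 4
local path = '../data/train.csv'

-- each worker gets its own csvigo instance via the init function
local pool = threads.Threads(nthreads, function()
  csvigo = require 'csvigo'
end)

local results = {}
for t = 1, nthreads do
  pool:addjob(
    function()
      -- each thread re-reads the file and keeps only its share of rows
      local rows = csvigo.load{path = path, mode = 'large'}
      local n = #rows - 1                      -- minus header row
      local chunk = math.ceil(n / nthreads)
      local first = 2 + (__threadid - 1) * chunk
      local last = math.min(first + chunk - 1, #rows)
      local out = {}
      for i = first, last do out[#out + 1] = rows[i] end
      return __threadid, out
    end,
    -- end-callback runs on the main thread, so writing to results is safe
    function(id, out) results[id] = out end
  )
end
pool:synchronize()
pool:terminate()
```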

gforge commented 8 years ago

OK, good to know. A threaded solution is certainly welcome, although note that you only need to load the data once: you can save the dataframe in t7 format and then load it from there, which should be much faster.
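A minimal sketch of that load-once workflow, using torch's standard serialization (the file names are just examples):

```lua
local torch = require 'torch'
local Dataframe = require 'Dataframe'

-- one-time: parse the CSV and serialize the resulting object
local df = Dataframe()
df:load_csv{path = '../data/train.csv'}
torch.save('train.t7', df)

-- every subsequent run: deserialize instead of re-parsing the CSV
local df2 = torch.load('train.t7')
```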

ghostcow commented 8 years ago

Woohoo, got it down from 1.5 hours to 32 seconds (45 with Lua 5.2)!

Here's the script; very naive, but it works: https://gist.github.com/ghostcow/caa6c76f574079d610dc2a385ab08343

Hope it helps!

gforge commented 8 years ago

Wow, we need to have a look at that implementation. I'm re-opening the issue with a new title.

AlexMili commented 7 years ago

I just pushed a first attempt at loading a CSV file using threads. I don't have a powerful enough machine right now; could you please test it on your machines and tell me if it looks good?

ghostcow commented 7 years ago

I'm getting an error; not quite sure what's wrong:

th> o:load_csv{'trnnov15.csv',verbose=true,nthreads=12}
[string "argcheck"]:6645: 
  ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  Dataframe.load_csv(self, path[, header][, schema][, separator][, skip][,
  verbose][, nthreads])

   Loads a CSV file into Dataframe using csvigo as backend

   ({
      self      = Dataframe  -- 
      path      = string     -- path to file
     [header    = boolean]   -- if has header on first line [default=true]
     [schema    = Df_Dict]   -- The column schema types with column names as keys
     [separator = string]    -- separator (one character) [default=,]
     [skip      = number]    -- skip this many lines at start of file [default=0]
     [verbose   = boolean]   -- verbose load [default=false]
     [nthreads  = number]    -- Number of threads to use to read the csv file [default=1]
   })

   Return value: self

   Got: Dataframe, table={ [number]=?, nthreads=number, verbose=boolean }
invalid arguments!
AlexMili commented 7 years ago

Try this:

o:load_csv{path='trnnov15.csv',verbose=true,nthreads=12}

(I added path=)

ghostcow commented 7 years ago

I ran this piece of code:

do
   tic = torch.tic()
   o:load_csv{path='trnnov15.csv',verbose=true,nthreads=12} 
   print(string.format('loading csv took %.03f hours with %d threads',torch.toc(tic)/3600,12))
end

And got this output.

[INFO] Loading CSV
<csv>   parsing file: trnnov15.csv
<csv>   parsing done
[INFO] End loading CSV
Estimation number of rows : 6778500
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Start of thread n°3
[INFO] Start of thread n°2
[INFO] Start of thread n°4
[INFO] Start of thread n°1
[INFO] Start of thread n°5
[INFO] Start of thread n°6
[INFO] Start of thread n°7
[INFO] Start of thread n°8
[INFO] Start of thread n°9
[INFO] Start of thread n°10
[INFO] Start of thread n°12
[INFO] Start of thread n°11
<csv>   parsing file: trnnov15.csv
<csv>   parsing file: trnnov15.csv
<csv>   parsing file: trnnov15.csv
<csv>   parsing file: trnnov15.csv
<csv>   parsing file: trnnov15.csv
<csv>   parsing file: trnnov15.csv
<csv>   parsing file: trnnov15.csv
<csv>   parsing file: trnnov15.csv
<csv>   parsing file: trnnov15.csv
<csv>   parsing file: trnnov15.csv
<csv>   parsing file: trnnov15.csv
<csv>   parsing file: trnnov15.csv
<csv>   parsing done
[INFO] Start of processing in thread n°1 (size :564875)
<csv>   parsing done
[INFO] Start of processing in thread n°3 (size :564875)
<csv>   parsing done
[INFO] Start of processing in thread n°8 (size :564875)
<csv>   parsing done
[INFO] Start of processing in thread n°5 (size :564875)
<csv>   parsing done
[INFO] Start of processing in thread n°9 (size :564875)
<csv>   parsing done
[INFO] Start of processing in thread n°6 (size :564875)
<csv>   parsing done
[INFO] Start of processing in thread n°4 (size :564875)
<csv>   parsing done
[INFO] Start of processing in thread n°10 (size :564875) 
<csv>   parsing done
[INFO] Start of processing in thread n°12 (size :564875) 
<csv>   parsing done
[INFO] Start of processing in thread n°2 (size :564875)  
<csv>   parsing done
[INFO] Start of processing in thread n°11 (size :564875) 
<csv>   parsing done
[INFO] Start of processing in thread n°7 (size :564875)  
done
3645.3790929317
loading csv took 1.013 hours with 12 threads
ghostcow commented 7 years ago

This is still kind of slow. I don't really understand the structure of the Dataframe: are you using tds vectors/hashes for the data? Could that be the issue?

gforge commented 7 years ago

The data is a list of columns, each stored as a tds.Vec or a torch Tensor depending on the data type, combined with a hash indicating where the missing data is. Each column is a Dataseries, and that's where the storage happens.

AlexMili commented 7 years ago

Erf, so slow indeed :/

Thanks @gforge for the explanation. Here is a print of the dataset variable to illustrate what @gforge said:

{
  Col C :
    {
      data : DoubleTensor - size: 4
      _variable_type : "double"
      missing : tds.Hash[1]{
        2 : true
      }
    }
  Col B :
    {
      data : DoubleTensor - size: 4
      _variable_type : "double"
      missing : tds.Hash[0]{
      }
    }
  Col D :
    {
      data : tds.Vec[4]{
        1 : nil
        2 : B
        3 : D
        4 : A
      }
      _variable_type : "string"
      missing : tds.Hash[1]{
        1 : true
      }
    }
  Col A :
    {
      data : IntTensor - size: 4
      _variable_type : "integer"
      missing : tds.Hash[0]{
      }
    }
}
AlexMili commented 7 years ago

OK, a comparison of the per-row processing time between your gist and the current implementation gave me a speed factor of 100:

Gist script: 0.00003 s/row
Dataframe: 0.001 s/row
AlexMili commented 7 years ago

With bulk_load_csv it took around 5 seconds to load 20 MB of your dataset. Extrapolating, it should take around 7 minutes for the entire dataset (with 4 threads). That is only an estimate, but it looks promising.

At the moment, string columns (tds.Vec) aren't handled, so it is not fully functional.

df:bulk_load_csv{path="./yellow_tripdata_2016-01-part.csv",nthreads=4,verbose=true}
ghostcow commented 7 years ago

Sorry I couldn't help more, I've been quite busy these days. I loaded a subset of the whole file, containing 6M rows (1.1 GB), with 12 threads; it took about 43 seconds. Great work, thanks!

AlexMili commented 7 years ago

No worries, I didn't do much in the past months either ^^"

Just added tds.Vec support; I didn't see any significant difference. Here are my comparison functions:

function test1()
   df = Dataframe()
   local tic = torch.tic()
   df:load_csv{path="./specs/data/yellow_tripdata_2016-01-part.csv", verbose=true}
   print("Took : "..torch.toc(tic))
end

function test2()
   df = Dataframe()
   local tic = torch.tic()
   df:bulk_load_csv{path="./specs/data/yellow_tripdata_2016-01-part.csv", nthreads=4, verbose=true}
   print("Took : "..torch.toc(tic))
end