AlexMili / torch-dataframe

Utility class to manipulate datasets from CSV files
MIT License
67 stars · 8 forks

Dataframe slow when loading large csv file #28

Closed ghostcow closed 7 years ago

ghostcow commented 8 years ago

Hello, I'm trying to load a 1.5GB csv file called 'train.csv'.

Here's what I type:

o = Dataframe()
o:load_csv{'../data/train.csv',verbose=true}

Then it outputs

<csv>   parsing file: ../data/train.csv 
<csv>   parsing done    

and it hangs there.

Help please?

ghostcow commented 8 years ago

if I interrupt it, I get:

^Cinterrupted!
stack traceback:
    [C]: in function 'match'
    ...ruzan/torch/install/share/lua/5.1/Dataframe/argcheck.lua:23: in function 'istype'
    [string "argcheck"]:42: in function 'assert_is_index'
    ...ll/share/lua/5.1/Dataframe/dataseries/sngl_elmnt_ops.lua:86: in function 'set'
    .../install/share/lua/5.1/Dataframe/dataframe/load_data.lua:95: in function 'load_csv'
    [string "_RESULT={o:load_csv('../data/train.csv')}"]:1: in main chunk
    [C]: in function 'xpcall'
    /home/lioruzan/torch/install/share/lua/5.1/trepl/init.lua:652: in function 'repl'
    ...uzan/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:199: in main chunk
    [C]: at 0x00406670  
ghostcow commented 8 years ago

I recreated the problem with torch compiled against Lua 5.2 as well.

gforge commented 8 years ago

What version are you using? 1.6 shouldn't have any issues with memory. I don't think verbose helps much as of 1.6, since csvigo returns an object that is parsed later; the "parsing done" message is emitted here: https://github.com/clementfarabet/lua---csv/blob/master/init.lua#L84

It's a little hard to know the cause of this without access to your dataset. I've used it without issues on largish datasets (I don't have the exact sizes available). I suggest you start with some basic debugging.

ghostcow commented 8 years ago

It's just some publicly available data from here: https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-01.csv

About 11M rows.

I managed to load the data easily using csvigo's mode='large' flag, and am now trying to parse it myself. So far no luck.

I'm trying to find a quick and dirty way to turn the data into a Tensor, because I'm on a tight schedule.
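For reference, a rough sketch of that quick-and-dirty route. This assumes csvigo's mode='large' returns the parsed rows (header first) as indexable tables; the column indices below are hypothetical and would depend on the actual file layout:

```lua
local csvigo = require 'csvigo'
local torch = require 'torch'

-- mode='large' skips csvigo's column-wise post-processing and
-- returns a flat list of rows, which is much cheaper for big files
local rows = csvigo.load{path = '../data/train.csv', mode = 'large'}

-- rows[1] is the header; copy selected numeric columns into a Tensor
local cols = {4, 5}  -- hypothetical: indices of the numeric columns we want
local n = #rows - 1
local data = torch.DoubleTensor(n, #cols)
for i = 2, #rows do
  for j, c in ipairs(cols) do
    -- crude missing-value handling: non-numeric cells become 0
    data[i - 1][j] = tonumber(rows[i][c]) or 0
  end
end
```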

gforge commented 8 years ago

What torch-dataframe version are you using? I'm not sure I have the time to properly look into this if you're on a tight schedule right now but it's great that the data is publicly available.

ghostcow commented 8 years ago

The latest one, 1.6, freshly installed.

I actually hadn't thought of downgrading; perhaps I'll try that first.

gforge commented 8 years ago

I don't think downgrading will help. It must be something strange in the dataset that's triggering this. I've uploaded to the dev-branch a version with true verbose printing. You should be able to pinpoint it better now.

ghostcow commented 8 years ago

Well, looks like it's working properly. Thanks for the help.

I'm getting about 10,000 rows per 3-5 seconds, so it'll be done in roughly an hour and a half. That's too slow for me right now :\

I'll probably write something hacky with threads just for now, but if it works maybe we can implement it here.
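For what it's worth, a rough sketch of what such a hacky threaded loader could look like, assuming the torch threads package and csvigo. Everything here is illustrative (file path, thread count, the decision to have each worker re-parse the file and keep only its slice); it is not the actual gist:

```lua
local threads = require 'threads'

local nthreads = 4
local path = '../data/train.csv'

-- each worker gets its own csvigo instance via the init function
local pool = threads.Threads(nthreads, function()
  csvigo = require 'csvigo'
end)

local results = {}
for t = 1, nthreads do
  pool:addjob(
    function()
      -- each thread re-reads the file and keeps only its share of rows
      local rows = csvigo.load{path = path, mode = 'large'}
      local n = #rows - 1                      -- minus header row
      local chunk = math.ceil(n / nthreads)
      local first = 2 + (__threadid - 1) * chunk
      local last = math.min(first + chunk - 1, #rows)
      local out = {}
      for i = first, last do out[#out + 1] = rows[i] end
      return __threadid, out
    end,
    -- end-callback runs on the main thread, so writing to results is safe
    function(id, out) results[id] = out end
  )
end
pool:synchronize()
pool:terminate()
```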

gforge commented 8 years ago

OK, good to know. A threaded solution is certainly welcome, although note that you only need to load the data once: you can save the dataframe in t7 format and then load it from there, which should be much faster.
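A minimal sketch of that load-once workflow, using torch's standard serialization (the file names are just examples):

```lua
local torch = require 'torch'
local Dataframe = require 'Dataframe'

-- one-time: parse the CSV and serialize the resulting object
local df = Dataframe()
df:load_csv{path = '../data/train.csv'}
torch.save('train.t7', df)

-- every subsequent run: deserialize instead of re-parsing the CSV
local df2 = torch.load('train.t7')
```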

ghostcow commented 8 years ago

Woohoo, got it down from 1.5 hours to 32 seconds (45 with Lua 5.2)!

Here's the script; very naive, but it works: https://gist.github.com/ghostcow/caa6c76f574079d610dc2a385ab08343

Hope it helps!

gforge commented 8 years ago

Wow, we need to have a look at that implementation. I'm re-opening the issue with a new title.

AlexMili commented 7 years ago

I just pushed a first attempt at loading a CSV file using threads. I don't have a powerful enough machine right now; could you please test it on your machines and tell me if it looks good?

ghostcow commented 7 years ago

I'm getting an error; not quite sure what's wrong:

th> o:load_csv{'trnnov15.csv',verbose=true,nthreads=12}
[string "argcheck"]:6645: 
  ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  Dataframe.load_csv(self, path[, header][, schema][, separator][, skip][,
  verbose][, nthreads])

   Loads a CSV file into Dataframe using csvigo as backend

   ({
      self      = Dataframe  -- 
      path      = string     -- path to file
     [header    = boolean]   -- if has header on first line [default=true]
     [schema    = Df_Dict]   -- The column schema types with column names as keys
     [separator = string]    -- separator (one character) [default=,]
     [skip      = number]    -- skip this many lines at start of file [default=0]
     [verbose   = boolean]   -- verbose load [default=false]
     [nthreads  = number]    -- Number of threads to use to read the csv file [default=1]
   })

   Return value: self

   Got: Dataframe, table={ [number]=?, nthreads=number, verbose=boolean }
invalid arguments!
AlexMili commented 7 years ago

Try this:

o:load_csv{path='trnnov15.csv',verbose=true,nthreads=12}

(I added path=)

ghostcow commented 7 years ago

I ran this piece of code:

do
   tic = torch.tic()
   o:load_csv{path='trnnov15.csv',verbose=true,nthreads=12} 
   print(string.format('loading csv took %.03f hours with %d threads',torch.toc(tic)/3600,12))
end

And got this output.

[INFO] Loading CSV
<csv>   parsing file: trnnov15.csv
<csv>   parsing done
[INFO] End loading CSV
Estimation number of rows : 6778500
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Starting preprocessing
[INFO] Start of thread n°3
[INFO] Start of thread n°2
[INFO] Start of thread n°4
[INFO] Start of thread n°1
[INFO] Start of thread n°5
[INFO] Start of thread n°6
[INFO] Start of thread n°7
[INFO] Start of thread n°8
[INFO] Start of thread n°9
[INFO] Start of thread n°10
[INFO] Start of thread n°12
[INFO] Start of thread n°11
<csv>   parsing file: trnnov15.csv
<csv>   parsing file: trnnov15.csv
<csv>   parsing file: trnnov15.csv
<csv>   parsing file: trnnov15.csv
<csv>   parsing file: trnnov15.csv
<csv>   parsing file: trnnov15.csv
<csv>   parsing file: trnnov15.csv
<csv>   parsing file: trnnov15.csv
<csv>   parsing file: trnnov15.csv
<csv>   parsing file: trnnov15.csv
<csv>   parsing file: trnnov15.csv
<csv>   parsing file: trnnov15.csv
<csv>   parsing done
[INFO] Start of processing in thread n°1 (size :564875)
<csv>   parsing done
[INFO] Start of processing in thread n°3 (size :564875)
<csv>   parsing done
[INFO] Start of processing in thread n°8 (size :564875)
<csv>   parsing done
[INFO] Start of processing in thread n°5 (size :564875)
<csv>   parsing done
[INFO] Start of processing in thread n°9 (size :564875)
<csv>   parsing done
[INFO] Start of processing in thread n°6 (size :564875)
<csv>   parsing done
[INFO] Start of processing in thread n°4 (size :564875)
<csv>   parsing done
[INFO] Start of processing in thread n°10 (size :564875) 
<csv>   parsing done
[INFO] Start of processing in thread n°12 (size :564875) 
<csv>   parsing done
[INFO] Start of processing in thread n°2 (size :564875)  
<csv>   parsing done
[INFO] Start of processing in thread n°11 (size :564875) 
<csv>   parsing done
[INFO] Start of processing in thread n°7 (size :564875)  
done
3645.3790929317
loading csv took 1.013 hours with 12 threads
ghostcow commented 7 years ago

This is still kind of slow. I don't really understand the structure of the Dataframe: are you using tds vectors/hashes for the data? Could that be the issue?

gforge commented 7 years ago

The data is a list of columns, each stored as a tds.Vec or a torch Tensor depending on the data type, combined with a hash indicating where the missing data is. Each column is a Dataseries, and that's where the storage happens.

AlexMili commented 7 years ago

Erf, so slow indeed :/

Thanks @gforge for the explanation. Here is a print of the dataset variable to illustrate what @gforge said:

{
  Col C :
    {
      data : DoubleTensor - size: 4
      _variable_type : "double"
      missing : tds.Hash[1]{
        2 : true
      }
    }
  Col B :
    {
      data : DoubleTensor - size: 4
      _variable_type : "double"
      missing : tds.Hash[0]{
      }
    }
  Col D :
    {
      data : tds.Vec[4]{
        1 : nil
        2 : B
        3 : D
        4 : A
      }
      _variable_type : "string"
      missing : tds.Hash[1]{
        1 : true
      }
    }
  Col A :
    {
      data : IntTensor - size: 4
      _variable_type : "integer"
      missing : tds.Hash[0]{
      }
    }
}
AlexMili commented 7 years ago

OK, a comparison of the per-row processing time between your gist and the current implementation gave me a speed factor of 100:

Gist script: 0.00003 s/row
Dataframe: 0.001 s/row
AlexMili commented 7 years ago

With bulk_load_csv it took around 5 seconds to load 20 MB of your dataset. Extrapolating, it should take around 7 minutes for the entire dataset (with 4 threads). That is only an estimate, but it looks promising.

At the moment, string columns (tds.Vec) aren't handled, so it is not fully functional.

df:bulk_load_csv{path="./yellow_tripdata_2016-01-part.csv",nthreads=4,verbose=true}
ghostcow commented 7 years ago

Sorry I couldn't help more, I've been quite busy these days. I loaded a subset of the whole file, containing 6M rows (1.1 GB), with 12 threads; it took about 43 seconds. Great work, thanks!

AlexMili commented 7 years ago

No worries, I didn't do much in the past months either ^^"

Just added tds.Vec support; I didn't see any significant difference. Here are my comparison functions:

function test1()
   df = Dataframe()
   local tic = torch.tic()
   df:load_csv{path="./specs/data/yellow_tripdata_2016-01-part.csv", verbose=true}
   print("Took : "..torch.toc(tic))
end

function test2()
   df = Dataframe()
   local tic = torch.tic()
   df:bulk_load_csv{path="./specs/data/yellow_tripdata_2016-01-part.csv", nthreads=4, verbose=true}
   print("Took : "..torch.toc(tic))
end