fsprojects / FSharp.Data

F# Data: Library for Data Access
https://fsprojects.github.io/FSharp.Data

Performance issue with CSV typeprovider #547

Closed carsten-j closed 10 years ago

carsten-j commented 10 years ago

When I read the attached CSV file, which contains 785 columns and 113 rows (including the header row), the following two lines of code execute really slowly:

type trainingSet = CsvProvider<"Data/trainSmall.csv", ",", CacheRows=false>
let data = trainingSet.Load("Data/trainSmall.csv")

When I send the first line to F# Interactive, it returns in about 10 seconds, whereas when I send the second line, it takes more than 5 minutes before the interactive prompt replies.

I am running the code on my 2013 MacBook Pro with a 2.6 GHz i5 processor and 16 GB of RAM, using F# 3.0 and Xamarin Studio. I have tried the same experiment with Windows 7 / VS2013 running in a VM on the same hardware. The results are comparable. When I use the same machine and do the exact same thing with R, it is so fast that I cannot time it with an ordinary watch.

https://dl.dropboxusercontent.com/u/13678102/Script.fsx https://dl.dropboxusercontent.com/u/13678102/trainSmall.csv

veikkoeeva commented 10 years ago

As an added note, compiling and running the following is quick:

open FSharp.Data

[<EntryPoint>]
let main argv = 
    let data = CsvFile.Load("trainSmall.csv")

    for row in data.Rows do
        printfn "%s, %s" (row.GetColumn "pixel99") (row.GetColumn "pixel783")

    0

whereas the following takes quite some time to compile, but runs quickly:

open FSharp.Data

type trainingSet = CsvProvider<"C:/projektit/FsharpDataPerformance/FsharpDataPerformance/trainSmall.csv", ",", CacheRows=false>

[<EntryPoint>]
let main argv = 
    let data = trainingSet.Load("trainSmall.csv")

    for row in data.Rows do        
        printfn "%i, %i" row.pixel99 row.pixel783

    0

Maybe something to do with reflection. If I have time, I'll try to profile this.

veikkoeeva commented 10 years ago

Some quick screen captures of the slow cases. The fast case of CsvFile.Load was rather uneventful, as expected. [images: profiling_result1, profiling_result2]

ovatsus commented 10 years ago

It's expected that CsvProvider takes longer to compile, as it reads the CSV values and infers the column types, while CsvFile is untyped. But it should be just a bit slower, not by this much.

ovatsus commented 10 years ago

Ok, this is a pathological case for CsvProvider. It only has 114 rows, but it has 785 columns! This means we will have a tuple with 785 elements. Most CSVs don't have this many columns, and honestly you won't get much value out of using CsvProvider with this file, as it is basically a matrix serialized in CSV format, where all the types are the same. In any case we can probably optimize this; it's taking a lot of time generating the types, which is uncommon. Compare this CSV: [image] with the measurements made in #514: [image]

ovatsus commented 10 years ago

Reduced from 17s to 12s

ovatsus commented 10 years ago

Down to 7s

ovatsus commented 10 years ago

I improved the time it takes to do the first line (the type declaration). As for the second one, I recommend using CsvFile for this case instead.
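
For anyone landing here with a similar matrix-shaped file, here is a rough sketch of what the CsvFile route could look like. It is not taken from the library docs, just one way to do it: the inline `sample` string is a made-up three-column stand-in for trainSmall.csv, and it assumes every cell parses as an integer, as in the file from this issue. In a real script you would replace `CsvFile.Parse(sample)` with `CsvFile.Load("trainSmall.csv")`.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical stand-in for trainSmall.csv: a header row plus two data rows.
// The real file has 785 columns; three are enough to show the idea.
let sample = "pixel0,pixel1,pixel2\n0,255,7\n12,0,3"

// CsvFile is untyped: no type generation, every cell arrives as a string.
let csv = CsvFile.Parse(sample)

// Assumption: all cells are integers, so each row becomes an int array.
let matrix =
    csv.Rows
    |> Seq.map (fun row -> row.Columns |> Array.map int)
    |> Array.ofSeq

printfn "%d rows x %d columns" matrix.Length matrix.[0].Length
```

Since all columns share one type, indexing `matrix.[r].[c]` loses little compared with the generated `row.pixel99` properties, and it skips the cost of generating a 785-element row type.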