dlang / project-ideas

Collection of impactful projects in the D ecosystem
Tabular data container (data frames) #15

Open burner opened 5 years ago

burner commented 5 years ago

Pandas, R and Julia have made data frames very popular. As D is attracting more interest from data scientists (e.g. at eBay or AdRoll), it would be very beneficial to use one language for the entire data analysis pipeline, especially considering that D, in contrast to popular languages like Python, R or Julia, is compiled to native machine code and can be optimized by the sophisticated LLVM backend (via the LDC compiler).

Minimum requirements:

burner commented 5 years ago

This is being worked on by Prateek Nayak during GSoC 2019.

wilzbach commented 5 years ago

CC @Kriyszig

Kriyszig commented 5 years ago

Yes, I will be working on this project.

So far I have contacted the mentors and am exploring ndslice in mir-algorithm, while also looking into displaying the dataframe on the terminal with properly aligned columns. I'm a bit tight on time until this weekend because of final examinations, but after that I'll be working at full capacity on the project. We still need to discuss the structure of the index used to represent multi-indexed dataframes, after which I'll move on to parsing CSV files into dataframes. At that point the dataframes will support adding multi-indexed data, parsing from files and writing to CSV. Next I will deal with element access and column binary ops.

I'm mainly looking at Pandas and its implementation of dataframes, because I have worked quite extensively with Python in the past. I'll update the issue with any and all progress made on the dataframe project.
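
For illustration, the first two steps mentioned above (reading CSV and printing aligned columns on the terminal) could be sketched roughly as follows. This is a minimal sketch using only Phobos, not the project's actual code: `parseCsv` and `formatAligned` are hypothetical helper names, every cell is kept as a string, and column widths are counted in code points, so wide or combining characters would not line up.

```d
import std.algorithm.comparison : max;
import std.array : array, join;
import std.csv : csvReader;
import std.range : walkLength;
import std.stdio : writeln;
import std.string : leftJustify, stripRight;

/// Hypothetical helper: parse CSV text into rows of string cells.
string[][] parseCsv(string text)
{
    string[][] rows;
    // Each record is an input range of cells; copy it into an array
    // before moving on to the next record.
    foreach (record; csvReader!string(text))
        rows ~= record.array;
    return rows;
}

/// Hypothetical helper: pad every cell to the width of its column.
string formatAligned(string[][] rows)
{
    // Column width = longest cell (in code points) found in that column.
    size_t[] widths;
    foreach (row; rows)
    {
        if (row.length > widths.length)
            widths.length = row.length;
        foreach (i, cell; row)
            widths[i] = max(widths[i], cell.walkLength);
    }

    string[] lines;
    foreach (row; rows)
    {
        string[] cells;
        foreach (i, cell; row)
            cells ~= leftJustify(cell, widths[i]);
        // Two spaces between columns; drop padding at the end of the line.
        lines ~= cells.join("  ").stripRight;
    }
    return lines.join("\n");
}

void main()
{
    // Made-up sample data, only to exercise the two helpers.
    auto rows = parseCsv("date,income,outcome\n2019-01-24,1200.5,300\n2019-02-02,80,1150.25");
    writeln(formatAligned(rows));
}
```

A real dataframe would convert cells to typed columns and keep the first row as column labels; the sketch only shows the parsing and alignment mechanics.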

Laeeth commented 5 years ago

Interop with pandas via JSON and msgpack might be quite helpful. I have written a streaming msgpack decoder (using msgpack-d) to work with our own simple data frame implementation, and there is some old code for reading and writing HDF5 too.
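
As a rough idea of what the JSON side of such interop might look like, the sketch below uses `std.json` to encode row labels, column labels and values in the layout pandas calls `orient="split"` (keys `index`, `columns`, `data`), which `pandas.read_json(..., orient="split")` can read. The helper name `toSplitJson` and the plain-array inputs are assumptions made for illustration; this is not the msgpack decoder or the data frame implementation mentioned above.

```d
import std.json : JSONValue;
import std.stdio : writeln;

/// Hypothetical helper: build a JSON object with the keys used by
/// pandas' "split" orientation ("index", "columns", "data").
JSONValue toSplitJson(string[] index, string[] columns, double[][] data)
{
    // JSONValue converts arrays (including nested ones) element by element.
    return JSONValue([
        "index": JSONValue(index),
        "columns": JSONValue(columns),
        "data": JSONValue(data),
    ]);
}

void main()
{
    // Made-up sample values, only to show the shape of the output.
    auto j = toSplitJson(
        ["2019-01-24", "2019-02-02"],
        ["income", "outcome"],
        [[1200.5, 300.0], [80.0, 1150.25]]);
    writeln(j.toString());
}
```

On the Python side the resulting text could then be loaded with `pandas.read_json` using `orient="split"`. For large frames a streaming or binary format such as msgpack would likely be the better fit.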

9il commented 5 years ago

Initial dataframe support has been added to mir-algorithm. Only allocation and label access are supported for now.

@safe pure unittest
{
    import mir.ndslice.slice;
    import mir.ndslice.allocation: slice;

    import std.datetime.date;

    auto dataframe = slice!(double, Date, string)(4, 3);
    assert(dataframe.length == 4);
    assert(dataframe.length!1 == 3);
    assert(dataframe.elementCount == 4 * 3);

    static assert(is(typeof(dataframe) ==
        Slice!(double*, 2, Contiguous, Date*, string*)));

    // Dataframe labels are contiguous 1-dimensional slices.

    // Fill row labels
    dataframe.label[] = [
        Date(2019, 1, 24),
        Date(2019, 2, 2),
        Date(2019, 2, 4),
        Date(2019, 2, 5),
    ];

    assert(dataframe.label!0[2] == Date(2019, 2, 4));

    // Fill column labels
    dataframe.label!1[] = ["income", "outcome", "balance"];

    assert(dataframe.label!1[2] == "balance");

    // Change label element
    dataframe.label!1[2] = "total";
    assert(dataframe.label!1[2] == "total");

    // Attach a newly allocated label
    dataframe.label!1 = ["Income", "Outcome", "Balance"].sliced;

    assert(dataframe.label!1[2] == "Balance");
}