LineaLabs / lineapy

Move fast from data science prototype to pipeline. Capture, analyze, and transform messy notebooks into data pipelines with just two lines of code.
https://lineapy.org
Apache License 2.0
664 stars 58 forks

Add dask module annotation #844

Closed: lazargugleta closed this 1 year ago

lazargugleta commented 1 year ago

Description

The Dask dataframe module mirrors the pandas DataFrame classes and methods, extended for large-scale parallel computation and shuffling. Some module-specific methods required annotation, including file-system methods carried over from pandas such as read_csv.

Type of change

How Has This Been Tested?

Before adding annotation:

import dask.dataframe as dd
df = dd.read_csv("tests/simple_data.csv")

After:

import dask.dataframe as dd
df = dd.read_csv("tests/simple_data.csv")
df.pop("a")

lazargugleta commented 1 year ago

I see, you are correct, @lionsardesai. I mistook those two for methods of the DataFrame class. Would explode, melt, pop, drop, dropna, and drop_duplicates be part of it, since they mutate the object? What side_effects should visualize and head have, if they should be annotated at all?

lazargugleta commented 1 year ago

Thanks to @lionsardesai for giving me better insight into which annotations dask requires. After checking all the methods in detail, the only one that warrants an annotation is pop, because it modifies the DataFrame in place. I updated the annotations and test files.
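For reference, a minimal sketch of the distinction between an in-place method and one that returns a new object. It uses pandas rather than dask, on the assumption that dask's DataFrame.pop follows the same in-place behavior; the frame contents are made up for illustration and are not the contents of tests/simple_data.csv.

```python
import pandas as pd

# Illustrative data only; not the actual contents of tests/simple_data.csv.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# drop returns a new frame, so the original still has both columns.
dropped = df.drop(columns=["b"])
columns_after_drop = list(df.columns)

# pop removes the column from df itself and returns it as a Series.
popped = df.pop("a")
columns_after_pop = list(df.columns)

print(columns_after_drop)
print(columns_after_pop)
```

Because pop changes df itself, a tracer that is unaware of the mutation could slice out the call and reproduce a frame that still contains the popped column, which is why that method is the one that needs a mutation annotation here.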

lazargugleta commented 1 year ago

Very nice! 🎉 Thanks to you too @lionsardesai