Evaluate some integration with Pandas

BelgianBiodiversityPlatform / python-dwca-reader

🐍 A Python package to read Darwin Core Archive (DwC-A) files.

BSD 3-Clause "New" or "Revised" License

Evaluate some integration with Pandas #58

Closed niconoe closed 6 years ago

tucotuco commented 7 years ago

Just a cautionary note. Pandas is dependent on NumPy, which is based on C libraries and will not function under Jython.

stijnvanhoey commented 7 years ago

I think it would be worthwhile to provide the integration without making Pandas a core dependency of the package. That way, Pandas users have a convenient entry point to the data, while the core functionality stays independent of Pandas (i.e. of NumPy and C libraries).

On the other hand, maybe just an example in the documentation would be sufficient. For example, as DwCAReader extracts the archive to a temporary directory, that directory can be used directly to read the data into memory with Pandas:

import pandas as pd
from dwca.read import DwCAReader

dwca_name = './broedvogel_corrupted_subset.zip'  # just a test dataset
with DwCAReader(dwca_name) as dwca:
    # Get the location of the core file in the temporary folder
    # (this could be another file of the archive as well)
    path = dwca.absolute_temporary_path('occurrence.txt')

    # Read the core file as a DataFrame
    core_df = pd.read_csv(path, delimiter="\t")

An alternative would be to create the DataFrame from the iterator (for row in dwca: ...)
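
The iterator-based alternative could be sketched roughly as follows. This is only an illustration: it assumes each row exposes its values as a dict through a `data` attribute (check the reader's documentation for the exact row API), and it uses a hypothetical `FakeRow` stand-in plus made-up records so the snippet runs without an archive on disk.

```python
from dataclasses import dataclass, field

import pandas as pd


@dataclass
class FakeRow:
    """Hypothetical stand-in for a row yielded when iterating over a reader."""
    data: dict = field(default_factory=dict)


def rows_to_dataframe(rows):
    """Build a DataFrame from any iterable of objects with a dict-like `data` attribute."""
    return pd.DataFrame.from_records([row.data for row in rows])


# Made-up records; with a real archive this would be `rows_to_dataframe(dwca)`
# inside the `with DwCAReader(...) as dwca:` block (assumption, see above).
rows = [
    FakeRow({"scientificName": "Turdus merula", "individualCount": "2"}),
    FakeRow({"scientificName": "Parus major", "individualCount": "5"}),
]
df = rows_to_dataframe(rows)
```

Since every value passes through Python objects first, this route is likely slower than `pd.read_csv` on large archives, but it does not depend on where the extracted files live.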

niconoe commented 7 years ago

Hi @tucotuco and @stijnvanhoey, thanks a lot for your comments!

John's point is important: I don't want to add a dependency that is heavy, possibly difficult to install, or possibly incompatible, at least not as a core/required one.

But I also want to make it easier with Pandas, so I really like @stijnvanhoey's idea about doing this as an example in the documentation. However, I'm not familiar enough with Pandas to know if that use case is good enough or if we should provide more to our users.

Do you have any opinion on this? Would you like to contribute to the docs (writing code examples and/or reviewing mine)? That would be great!

stijnvanhoey commented 7 years ago

I agree and I'm certainly willing to contribute to the documentation.

In terms of functionality, getting the dwca data into a DataFrame (as shown above) is the most important part. Once the data is in a DataFrame, the whole set of Pandas functionality is available and Pandas users will know what to do. So I would not write a whole Pandas tutorial, but rather showcase a few introductory slicing examples, some boolean filtering and a plot example for users who might not know why Pandas could be useful. I'll provide some examples.
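
As a rough idea of what such introductory examples might cover, here is a minimal sketch of slicing, boolean filtering and a simple aggregation. The table is made up for illustration and read from an in-memory string instead of a file extracted from an archive; the column names are assumptions, not taken from the test dataset.

```python
import io

import pandas as pd

# Tiny made-up occurrence table standing in for a core file read from an archive.
tsv = (
    "id\tscientificName\tindividualCount\tyear\n"
    "1\tTurdus merula\t2\t2014\n"
    "2\tParus major\t5\t2015\n"
    "3\tTurdus merula\t1\t2015\n"
)
core_df = pd.read_csv(io.StringIO(tsv), delimiter="\t")

# Slicing: select one column, or the first two rows by position
names = core_df["scientificName"]
first_two = core_df.iloc[:2]

# Boolean filtering: records from 2015 with more than one individual
recent = core_df[(core_df["year"] == 2015) & (core_df["individualCount"] > 1)]

# Simple aggregation: total number of individuals per species
totals = core_df.groupby("scientificName")["individualCount"].sum()
```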

Furthermore, we can guide people to other resources. As an example, we prepared some material for a doctoral schools course on data manipulation, with an application on biodiversity data cleaning and visualisations.

peterdesmet commented 7 years ago

Good idea! 👍

niconoe commented 6 years ago

FINALLY merged @stijnvanhoey's contribution, very sorry about the delay.

Next steps for easier/cleaner pandas integration and documentation

[x] Improve the tutorial in terms of using the descriptor to provide arguments to read_csv. Done for the first example, still needs to be done for the rest.
[ ] pd.merge(): shouldn't we use the descriptors for the left_on and right_on parameters?
[x] Make sure those examples work fine in different cases (header lines or not, ...)
[x] Optional: add a method to make all this automatic (Example: pd.read_csv(dwca.pd_read_arguments('occurrence.txt')))
[x] Move Pandas tutorial to its own page in doc?
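
To make the idea of deriving the `read_csv` arguments from a descriptor more concrete, here is a minimal sketch. Everything in it is hypothetical: `FileDescriptor` is a stand-in dataclass whose attribute names loosely mirror the `fieldsTerminatedBy`, `fieldsEnclosedBy`, `ignoreHeaderLines` and `encoding` settings of a DwC-A metafile, and `pd_read_arguments` is written here as a free function, not as the reader's actual method. It also assumes that, when header lines are present, the last ignored line carries the column names.

```python
from dataclasses import dataclass


@dataclass
class FileDescriptor:
    """Hypothetical stand-in for the reader's description of one data file."""
    fields_terminated_by: str = "\t"
    fields_enclosed_by: str = ""
    lines_to_ignore: int = 1
    file_encoding: str = "utf-8"


def pd_read_arguments(descriptor):
    """Translate descriptor settings into keyword arguments for pandas.read_csv."""
    kwargs = {
        "delimiter": descriptor.fields_terminated_by,
        "encoding": descriptor.file_encoding,
        # Assumption: the last ignored line (0-based index) holds the column names;
        # without ignored lines there is no header row at all.
        "header": descriptor.lines_to_ignore - 1 if descriptor.lines_to_ignore > 0 else None,
    }
    if descriptor.fields_enclosed_by:
        kwargs["quotechar"] = descriptor.fields_enclosed_by
    return kwargs


default_kwargs = pd_read_arguments(FileDescriptor())
headerless_kwargs = pd_read_arguments(
    FileDescriptor(lines_to_ignore=0, fields_enclosed_by='"')
)
```

With a helper of this shape, the call would look like `pd.read_csv(path, **kwargs)`. Because the helper only returns a plain dict, it can live in the package without importing Pandas.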

stijnvanhoey commented 6 years ago

Thanks for following up on the work and for the improvements you already made on using the descriptors.

Considering the next steps, I like the idea of a dwca.pd_read_arguments method, as it enables further integration without explicitly requiring Pandas.

For the pd.merge example, I would use the example column names explicitly and not the descriptors. Once the data is read into a df, it is more or less decoupled from the dwca-reader and it is all about Pandas handling. People will typically do a taxon_df.head(), check the column names and work with these names to do selections, slicing, etc., and also merge operations. I would keep that division as such: you wouldn't use the descriptors in the slicing examples either.
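
A merge written with explicit column names might look like the sketch below. The two frames and their column names (`id` in the core table, `coreid` in the extension table) are made up for illustration; in practice they would be whatever `taxon_df.head()` shows.

```python
import pandas as pd

# Made-up core table: one row per taxon
taxon_df = pd.DataFrame(
    {"id": [1, 2], "scientificName": ["Turdus merula", "Parus major"]}
)

# Made-up extension table: several vernacular names can point to one taxon
vernacular_df = pd.DataFrame(
    {
        "coreid": [1, 1, 2],
        "vernacularName": ["Common blackbird", "Merel", "Great tit"],
    }
)

# Join the extension rows to their core row using the column names directly
merged = pd.merge(taxon_df, vernacular_df, left_on="id", right_on="coreid")
```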

Having the Pandas tutorial in a separate page is fine for me.