Collaboration possibility?

ba66e77 commented 1 year ago

https://gitlab.com/ba66e77/data-file-summarizer

I've been working on a summarizer project, linked above, whose goal is to provide a simple summary of a data file's contents as a way of quickly orienting a person to the file. There's some overlap between the feature set of what I'm working on and what you have here, but there are a good number of unique features as well. Would you be interested in collaborating? I can see, potentially, my summarizer using overviewpy as a dependency or overviewpy incorporating the features of what I've been working on.

cosimameyer commented 1 year ago

Hi Barrett,

Thanks for reaching out! This sounds great - happy to find someone who has a similar vision in mind ☺️ I can't access the link you shared (I get a 404, probably because it's a private repository?) but I think both approaches that you suggested make sense. Depending on the features you have in mind/are working on, it might make sense to move your summarizer project to overviewpy ☺️

ba66e77 commented 1 year ago

🤦 Well, that's embarrassing. Yeah, I had forgotten it was still a private repo. Should be public now.

There's a validation feature in the summarizer which is not likely to be of broad interest. I have an Issue filed to pull that out.

What could be of interest to overviewpy is the command line interface and generation of an HTML file that presents, for each column in the file:

- The column name
- The number of non-null/empty values for that column
- The number of unique non-null values for that column
- A set of 5 example values found in that column

I've found it really helpful for those "what's in this file the client sent me" needs and to see what might need deeper investigation, e.g., "Hey, y'all said this XID column should be a unique identifier but in this file you have 5,000 records and only 4,500 unique values of XID"
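To make the per-column summary above concrete, here is a minimal sketch using only the Python standard library. It is not the summarizer's actual code: the function name `summarize_columns`, the dictionary keys, and the sample data are all hypothetical, and it assumes "null/empty" means `None` or a blank string. It leaves out the command line interface and the HTML output.

```python
import csv
import io


def summarize_columns(rows, n_examples=5):
    """Return one summary dict per column for a list of row dicts.

    Hypothetical helper for illustration; not part of either project.
    """
    if not rows:
        return []
    summaries = []
    for name in rows[0].keys():
        # Assumption: None and empty/whitespace-only strings count as missing.
        values = [
            r[name]
            for r in rows
            if r[name] is not None and str(r[name]).strip() != ""
        ]
        # dict.fromkeys de-duplicates while keeping first-seen order.
        unique = list(dict.fromkeys(values))
        summaries.append(
            {
                "column": name,
                "non_null": len(values),
                "unique": len(unique),
                "examples": unique[:n_examples],
            }
        )
    return summaries


# Made-up sample file: XID is supposed to be unique but repeats the value 2.
sample = io.StringIO("XID,city\n1,Berlin\n2,\n2,Paris\n3,Berlin\n")
rows = list(csv.DictReader(sample))
summary = summarize_columns(rows)

for col in summary:
    # Flag columns where the unique count is lower than the non-null count.
    flag = "" if col["unique"] == col["non_null"] else "  <- has repeated values"
    print(col["column"], col["non_null"], col["unique"], col["examples"], flag)
```

Comparing the `unique` count with the `non_null` count is what surfaces the "this should be a unique identifier" problem: in the sample, XID has 4 non-null values but only 3 unique ones.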

ba66e77 commented 1 year ago

We may have a use case mismatch between the projects.

I intended my summarizer project to be pretty broadly applicable. As a user story, the purpose is something like...

As a member of a data-focused project, I need to quickly evaluate if a provided file has the columns and values I expect to see, so that I can tell if it at least appears to be fit for some purpose (e.g., ingestion to a data lake) before I invest time in deeply examining it.

As I look more at overviewpy (and the related R project), it looks like you're targeting a more specific use case where you want to examine time series data and already have knowledge of the data key and column names the file contains. (I don't have enough understanding of overviewpy to write a summarizing user story for it.)

Is expanding the use case of overviewpy something you're interested in or do you want to keep it focused on time series data?

cosimameyer commented 1 year ago

Thanks so much for sharing your thoughts! When starting with overviewR, our main use case was in fact what overview_tab (and overview_latex) do - prepare some time-cross-sectional data as an overview (for publication). But we always had a much broader vision in mind and added overview_na with exactly the user story in mind that you described above: someone gives you a data set and you want to quickly get a glance at it and see how much missing data you have, etc.

For now, we have mainly stayed with the specific use case of data where you know the time and id features, but that was mainly because we saw the need for it and no other package has covered this functionality so far. There are alternatives in R (https://github.com/cosimameyer/overviewR#whats-unique-about-overviewr) and in Python (sweetviz, autoviz, pandas-profiling/ydata-profiling or dtale come to mind) that give you a general overview of the data, but they are often not the perfect tool, whether because they are dependency-heavy or because they lack functionality. Ideally, I was thinking of something light, without a lot of dependencies, that could also easily be added to an air-gapped environment (if the workflow permits). But that’s not a must 😊

I would love to see overviewpy become something like a Swiss army knife, an all-round tool for getting an overview of different kinds of data sets without necessarily knowing beforehand what’s in the data 😊

Let me know what you think! 😊

ba66e77 commented 1 year ago

Ok, awesome. Then I think that obviates my use case concern.

If you've had a chance to look at the summary functionality in the library I've got, and feel comfortable with that, I'll start working on integrating that into overviewpy and open up some PRs.

cosimameyer commented 1 year ago

That sounds fantastic! My understanding is that we would integrate the summarizer's functionality as a new method that the user can call to generate the summarizer output. Should we give the new main method a name that follows the overview_ naming pattern? I thought, for instance, of overview_summary (or _summarizer, if you prefer 😊) but am open to other ideas 😊

Let me know what you think 😊

ba66e77 commented 1 year ago

Yeah, that's kind of what I'm thinking as well. What I'm pondering at the moment is how best to align the object-oriented approach the summarizer takes with the functional approach overviewpy takes. How would you feel about adding an object for the work overviewpy currently does and having the existing functions delegate to that object?
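As a rough illustration of that delegation idea, here is a sketch in plain Python. The class `Overview`, its method `missing_counts`, and the wrapper `overview_missing` are hypothetical names chosen for the example; they do not reproduce overviewpy's real functions or signatures. The point is only the shape: the object owns the logic, and a function-style entry point stays available for existing callers.

```python
class Overview:
    """Hypothetical object that holds the data and the overview logic."""

    def __init__(self, rows):
        # Rows are assumed to be dicts mapping column name to value.
        self.rows = list(rows)

    def missing_counts(self):
        """Count missing (None or empty-string) values per column."""
        counts = {}
        for row in self.rows:
            for name, value in row.items():
                counts.setdefault(name, 0)
                if value is None or value == "":
                    counts[name] += 1
        return counts


def overview_missing(rows):
    """Function-style entry point that simply delegates to the object."""
    return Overview(rows).missing_counts()


# Made-up rows for illustration.
rows = [
    {"id": "a", "year": "2020"},
    {"id": "b", "year": ""},
    {"id": "", "year": None},
]
result = overview_missing(rows)
print(result)
```

With this layout, callers of the function see no change, while new features such as a summary method could be added to the object without touching the existing function signatures.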

cosimameyer commented 1 year ago

Absolutely works for me ☺️ (and thanks for the PR - I’m excited to join forces here!)

cosimameyer / overviewpy

Collaboration possibility? #14