Wikia / sroka

Python library for API access and data analysis in Product, BI, Revenue Operations (GAM, GA, Athena etc.)
MIT License

developing intake-sroka #18

Open martindurant opened 5 years ago

martindurant commented 5 years ago

This is a very convenient interface to several data-provider APIs that have been requested in the context of the Intake and Dask projects.

I have just created intake-sroka, so that specific queries to the APIs can be saved as data sources, and stored in Intake's cataloging system. You are very welcome to comment and participate, to bring such data to wider attention!

I wonder, have you thought about how to access data in a parallel or distributed way? Many query outputs might be partitionable, and Dask dataframe makes it easy to turn a set of dataframe partitions into one logical dataframe for parallel, out-of-core and/or distributed processing. We already do this, for example, when reading from Parquet or SQL servers.

martynaut commented 5 years ago

Thank you! We're really happy that you were able to include our library in the Intake project. We will take a look at intake-sroka.

As for parallel/distributed data access, this is a very interesting idea. For now it is not on our roadmap, but I will label this issue as an enhancement, and we can return to it later (or maybe there will be other contributors - or you - who would like to add it to sroka?). For some of the APIs I think it may be problematic (due to restrictions on the number of queries and the time required between queries). Also, for some data sources (like MOAT, Rubicon) the data is not so complex that it would really make sense; for others it definitely would.

As for Dask dataframe itself, you mean that it would be helpful to have output available also as Dask dataframes?

martindurant commented 5 years ago

> As for Dask dataframe itself, you mean that it would be helpful to have output available also as Dask dataframes?

No, I don't think it would be necessary to do this on your end, especially if intake-sroka is to become fully functional. What Dask normally needs is a way to enumerate the partitions of a dataset and to load each partition independently.

As you say, it may well not make sense to bother for some of the APIs - you know what kind of data to expect in each much better than me - but for the cases where there is big data and parallel access might be beneficial, it would certainly be nice to have.

How do you go about testing your calls? Intake-sroka would likely want to copy any methods you have (since I don't actually have easy access to the real APIs).

martindurant commented 5 years ago

In addition, I guess, your APIs also bring up the question of auth: it may be good to do this once and pass the auth objects around between processes, rather than having to re-authenticate in every task. Or maybe it's fast, because only the reading of some JSON file on disk is needed. I simply don't know, so it's worth discussing.
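A minimal sketch of the "authenticate once, pass the result around" idea: read the credentials file a single time and hand the resulting plain dict to each task, instead of re-reading it per task. The file name, token shape, and function names here are assumptions for illustration, not sroka's actual auth flow.

```python
# Sketch: load credentials once, then ship the resulting token to tasks.
# The credentials layout ({"token": ...}) is an assumption for illustration.
import json
import os
import tempfile

def load_credentials(path):
    # One-time, local read, e.g. of a service-account JSON file.
    with open(path) as f:
        return json.load(f)

def run_task(creds, partition):
    # A task receives the already-loaded credentials as a plain dict,
    # which is cheap to serialize and pass between processes.
    assert "token" in creds
    return f"partition {partition} read with token {creds['token']}"

# Demo with a temporary credentials file standing in for the real one.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"token": "abc123"}, f)
    path = f.name

creds = load_credentials(path)          # authenticate once
results = [run_task(creds, p) for p in range(2)]
os.remove(path)
print(results[0])  # partition 0 read with token abc123
```

Whether this is worthwhile depends on how expensive the initial auth is; if it is only a local file read, re-authenticating per task may be acceptable.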

martindurant commented 5 years ago

I'll be circling back to this shortly. The very best help I could use is a way to test data-read functions without actually connecting to the cloud or having valid credentials. Do you have some mocking solution or other testing infrastructure I can use?
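One common way to test read functions without credentials is to substitute the API client with `unittest.mock`. The sketch below uses hypothetical names (`query_api`, `run_query`), not actual sroka functions, to show the pattern.

```python
# Sketch: testing a data-read function without real credentials,
# using unittest.mock. `query_api` and `client.run_query` are
# hypothetical names, not actual sroka APIs.
from unittest import mock

import pandas as pd

def query_api(client, query):
    # Stand-in for a sroka-style call that would hit a remote API.
    rows = client.run_query(query)
    return pd.DataFrame(rows)

def test_query_api_without_credentials():
    # The fake client never touches the network or reads credentials.
    fake_client = mock.Mock()
    fake_client.run_query.return_value = [{"clicks": 10}, {"clicks": 20}]

    df = query_api(fake_client, "SELECT clicks FROM stats")

    assert list(df["clicks"]) == [10, 20]
    fake_client.run_query.assert_called_once()

test_query_api_without_credentials()
print("ok")
```

The same substitution works from pytest via `monkeypatch` or `mock.patch` when the client is created inside the function under test.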

martynaut commented 5 years ago

For Dask: thank you for the clarification. I think that would definitely be helpful with S3 data, and maybe GAM data too, as this can be pretty complex.

For auth: in most cases it is fast. One different case is the first auth for Google products, as those require authenticating through a link.

For a mocking solution/testing infrastructure: we don't have a ready solution for you yet, but this is important for us too for testing purposes, and we are working on it. As for connections, we have notebook scenarios that we always use, but as you said, some tests should be available without credentials. As I mentioned, we are working on tests; do you have any specific use cases that you'd like to be included? We will include those in this repository when ready.

martindurant commented 5 years ago

When I have rounded out intake-sroka more, I will be sure to be in touch and get you to test against your data and credentials, until there are concrete ways to test at least some of the services.