Crunch-io / scrunch

Pythonic scripting library for cleaning data in Crunch
GNU Lesser General Public License v3.0
5 stars 7 forks source link

Allowing Project Paths to be passed to scrunch.dataset.fork() method #463

Closed shaikh-ma closed 15 hours ago

shaikh-ma commented 1 month ago

The owner parameter of the fork() method allows creating a dataset fork in a defined project location. But to make it work the owner parameter always needs to be the Project URL e.g., https://<company>.crunch.io/api/projects/<project_id>/.

This often requires the user, an additional step, to manually find out the project URL to pass as the owner parameter.

It is comparatively easier to work with the project folder paths e.g., "Client datasets | Projects A | Project A - 1 " than the project URL, which also is more human friendly. Hence, this Pull Request adds a new feature to fork() method to allow the user to pass the project path to the owner parameter as well, as requested in the issue.

The previous functionality of the fork() method is still the same, only the behavior of the owner parameter has been updated.

On passing project path (instead of a project URL), the fork() method now automatically maps the folder path to its respective Project URL. Thus, preventing the user an additional step for finding out the Project URL manually.

Examples:

import scrunch

ds = scrunch.get_dataset("Dataset X", project="Project A")

# Passing project path to `owner`
ds_fork = ds.fork(preserve_owner=False, owner="Project B | Project 1")

# Passing project URL to `owner`
ds_fork = ds.fork(preserve_owner=False, owner="https://<company>.crunch.io/api/projects/<project_id>/")

Additionally, this PR handles the timeout error (TaskProgressTimeoutError) which often occurs when forking large datasets.

shaikh-ma commented 1 month ago

Hello @jjdelc, @aless10, 👋

Please could you take a look at this PR, when you get a chance.

Thanks

shaikh-ma commented 1 month ago

This sounds a great idea @jjdelc! 🙂

Initially @alextanski & I, had a similar thought to add a new parameter "project".

But we had few concerns about it, then:

  1. It might become confusing both the developer & the users when/how to use the owner & project parameter. And their behaviour with preserve_owner.
  2. Handling of undesired scenarios e.g., user might be passing both these parameters or none. Et cetra.

Hence, the current changes were aimed to not change the current behaviour of the fork() method while adding the needed features additionally. So as to maintain simplicity & backward compatibility.

As you've now suggested to remove the "owner" completely, just a few questions:

  1. How should we handle the scenario when the user needs the fork dataset to be created in their personal project?
  2. Should we rename or remove the preserve_owner parameter?

Thanks

jjdelc commented 1 month ago

You make a good point about backwards compatibility, and I think it should be easy to handle in inside the method.

For now, you may need to keep the owner behavior as it is. And add the new project field. Handle inside the call if both are sent raise an error and suggest to only send project, if none are sent, follow the current behavior.

My other point is not to require the user to send URLs, but allow for a Project() instance to be received, I would like to support arbitrary paths too, but you'll risk it that the path may not exist and have to then take care of creating and such, best to have that responsibility outside of this .fork() call and once they solved that they would have a Project() instance.

Also, add a test checking your payload.

shaikh-ma commented 2 days ago

Hi @jjdelc, @aless10 👋 ,

Please let me know if anything else is needed from our end or it's good to be merged. 😄

jjdelc commented 15 hours ago

Merged on master, I'll make another minor release to include this