DeputyApp / singer-tap-deputy

Deputy Stitch Stream
MIT License
3 stars 7 forks source link

Submission Review Notes #7

Closed dmosorast closed 5 years ago

dmosorast commented 5 years ago

Adding notes to this issue as I review this tap for acceptance. These should be rather small but important changes to make before we can move this tap forward.

Metadata It looks like the custom metadata for resource is written has the wrong namespace. It should be tap-deputy.resource.

Del on kwargs Should this line be calling delete on a parameter passed in? I'm not quite sure how Python behaves in all kwargs situations (e.g., splat, named keywords, etc.), but this seems like a side-effect that would be better avoided.

all_selected This should be removed from here. I see the usefulness of this feature to enable testing without having to select all streams, however it should be an optional configuration argument, rather than implied by the absence of a catalog. The absence of a catalog should mean no streams are selected, since that's what it is.

current_stream This isn't being handled correctly (here), it should attempt to run through every stream when it finds a current_stream state entry, rather than skipping earlier streams and only attempting them on the next tap run. We just created a helper function to manage this called get_selected_streams on the Catalog class in singer-python 5.7 (implementation)

awm33 commented 5 years ago

@dmosorast

Metadata It looks like the custom metadata for resource is written has the wrong namespace. It should be tap-deputy.resource.

Done.

Del on kwargs Should this line be calling delete on a parameter passed in? I'm not quite sure how Python behaves in all kwargs situations (e.g., splat, named keywords, etc.), but this seems like a side-effect that would be better avoided.

kwargs are a dictionary and can be manipulated like one. Here, the scope would be isolated to a function.

all_selected This should be removed from here. I see the usefulness of this feature to enable testing without having to select all streams, however it should be an optional configuration argument, rather than implied by the absence of a catalog. The absence of a catalog should mean no streams are selected, since that's what it is.

The intention of that is to handle the case when the tap is run without a catalog. I actually don't usually handle this case, but saw it on another new tap I didn't author and thought it was a good idea since the --catalog is an option. I can remove the default if desired and the tap would hard fail without a catalog.

current_stream This isn't being handled correctly (here), it should attempt to run through every stream when it finds a current_stream state entry, rather than skipping earlier streams and only attempting them on the next tap run. We just created a helper function to manage this called get_selected_streams on the Catalog class in singer-python 5.7 (implementation)

Taps are supposed to pick up where they left off using state to store the currently syncing stream. Are you proposing to remove this functionality?

dmosorast commented 5 years ago

Re: del on kwargs

I get that kwargs is a dict, but my concern was that it would modify the caller's reference to the args object in the event it was splatted. Just tested locally to confirm it's a copy, and it looks good (example below). I'm not a fan of that magic being implied, but it's just a style concern in that case.

>>> def side_effect(**kwargs):
...     del kwargs["foo"]
...
>>> my_thing = {"foo": 1}
>>> my_thing
{'foo': 1}
>>> side_effect(**my_thing)
>>> my_thing
{'foo': 1}

Re: Lack of a --catalog parameter.

I'm not sure where my head was at when I wrote that, but looking through the Spec and how we handle the case where a catalog isn't provided, it seems like this should be okay. :+1:

Re: current_stream handling

You're correct that a tap should resume from the stream in the state, but it should run through all the streams when it's invoked. The way I read this is that if there are streams ["A","B","C","D"] and a state with {"current_stream": "C"}, the tap will only run ["C","D"], whereas it should run ["C","D","A","B"] in that order.

awm33 commented 5 years ago

@dmosorast That's not how other taps behave and I don't believe that to be desired behavior. Eg, a tap that runs once a day is running, it syncs "A", "B", and fails or is terminated on "C". It retries 10 minutes later, picking up at "C", then finishing with "D". "A" and "B" should be the next run.

dmosorast commented 5 years ago

That approach was an oversight, and we're updating taps to treat the streams as a kind of circular queue, where if a tap is run daily, fails day 1 and succeeds day 2, the user doesn't have to wait until day 3 or intervene manually to get the other streams' data.

This is the official implementation, on the Catalog class, but there are a few examples in the wild in tap-hubspot and tap-shopify, for example.

jacobrobertbaca commented 5 years ago

Hey @awm33 I'm taking over for @dmosorast, any questions/updates here?

awm33 commented 5 years ago

Hi @jacobrobertbaca , I updated the tap to use the recently released get_selected_streams and set_currently_syncing functions, as Dan instructed above.

dmosorast commented 5 years ago

These fixes are complete! Closing.