Closed burnash closed 6 months ago
@burnash I have created a ticket to fix the paginator and jsonpath detection for vimeo: https://github.com/orgs/dlt-hub/projects/12/views/1?pane=issue&itemId=63884181. I did not see any other actionables from your notes, let me know if I missed something that is part of the generator.
Hackathon Feedback
Questions & Answers
Is it clear why we have created this, why it is useful, and what it is about?
Yes
Is it clear how the generator works? Did you manage to generate anything in the first 10 minutes after selecting a spec? What is missing from the setup instructions or the output of the generator?
Yes, I was able to generate a pipeline, and the setup instructions were very clear. What’s missing IMO is a way to “regenerate” an existing pipeline: at one point while experimenting I decided to add more endpoints to the pipeline, and AFAIK the way to do that was to delete the pipeline folder and regenerate the pipeline from scratch.
Is the resulting dlt rest_api source legible? Should it be structured differently or annotated with comments better?
To me it’s legible because I know rest_api pretty well. I like how the folder is structured. I also liked the placeholder params. One thing that I miss is example values for those params (see my raw feedback for details).
Could you run the pipeline after generation? Did it produce some data?
I tried the pokemon and vimeo pipelines. Both failed on the first run. It looks like the pokemon pipeline was out of sync with the actual API and the Vimeo one had some issues in the generated rest_api dict (see the full description in the raw notes).
If something failed, was the reason for the failure clear? What error message would have been better?
The generator didn’t fail, the resulting pipelines did, so this is only relevant to rest_api:
I understood the errors because I’m familiar with rest_api. However, some messages are potentially not very clear.
Was anything incorrectly converted from the spec to the rest_api definition although it is clear how it should have been generated? If so, which section and what should have been produced?
For Vimeo pipeline:
Paginator settings were generated incorrectly (see the raw notes for the details):
While the response was:
Total path is there but was generated explicitly as an empty string.
The “child” resource had an incorrect path
Are there any settings, options, or commands you are missing from the tool?
No
Raw Notes
Stacktrace:
I manually changed the resolve param from “id” to “name” and extraction worked.
I think what is lacking is some info about the amount of data to be extracted: e.g. in the case of pokemons there were 2k+ objects, so a lot of requests.
Having enlighten as progress is nice, but it always shows 0%.
After loading, the pipeline crashed with
The terminal hung in enlighten. I think I forgot to change the primary key to name as well.
I decided to switch to the Vimeo API. It was easy to get an API key for testing.
I selected two endpoints: /categories and /categories/{category}
The pipeline was generated without any problems.
I decided to first test the auth with curl and asked ChatGPT to generate a curl command for the root endpoint. This was successful.
I opened a pipeline file and was pleased to find
I don’t know what values are available for each parameter, but I found the descriptions in the OpenAPI file.
There was no information about authentication, so when I ran the pipeline it crashed with 401 (Unauthorized):
Since I know how rest_api works, I edited the pipeline files to enable bearer auth for the token I got.
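For readers less familiar with rest_api, the kind of edit described above can be sketched as a client section with bearer auth. This is a minimal sketch, not the generated file: the "type"/"token" keys follow the documented rest_api auth configuration, while the helper name build_client_config and the token placeholder are hypothetical.

```python
from typing import Any, Dict


def build_client_config(token: str) -> Dict[str, Any]:
    """Return a rest_api-style client section that sends a bearer token.

    Hypothetical helper for illustration; the generated pipeline
    defines this section inline in its config dict.
    """
    return {
        "base_url": "https://api.vimeo.com",
        "auth": {"type": "bearer", "token": token},
    }


# Placeholder value; a real pipeline would more likely read the token
# from dlt secrets than hard-code it.
client = build_client_config("<your-vimeo-token>")
```

With a section like this in place, every request made by the source should carry an Authorization header, which is what the 401 above was missing.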
The generator’s detector incorrectly detected the paginator:
Note that total_path is an empty string, while the response has links to the next and previous pages plus a total key:
After I added the authentication, the pipeline failed with:
It does not reveal the body of the error, but with curl I was able to see it:
Sidenote: the paginator started from 0 (the default). I think an API whose pages start at 0 is an extremely rare case, so we’d need to change the default here https://github.com/dlt-hub/dlt/blob/devel/dlt/sources/helpers/rest_client/paginators.py#L222 to 1.
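To illustrate why the starting page matters, here is a small self-contained sketch. It does not use the dlt paginator classes: FAKE_PAGES, fetch_page and paginate are hypothetical stand-ins for an API whose pages are numbered from 1.

```python
from typing import Dict, Iterator, List

# Hypothetical in-memory API whose pages are numbered from 1.
FAKE_PAGES: Dict[int, List[str]] = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}


def fetch_page(page: int) -> List[str]:
    """Simulate a request; unknown page numbers are rejected like a 4xx."""
    if page not in FAKE_PAGES:
        raise ValueError(f"invalid page: {page}")
    return FAKE_PAGES[page]


def paginate(base_page: int, total_pages: int) -> Iterator[List[str]]:
    """Yield `total_pages` consecutive pages, starting from `base_page`."""
    for page in range(base_page, base_page + total_pages):
        yield fetch_page(page)


# Starting from 1 walks all three pages.
items_from_one = [item for page in paginate(1, 3) for item in page]

# Starting from 0 fails on the very first request.
try:
    list(paginate(0, 3))
    zero_based_ok = True
except ValueError:
    zero_based_ok = False
```

Against an API like this, a default of 0 breaks on the first request, while a default of 1 works without extra configuration.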
I changed the paginator to “json_response”; the 401 was gone, but a new error appeared:
Since I know rest_api, I figured that the problem was the explicit data selector on categories, which was wrong.
I commented it out to allow for autodetection and did another run. It looked like the list endpoint data was now fetched, but the problem now was with the child endpoint:
The URL looks weird: categories is duplicated in https://api.vimeo.com/categories//categories/adsandcommercials
Note: it’d be great to be able to see example item data when debugging. I did just that by inserting a pdb breakpoint into the transformer code in rest_api. When I printed a sample item, I found that the resolved param was wrong. In the generated config:
It is referencing the uri field, and this field already has the /categories/ prefix: {'uri': '/categories/adsandcommercials', 'name': 'Ads and Commercials', 'link': ..}
After I changed it to
The pipeline ran successfully. 🚀
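The duplicated URL described above can be reproduced with plain string formatting. This is only a sketch that mimics how a resolved value is substituted into a child path. It is not the rest_api implementation, and the child_url helper and the slug extraction are hypothetical illustrations rather than the exact change made in the config.

```python
BASE_URL = "https://api.vimeo.com"
CHILD_PATH = "categories/{category}"

# Sample parent item, trimmed to the fields quoted in the notes above.
item = {"uri": "/categories/adsandcommercials", "name": "Ads and Commercials"}


def child_url(resolved_value: str) -> str:
    """Substitute the resolved value into the child path (hypothetical helper)."""
    return f"{BASE_URL}/{CHILD_PATH.format(category=resolved_value)}"


# Resolving from "uri" repeats the prefix that the path already contains.
broken = child_url(item["uri"])

# Hypothetical workaround for illustration: keep only the last path segment.
slug = item["uri"].rstrip("/").rsplit("/", 1)[-1]
fixed = child_url(slug)
```

Here broken reproduces the doubled categories segment from the failing request, while fixed points at a single-prefixed category URL.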