Closed mpgreg closed 10 months ago
Note:
Discussed with @pankajastro - a single HTML extractor makes sense.
Adding more here: a single extractor makes sense, but we still need a thin layer over it for each source, because different sources need different cleanup. For example, for the Astro SDK I exclude docs URLs that contain "autoapi", "genindex.html", "py-modindex.html", ".md", or ".py",
but for the provider docs I exclude "_api", "_modules", "_sources", "changelog.html", "genindex.html", "py-modindex.html", and "#".
So at least a separate task per source makes sense to me. If we also keep a separate DAG per source, it may be easier to run, i.e. we can upsert only the source we want.
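A rough sketch of the thin per-source layer described above, assuming the exclusions are plain substring checks on the URL (the thread does not say how they are matched). The source keys "astro_sdk" and "provider" and the function name filter_doc_urls are hypothetical; only the exclusion strings come from the comment.

```python
from typing import Iterable

# Exclusion strings quoted from the comment above.
# The dictionary keys are hypothetical labels for the two sources.
EXCLUDE_BY_SOURCE: dict[str, tuple[str, ...]] = {
    "astro_sdk": ("autoapi", "genindex.html", "py-modindex.html", ".md", ".py"),
    "provider": (
        "_api",
        "_modules",
        "_sources",
        "changelog.html",
        "genindex.html",
        "py-modindex.html",
        "#",
    ),
}


def filter_doc_urls(urls: Iterable[str], source: str) -> list[str]:
    """Drop every URL that contains one of the source's exclusion strings.

    Assumption: exclusions are substring matches, so ".py" would also
    drop a URL on a host such as docs.python.org.
    """
    excluded = EXCLUDE_BY_SOURCE[source]
    return [url for url in urls if not any(part in url for part in excluded)]
```

With this shape the shared extractor stays untouched and each source only contributes its own exclusion list.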
Created a draft PR for this one.
Just marked the PR as ready for review; I would appreciate a review.
Please describe the feature you'd like to see
Multiple extract functions use almost identical HTML extract logic.
Describe the solution you'd like
These should be consolidated into a single function if possible, using dynamic task mapping like the GitHub extract.
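To illustrate what one consolidated function could look like, here is a minimal standard-library sketch, not the project's actual extractor. It assumes the shared logic amounts to collecting in-site links from a docs page; the names extract_links, base_url, and exclude are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html: str, base_url: str, exclude: tuple[str, ...] = ()) -> list[str]:
    """Return unique absolute links under base_url, skipping excluded ones.

    Assumption: a link is excluded when it contains any string in `exclude`.
    """
    parser = _LinkCollector()
    parser.feed(html)
    links: list[str] = []
    for href in parser.hrefs:
        url = urljoin(base_url, href)  # resolve relative links
        if not url.startswith(base_url):
            continue  # ignore links that leave the docs site
        if any(part in url for part in exclude):
            continue
        if url not in links:
            links.append(url)
    return links
```

Because everything source-specific is passed in as arguments, one task definition could in principle be mapped over a list of (base_url, exclude) pairs, one per source.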
Are there any alternatives to this feature?
Additional context
Acceptance Criteria