Please describe the feature you'd like to see
The current ingest only includes Stack Overflow content from a point-in-time capture (dated mid-Sep 2023) of the archive at https://archive.org/details/stackexchange. Initial discussion questioned the Stack Overflow terms of use.
Further discussion suggests the general argument is that the content is under a CC license, which requires attribution, and if you fine-tune a model on Stack Overflow data, attribution gets tricky. I guess it's fine for RAG?
https://news.ycombinator.com/item?id=35647257
Describe the solution you'd like
Add an extract function that uses the Stack Exchange API to pull the latest Stack Overflow posts.
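A rough sketch of what such an extract function might look like, to make the request concrete. It assumes the public Stack Exchange API `questions` endpoint and its `site`, `page`, `pagesize`, `fromdate`, `order`, and `sort` parameters; the names `build_params`, `extract_questions`, and `fetch_page` are hypothetical and not taken from this repository. The HTTP call is passed in as a callable so the paging logic can be tested without network access.

```python
import time
from typing import Any, Callable, Dict, Iterator, Optional

# Public Stack Exchange API endpoint for questions (API version is an assumption).
API_URL = "https://api.stackexchange.com/2.3/questions"


def build_params(page: int, fromdate: Optional[int] = None, pagesize: int = 100) -> Dict[str, Any]:
    """Build query parameters for one page of Stack Overflow questions."""
    params: Dict[str, Any] = {
        "site": "stackoverflow",
        "page": page,
        "pagesize": pagesize,  # the API caps pagesize at 100
        "order": "desc",
        "sort": "creation",
    }
    if fromdate is not None:
        # Unix epoch seconds; only questions created after this are returned.
        params["fromdate"] = fromdate
    return params


def extract_questions(
    fetch_page: Callable[[Dict[str, Any]], Dict[str, Any]],
    fromdate: Optional[int] = None,
    max_pages: int = 10,
) -> Iterator[Dict[str, Any]]:
    """Yield questions page by page until the API reports no more pages.

    ``fetch_page`` is a hypothetical hook that takes the query parameters and
    returns the decoded JSON wrapper object for that page.
    """
    for page in range(1, max_pages + 1):
        payload = fetch_page(build_params(page, fromdate))
        yield from payload.get("items", [])
        if not payload.get("has_more", False):
            break
        # The API may ask clients to wait via a "backoff" field (seconds).
        time.sleep(payload.get("backoff", 0))
```

In a real extractor, `fetch_page` could be something like `lambda params: requests.get(API_URL, params=params).json()`, and `fromdate` could be set to the date of the archive capture so that only newer posts are pulled.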
Are there any alternatives to this feature?
Stay with the archive only, but that means retrieval will degrade a bit over time as the capture ages.
Additional context
Acceptance Criteria
[ ] All checks and tests in the CI should pass
[ ] Unit tests
[ ] Integration tests (if the feature relates to a new database or external service)
[ ] Example DAG
[ ] Docstrings in reStructuredText for each method, class, function, and module-level attribute (including an Example DAG showing how it should be used)
[ ] Exception handling in case of errors
[ ] Logging (are we exposing useful information to the user? e.g. source and destination)
[ ] Improve the documentation (README, Sphinx, and any other relevant)
132 adds logic for the stack API extract, but this issue should be left open until we can figure out why it only grabs 200 posts. It appears to be an API rate limit of some sort, but the documentation is unclear.
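One way to narrow down the 200-post cutoff might be to log why paging stops. The sketch below assumes the extractor has access to the Stack Exchange API response wrapper, which carries `has_more`, `quota_remaining`, and `error_name` fields; the helper name `stop_reason` and the `max_pages` argument are hypothetical. It does not claim to identify the actual cause, only to separate the likely candidates (an API error, an exhausted quota, the API reporting no further pages, or a page cap in our own loop).

```python
from typing import Any, Dict


def stop_reason(payload: Dict[str, Any], page: int, max_pages: int) -> str:
    """Classify why paging would stop after this response (hypothetical helper)."""
    if "error_name" in payload:
        # The API returned an error wrapper, e.g. a throttle violation.
        return f"error: {payload['error_name']}"
    if payload.get("quota_remaining") == 0:
        # The request quota for this client is used up.
        return "quota exhausted"
    if not payload.get("has_more", False):
        # The API itself says there is nothing further to fetch.
        return "no more pages"
    if page >= max_pages:
        # Our own loop limit was hit, not an API limit.
        return "page cap reached"
    return "continue"
```

Logging this value after each request would show whether the limit comes from the API or from the extract loop. If the extractor requests 100 posts per page (an assumption), 200 posts corresponds to exactly two pages, so the result for the second page is the one to look at.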