Open SolbiatiAlessandro opened 1 year ago
note: the ingest_docs.ingest_docs function and main.py are untouched, so this doesn't change anything in the core workflow from the README
Added commits for onboarding data_gen.py to Luigi; now you can also run:
(sheep-tutor) root@C.6209776:~/llamaacademy$ python data_gen.py
[...]
===== Luigi Execution Summary =====
Scheduled 3 tasks of which:
* 2 complete ones were encountered:
- 1 IngestDocs(url_docs=https://developers.notion.com/reference, recursive_depth=0)
- 1 LaunchInstructionGeneration(...)
* 1 ran successfully:
- 1 LaunchMachineOutputData(...)
This progress looks :) because there were no failed tasks or missing dependencies
===== Luigi Execution Summary =====
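For readers new to Luigi, the "2 complete ones were encountered" lines mean the scheduler found tasks whose output targets already exist and skipped them, only LaunchMachineOutputData actually ran. A minimal stdlib sketch of that make-style completeness check (the function and file names here are hypothetical, not the actual llamaacademy code):

```python
import os
import tempfile

def run_if_missing(output_path, task_fn):
    # Mimic Luigi's completeness check: a task whose output
    # target already exists is reported "complete" and skipped.
    if os.path.exists(output_path):
        return "complete"  # encountered, but not re-run
    task_fn(output_path)
    return "ran"

def make_output(path):
    # Stand-in for a real task body (scraping, generation, ...).
    with open(path, "w") as f:
        f.write("{}")

tmp = tempfile.mkdtemp()
target = os.path.join(tmp, "generated_instructions.jsonl")
print(run_if_missing(target, make_output))  # first run executes the task: "ran"
print(run_if_missing(target, make_output))  # target now exists: "complete"
```

This is why rerunning python data_gen.py after a partial failure picks up where it left off instead of redoing finished work.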
And the files are correctly generated
(sheep-tutor) root@C.6209776:~/llamaacademy$ ls assets/*.json*
assets/data_LaunchMachineOutputData_500_gpt_3_5_turbo_1_1ab2082155.json assets/seed_instructions.jsonl
assets/generated_instructions_LaunchInstructionGeneration_3_140_assets__f14595ed93.jsonl
Great improvement to the pipeline, thanks for your suggestion of using Luigi. Have you run main.py successfully with the following changes in the pipeline? We might just need one generated example for unit testing.
Thanks @huyphan168! Will post an example of a successful main.py run soon and then we can merge! In the meantime, folks can already use this branch to run some tasks in the pipeline.
Currently the scraping task fails on half of the links (e.g. 26 of 56 docs successfully scraped for the Notion APIs). This means main.py will often fail on the scraping step without being able to continue the workflow. This PR improves reliability by onboarding ingest_docs.py and data_gen.py to the https://github.com/spotify/luigi workflow management library, so tasks can be run independently and at different times / on different machines.

Test Plan
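The independence argued for above can be sketched as per-URL output files: a rerun skips links that were already scraped and retries only the ones that failed. This is a hedged stdlib illustration of the pattern, not the actual ingest_docs.py code; fetch is a stand-in for the real scraper:

```python
import hashlib
import os

def scrape_all(urls, out_dir, fetch):
    # One output file per URL, keyed by a hash of the URL.
    # A rerun skips URLs whose output already exists, so a failure
    # on half the links no longer kills the whole workflow.
    failed = []
    for url in urls:
        name = hashlib.md5(url.encode()).hexdigest() + ".txt"
        target = os.path.join(out_dir, name)
        if os.path.exists(target):
            continue  # already scraped on a previous run
        try:
            text = fetch(url)
        except Exception:
            failed.append(url)  # no target written; retried next run
            continue
        with open(target, "w") as f:
            f.write(text)
    return failed
```

Calling scrape_all again after a transient failure retries only the returned failed list, which is the resumability that moving the tasks onto Luigi targets provides.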