improve pipeline(ingest_docs and data_gen) reliability

SolbiatiAlessandro commented 1 year ago

Currently the scraping task fails on about half of the links (e.g. only 26 of 56 docs were successfully scraped for Notion APIs). This means main.py will often fail at the scraping step and cannot continue the workflow.

This PR improves reliability by:

onboarding ingest_docs.py and data_gen.py to the https://github.com/spotify/luigi workflow management library, so tasks can be run independently, at different times, and on different machines
adding try/except handling for Selenium exceptions, so a scraping run can complete even if some pages fail
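
As a rough illustration of the second point, the sketch below shows the general "collect failures and keep going" pattern. It is not the code from this PR: `ScrapeError`, `scrape_all` and `fake_fetch` are hypothetical names, and `ScrapeError` stands in for whatever Selenium exceptions the real scraper catches.

```python
from typing import Callable, Iterable


class ScrapeError(Exception):
    """Hypothetical stand-in for the Selenium exceptions caught in the PR."""


def scrape_all(urls: Iterable[str], fetch: Callable[[str], str]):
    """Scrape every URL, recording failures instead of aborting on the first one."""
    pages = {}
    failed = []
    for url in urls:
        try:
            pages[url] = fetch(url)
        except ScrapeError as exc:
            # Report and continue so one bad link does not end the whole run.
            print(f"skipping {url}: {exc}")
            failed.append(url)
    return pages, failed


def fake_fetch(url: str) -> str:
    # Hypothetical fetcher: fails on any URL containing "broken".
    if "broken" in url:
        raise ScrapeError("page did not load")
    return f"<html>{url}</html>"


pages, failed = scrape_all(
    [
        "https://example.com/a",
        "https://example.com/broken",
        "https://example.com/b",
    ],
    fake_fetch,
)
print(f"scraped {len(pages)} of {len(pages) + len(failed)} pages")
```

Returning the failed URLs alongside the scraped pages keeps the partial result usable while still making it visible which links need a retry.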

Test Plan

(sheep-tutor) root@C.6209776:~/llamaacademy/assets$ python ingest_docs.py
...
===== Luigi Execution Summary =====

Scheduled 3 tasks of which:
* 3 ran successfully:
    - 1 IngestDocs(url_docs=https://developers.notion.com/reference, recursive_depth=1)
    - 1 IngestDocumentsTask(url_docs=https://developers.notion.com/reference, recursive_depth=1)
    - 1 SaveVectorStoreTask(url_docs=https://developers.notion.com/reference, recursive_depth=1)

(sheep-tutor) root@C.6209776:~/llamaacademy/assets$ find . -type f -name "*_0a560f2ee1*"
./documents_IngestDocumentsTask_1_https___develope_0a560f2ee1.pkl
./docs_for_summary_IngestDocumentsTask_1_https___develope_0a560f2ee1.pkl
./vectorstore_SaveVectorStoreTask_1_https___develope_0a560f2ee1.pkl

SolbiatiAlessandro commented 1 year ago

Note: the ingest_docs.ingest_docs function and main.py are untouched, so this doesn't change anything in the core workflow from the README.

SolbiatiAlessandro commented 1 year ago

Added commits onboarding data_gen.py to Luigi, so you can now also run:

(sheep-tutor) root@C.6209776:~/llamaacademy$ python data_gen.py

[...]
===== Luigi Execution Summary =====

Scheduled 3 tasks of which:
* 2 complete ones were encountered:
    - 1 IngestDocs(url_docs=https://developers.notion.com/reference, recursive_depth=0)
    - 1 LaunchInstructionGeneration(...)
* 1 ran successfully:
    - 1 LaunchMachineOutputData(...)

This progress looks :) because there were no failed tasks or missing dependencies

===== Luigi Execution Summary =====

And the files are correctly generated

(sheep-tutor) root@C.6209776:~/llamaacademy$ ls assets/*.json*
assets/data_LaunchMachineOutputData_500_gpt_3_5_turbo_1_1ab2082155.json                   assets/seed_instructions.jsonl
assets/generated_instructions_LaunchInstructionGeneration_3_140_assets__f14595ed93.jsonl
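
The "2 complete ones were encountered" line in the summary above reflects how Luigi decides what to run: by default a task counts as complete when its output already exists, so it is skipped on later runs. The sketch below imitates that idea with the standard library only. It is not Luigi's implementation, and `run_if_missing` and the file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def run_if_missing(output: Path, build) -> str:
    """Run `build` only when `output` does not exist yet."""
    if output.exists():
        return "complete"
    # Write to a temporary name and rename afterwards, so a crash
    # mid-write never leaves a file that looks like a finished output.
    tmp = output.with_name(output.name + ".tmp")
    tmp.write_text(json.dumps(build()))
    tmp.replace(output)
    return "ran"


workdir = Path(tempfile.mkdtemp())
target = workdir / "generated_instructions.jsonl"  # hypothetical file name

first = run_if_missing(target, lambda: {"instruction": "example"})
second = run_if_missing(target, lambda: {"instruction": "never built"})
print(first, second)  # ran complete
```

This is why a failed or interrupted step can be rerun on its own: earlier steps whose output files are still on disk are not repeated.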

huyphan168 commented 1 year ago

Great improvement to the pipeline, and thanks for suggesting Luigi. Have you run main.py successfully with these changes to the pipeline? A single generated example should be enough for unit testing.

SolbiatiAlessandro commented 1 year ago

Thanks @huyphan168! I will post an example of a successful main.py run soon and then we can merge. In the meantime, folks can already use this branch to run some tasks in the pipeline.

danielgross / LlamaAcademy

improve pipeline(ingest_docs and data_gen) reliability #10
