Migrate Analysis Pipeline to A New Repository

snakedye commented 1 month ago

Description: The entire analysis pipeline from the fertiscan-backend repository will be migrated into a new repository, fertiscan-pipeline. This migration aims to modularize the project, making the pipeline code reusable and maintainable as a standalone package.

This will reduce costs because the GPT-related code will no longer run in the workflow unless it has been modified.
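The cost argument boils down to a gating rule: expensive pipeline tests only need to run when pipeline files change. Splitting the repositories enforces this implicitly, but the rule itself can be sketched as below. The path patterns and the function name `touches_pipeline` are hypothetical and do not reflect the actual repository layout.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns; the real repository layout may differ.
PIPELINE_PATTERNS = ("pipeline/*", "tests/pipeline/*")


def touches_pipeline(changed_files, patterns=PIPELINE_PATTERNS):
    """Return True if any changed file matches one of the pipeline patterns."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in patterns
    )


# A change limited to the router would skip the expensive GPT tests.
router_only = touches_pipeline(["app.py", "README.md"])
# A change under the (assumed) pipeline directory would trigger them.
pipeline_change = touches_pipeline(["pipeline/gpt.py"])
```

With two repositories, the same effect comes for free: a commit to fertiscan-backend never triggers the fertiscan-pipeline workflow.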

Tasks:

[x] Migrate the entire analysis pipeline code from fertiscan-backend to the fertiscan-pipeline repository.
[x] Create a pyproject.toml in fertiscan-pipeline to ensure proper packaging and dependencies.
[x] Add fertiscan-pipeline as a dependency in the fertiscan-backend project by updating requirements.txt.
[x] Update import statements in fertiscan-backend to use the new fertiscan_pipeline package.
[x] https://github.com/ai-cfia/fertiscan-pipeline/issues/3
[ ] Update documentation to reflect the changes in the code structure.
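For the import-update task, a small rewrite helper along these lines could do most of the mechanical work. This is only a sketch: it assumes the old top-level package was called `backend` (based on the `./backend` directory mentioned in this thread), and the module names in the example are made up. It only handles imports where the old package is the first name on the line.

```python
import re

# Assumption: the old package name is "backend"; the new name comes from the task list.
OLD, NEW = "backend", "fertiscan_pipeline"

# Match "from backend..." or "import backend..." at the start of a line,
# without touching longer names such as "backendx".
_IMPORT = re.compile(rf"^(\s*(?:from|import)\s+){OLD}(?=[.\s]|$)", re.MULTILINE)


def rewrite_imports(source):
    """Return `source` with imports of the old package pointed at the new one."""
    return _IMPORT.sub(rf"\g<1>{NEW}", source)


# Hypothetical module and class names, for illustration only.
before = "from backend.ocr import OCR\nimport backend\nimport backendx\n"
after = rewrite_imports(before)
```

Anything the helper misses (for example `import os, backend`) would still need a manual pass.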

Acceptance Criteria:

The analysis pipeline code is successfully migrated to the fertiscan-pipeline repository.
The fertiscan-backend project uses the fertiscan-pipeline package without any issues.
The workflow on fertiscan-pipeline runs successfully.

Endlessflow commented 1 month ago

Correct me if I am wrong, but the reason we want to create a new repo is to ensure that when we make changes to our image processing pipeline, we can run adequate performance testing to determine whether the modifications are better or worse than the current pipeline. Since these tests are resource-intensive and we want to make it clear when a modification is made to the processing pipeline, we decided it would be beneficial to turn that part of the code into its own independent module.

With this said, I wonder if the scope of the new repo is just GPT or the whole processing pipeline, especially as we flesh it out in the future with more components to get better results.

k-allagbe commented 1 month ago

> Correct me if I am wrong, but the reason we want to create a new repo is to ensure that when we make changes to our image processing pipeline, we can run adequate performance testing to determine whether the modifications are better or worse than the current pipeline. Since these tests are resource-intensive and we want to make it clear when a modification is made to the processing pipeline, we decided it would be beneficial to turn that part of the code into its own independent module.
>
> With this said, I wonder if the scope of the new repo is just GPT or the whole processing pipeline, especially as we flesh it out in the future with more components to get better results.

That's correct.

That said, it is common practice to separate services into their own packages for reusability. The end goal, which is to avoid testing the heavy processes at every small change in the backend, will still be achieved: the OCR and the GPT packages will each be tested in their own repository, with input and output data carefully selected to be representative of the checkpoints at which each is invoked in the backend. Basically the same thing, but split in two for the benefit of modularity.
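A checkpoint test of this kind could look roughly like the sketch below: each stage is run on recorded inputs and compared against the expected outputs captured at that checkpoint. The names `run_checkpoint` and `fake_ocr` and the fixture data are placeholders, not the project's actual code, and exact equality is a simplification (GPT output would more likely need a scoring metric).

```python
def run_checkpoint(stage, fixtures):
    """Run `stage` on each recorded input and collect mismatches.

    `fixtures` maps a fixture name to an (input, expected_output) pair.
    Returns a list of (name, expected, actual) tuples for failing fixtures.
    """
    failures = []
    for name, (recorded_input, expected_output) in fixtures.items():
        actual = stage(recorded_input)
        if actual != expected_output:
            failures.append((name, expected_output, actual))
    return failures


# Stand-in for a real stage (OCR or GPT extraction); purely illustrative.
def fake_ocr(image_bytes):
    return image_bytes.decode("utf-8").upper()


# Made-up fixture: one recorded input with its expected output.
fixtures = {"label_a": (b"npk 10-10-10", "NPK 10-10-10")}
failures = run_checkpoint(fake_ocr, fixtures)
```

The same harness works whether the stages live in one repository or two; only the place where the fixtures are stored changes.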

Edit: It is still possible to have both in the same repository and implement a single pipeline-like test on both.

k-allagbe commented 1 month ago

After discussion with @Endlessflow, testing the whole processing pipeline makes more sense than testing each component individually. @snakedye, please consider exporting both OCR and GPT to the same repository.

snakedye commented 1 month ago

@Endlessflow @k-allagbe We might as well move everything in ./backend into the new repo, because how you build the document from the raw images is also part of the pipeline and will affect the end result. All that will be left is the Flask router.

If we go with that approach, is there a substantial gain over what we have now?

Right now, most of the code is related to the pipeline, and the Flask router is only changing because we are still working on the API. Once that's set, it won't change much, and most of the changes will affect the pipeline anyway.

k-allagbe commented 1 month ago

> @Endlessflow @k-allagbe We might as well move everything in ./backend into the new repo, because how you build the document from the raw images is also part of the pipeline and will affect the end result. All that will be left is the Flask router.
>
> If we go with that approach, is there a substantial gain over what we have now?
>
> Right now, most of the code is related to the pipeline, and the Flask router is only changing because we are still working on the API. Once that's set, it won't change much, and most of the changes will affect the pipeline anyway.

I'm comfortable with having the whole analysis pipeline separated into the new repository (which I suggest naming fertiscan-ai-pipeline or fertiscan-analysis-pipeline). The backend is also responsible for communicating with the database and any frontend client, which may or may not be subject to rapid changes. In any case, this is good practice.

Endlessflow commented 1 month ago

For transparency, we reached a consensus on the idea of separating the entire analysis pipeline into a new repository.

ai-cfia / fertiscan-backend

Migrate Analysis Pipeline to A New Repository #94