TogetherCrew / airflow-dags


[Hivemind] GitHub duplicate documents creation #190

Closed amindadgar closed 3 months ago

amindadgar commented 3 months ago

Since we've moved to a llama-index pipeline and now have a docstore database for detecting duplicate documents, we have to assign a specific id to each llama-index document so that no duplicate data is created.

Currently, a random id is assigned to each document when it is created. Instead, we have to assign the corresponding PR, issue, or comment id to each document, based on its content.
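A minimal sketch of the idea, using only the standard library. The names `GitHubItem`, `make_doc_id`, and `upsert`, the `github_<kind>_<id>` format, and the plain dict standing in for the docstore are all hypothetical and not taken from the repository. In the real pipeline the resulting string would presumably be set as the llama-index document's id in place of the random default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GitHubItem:
    """Hypothetical container for one fetched GitHub record."""
    kind: str        # "pr", "issue", or "comment"
    github_id: int   # the id GitHub assigned to the record
    text: str        # the content that becomes the document text


def make_doc_id(item: GitHubItem) -> str:
    """Build a stable id from the record type and its GitHub id.

    The same record always maps to the same id, unlike a random id.
    The "github_<kind>_<id>" format is an assumption for illustration.
    """
    return f"github_{item.kind}_{item.github_id}"


def upsert(docstore: dict[str, str], item: GitHubItem) -> bool:
    """Store the item under its stable id.

    Returns True if the id was new, False if an existing entry
    was overwritten (so no duplicate entry is created).
    """
    doc_id = make_doc_id(item)
    is_new = doc_id not in docstore
    docstore[doc_id] = item.text
    return is_new
```

Because the id depends only on the record type and its GitHub id, re-running the pipeline on the same issue overwrites the existing entry rather than adding a second one. Including the type in the id is a guard in case ids from different record types overlap.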