AI-Northstar-Tech / vector-io

Comprehensive Vector Data Tooling. The universal interface for all vector databases, datasets, and RAG platforms. Easily export, import, back up, re-embed (using any model), or access your vector data from any vector database or repository.
https://vector-io.com
Apache License 2.0
216 stars, 27 forks

Create LanceDB index after table is created in import #80

Open dhruv-anand-aintech opened 6 months ago

dhruv-anand-aintech commented 6 months ago
Checklist
- [X] Modify `src/vdf_io/import_vdf/lancedb_import.py` ✓ https://github.com/AI-Northstar-Tech/vector-io/commit/f168003cd3994a1082afd1126b665682b0d852f8

sweep-ai[bot] commented 6 months ago

🚀 Here's the PR! #87





Step 1: 🔎 Searching

I found the following snippets in your repository. I will now analyze these snippets and come up with a plan.

Some code snippets I think are relevant, in decreasing order of relevance. If a file is missing from this list, you can mention its path in the ticket description.
https://github.com/AI-Northstar-Tech/vector-io/blob/9cec7fece241357cabdb153511b13c9c9236fb0a/src/vdf_io/import_vdf/lancedb_import.py#L1-L163
https://github.com/AI-Northstar-Tech/vector-io/blob/9cec7fece241357cabdb153511b13c9c9236fb0a/src/vdf_io/util.py#L1-L503

Step 2: ⌨️ Coding

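import pyarrow.parquet as pq
from tqdm import tqdm

# ID_COLUMN, parquet_files, table and new_index_name are defined by the
# surrounding import code in lancedb_import.py.

# Get the ID column from the parquet file schema,
# falling back to "id" when the schema has no ID_COLUMN field
parquet_schema = pq.read_schema(parquet_files[0])
id_column = ID_COLUMN if ID_COLUMN in parquet_schema.names else "id"

# Create a scalar index on the ID column once the table exists.
# LanceDB creates indexes through methods on the table object.
table.create_scalar_index(id_column)
tqdm.write(f"Created index on {id_column} for table {new_index_name}")

This code reads the schema of the first parquet file to determine the ID column: it uses ID_COLUMN when the schema contains a field with that name and falls back to "id" otherwise. It then calls create_scalar_index on the table object to build an index on that column, and logs the result.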
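
The column-selection and indexing step described above can be tried without a database. What follows is a minimal sketch, assuming the table object exposes a `create_scalar_index(column)` method as LanceDB tables do. The names `resolve_id_column`, `index_id_column`, and `RecordingTable` are hypothetical and exist only for illustration. They are not part of vector-io or LanceDB.

```python
from typing import Iterable, List


def resolve_id_column(field_names: Iterable[str], preferred: str, default: str = "id") -> str:
    """Return `preferred` if the schema contains it, otherwise `default`."""
    return preferred if preferred in set(field_names) else default


def index_id_column(table, field_names: Iterable[str], preferred: str) -> str:
    """Create a scalar index on the resolved ID column and return its name.

    Assumption: `table` has a create_scalar_index(column) method.
    """
    id_column = resolve_id_column(field_names, preferred)
    table.create_scalar_index(id_column)
    return id_column


class RecordingTable:
    """Hypothetical stand-in that records which columns were indexed."""

    def __init__(self) -> None:
        self.indexed: List[str] = []

    def create_scalar_index(self, column: str) -> None:
        self.indexed.append(column)


if __name__ == "__main__":
    fake_table = RecordingTable()
    chosen = index_id_column(fake_table, ["doc_id", "vector"], preferred="doc_id")
    print(f"Created index on {chosen}")
```

With a real table, `field_names` would come from the parquet schema (for example `pq.read_schema(path).names`). Note that the fallback does not check whether the default column exists, so a schema with neither name would fail at index creation.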


Step 3: 🔁 Code Review

I have finished reviewing the code for completeness. I did not find errors for sweep/create_lancedb_index_after_table_is_crea.



This is an automated message generated by Sweep AI.