Closed by dhruv-anand-aintech 7 months ago
Oh no, I think I know what the issue is here. Could you go to the Job History page in the setup for your medium_articles_2 dataset? You should see the umap command that errored. It would be helpful to know the full command it tried to execute.
info_map job 577ae365-c579-4fe7-bacf-11f0db68fa4c (error)

```
Running umap
ls-umap info_map undefined 25 0.1 --init=None
Loading environment variables from: /Users/dhruvanand/Code/vector-io/.env
loading embeddings
Traceback (most recent call last):
  File "/Users/dhruvanand/miniforge3/bin/ls-umap", line 8, in <module>
    sys.exit(main())
  File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/latentscope/scripts/umapper.py", line 29, in main
    umapper(args.dataset_id, args.embedding_id, args.neighbors, args.min_dist, save=args.save, init=args.init, align=args.align)
  File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/latentscope/scripts/umapper.py", line 49, in umapper
    with h5py.File(embedding_path, 'r') as f:
  File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/h5py/_hl/files.py", line 562, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/h5py/_hl/files.py", line 235, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 102, in h5py.h5f.open
FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = './info_map/embeddings/undefined.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
```
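The root cause visible in the log is that the job was launched with the literal string `undefined` as the embedding id (a missing JavaScript value serialized into the CLI args), so h5py tried to open `./info_map/embeddings/undefined.h5`. A minimal guard sketch for this failure mode, using a hypothetical helper (not latent-scope's actual code; only the `<dataset>/embeddings/<id>.h5` path shape is taken from the traceback):

```python
import os

def resolve_embedding_path(dataset_id: str, embedding_id: str) -> str:
    """Build the embedding file path, rejecting the 'undefined' id seen above."""
    if not embedding_id or embedding_id == "undefined":
        raise ValueError(f"invalid embedding_id: {embedding_id!r}")
    return os.path.join(".", dataset_id, "embeddings", f"{embedding_id}.h5")
```

Validating the id before handing it to h5py turns the confusing `FileNotFoundError` into an actionable message.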
Ah, thank you, I see the issue. I'm working on a fix and will release a patch shortly.
I've released 0.1.2 which fixes the issue you encountered. Do you mind upgrading and trying again?
As a first step towards supporting a more direct integration, I made a function in embed.py that allows you to create a latentscope embedding from a numpy array. I used your dataset as an example in this notebook:
https://github.com/enjalot/latent-scope/blob/main/notebooks/medium-articles.ipynb
It looks like the VDF file format would have all the parameters you'd need to call this function too.
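A sketch of that Python ingest path, for reference. This is hedged: the `import_embeddings` signature is the one quoted elsewhere in this thread (and requires latent-scope >= 0.1.3), while `ls.init` and `ls.ingest` follow the library's documented Python interface; treat the exact names and arguments as assumptions, and the DataFrame and vectors below as toy stand-ins for the real medium_articles data:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the medium_articles data and its precomputed
# OpenAI text-embedding-3-small vectors (1536-dimensional).
df = pd.DataFrame({"title": ["First article", "Second article"]})
embeddings = np.random.rand(len(df), 1536).astype(np.float32)

try:
    import latentscope as ls  # API as of latent-scope 0.1.3 (assumed)
    ls.init("~/latent-scope-data")  # directory where dataset folders live
    ls.ingest("medium_articles", df, text_column="title")
    ls.import_embeddings("medium_articles", embeddings,
                         text_column="title",
                         model_id="openai-text-embedding-3-small")
except ImportError:
    pass  # latent-scope not installed; the calls above are the sketch
```

After this, the dataset and embedding should be visible in the web UI's setup page for running umap and clustering.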
When I run the notebook, I'm getting:
```
AttributeError                            Traceback (most recent call last)
Cell In[15], line 1
----> 1 ls.import_embeddings("medium_articles", embeddings, text_column="title", model_id="openai-text-embedding-3-small")

AttributeError: module 'latentscope' has no attribute 'import_embeddings'
```
even after updating to 0.1.2
When I re-run via UI, I run into https://github.com/enjalot/latent-scope/issues/30 again
Sorry, I didn't include import_embeddings in v0.1.2, so I just pushed v0.1.3, which should add the function.
That line works in the python notebook now.
When I try to load the file in the Web UI, the UI crashes (without any error logs on server side).
There is some confusion in my setup: since I have imported files with the same name multiple times, it seems to open the same scope for all of them. It would be good to create a new scope for each new parquet file loaded in via the UI.
Renaming the parquet file and loading it in again works (Web UI loads).
It would be nice to have the existing vector columns (listed on top) as options in the embeddings menu below.
Side note: Making the panes resizable would be nice, as the plot is shown in a narrow section on my machine
Yeah, I'm planning to make the web UI much smarter, and checking for duplicate dataset names should be high on the list.
In order to use that Python line, the dataset needs to be loaded already, either via the web UI or via the Python interface. So you shouldn't try to upload the dataset via the web UI after you've loaded the embeddings; the notebook example shows the Python ingest process. After that you can open the web UI and go to the setup page for the dataset to run umap and clustering.
Internally, latentscope puts everything related to a dataset in a single folder. There can be multiple embeddings for one dataset, multiple umaps for each embedding, and so on.
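Concretely, the layout this implies can be sketched as below. Only the `<dataset>/embeddings/<embedding_id>.h5` shape is grounded in the traceback path (`./info_map/embeddings/undefined.h5`) earlier in this thread; the data directory and the `umaps`/`clusters` subfolder names are illustrative assumptions:

```python
import os

# Hedged sketch of the per-dataset folder layout described above.
data_dir = os.path.expanduser("~/latent-scope-data")  # assumed root
dataset_id = "medium_articles"
# Confirmed shape from the traceback: <dataset>/embeddings/<embedding_id>.h5
embedding_path = os.path.join(data_dir, dataset_id, "embeddings", "emb-001.h5")
umap_dir = os.path.join(data_dir, dataset_id, "umaps")        # hypothetical
cluster_dir = os.path.join(data_dir, dataset_id, "clusters")  # hypothetical
```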
On Tue, Mar 5, 2024, 12:44 PM Dhruv Anand wrote:

> There is some confusion on my setup, since I have imported files with the same name multiple times, and it seems to open the same scope for them. Would be good to create a new scope for each new parquet file loaded in via UI
I'm going to close this issue as I've captured the new ideas in #34 and vector-io compatibility in #32 and we fixed the original bugs you ran into. thank you!
Hi @enjalot,
I'm working on a project called Vector-io (https://github.com/AI-Northstar-Tech/vector-io), which allows people to port their vector datasets across various vector DBs and store snapshots on disk in a simple format called VDF (parquet files plus a metadata JSON file).
I would love to integrate latentscope as a way to visualize the vectors that people have stored in their dataset.
I'm linking to the issue I have in my repo for the integration: https://github.com/AI-Northstar-Tech/vector-io/issues/61.
I wanted to start by asking for help on a bug that I faced while using the web UI to load data from a parquet file in an example dataset I have: https://huggingface.co/datasets/aintech/vdf_20240125_130746_ac5a6_medium_articles/blob/main/medium_articles/medium_articles_2.parquet
I was able to complete the embedding step (though I plan to integrate with the new functionality you're planning that lets people use existing vectors), but for the clustering step I got this error: