enjalot / latent-scope

A scientific instrument for investigating latent spaces
MIT License

Integrate latentscope into Vector-io #29

Closed dhruv-anand-aintech closed 7 months ago

dhruv-anand-aintech commented 8 months ago

Hi @enjalot,

I'm working on a project called Vector-io (https://github.com/AI-Northstar-Tech/vector-io), which lets people port their vector datasets across various vector DBs and store snapshots on disk in a simple format called VDF (Parquet files plus a metadata JSON file).

I would love to integrate latentscope as a way to visualize the vectors that people have stored in their dataset.

I'm linking to the issue I have in my repo for the integration: https://github.com/AI-Northstar-Tech/vector-io/issues/61.

I wanted to start by asking for help with a bug I hit while using the web UI to load data from a parquet file in an example dataset of mine: https://huggingface.co/datasets/aintech/vdf_20240125_130746_ac5a6_medium_articles/blob/main/medium_articles/medium_articles_2.parquet

I was able to complete the embedding step (though I plan to integrate with the new functionality you're planning for letting people use existing vectors), but the clustering step failed with this error:

Loading environment variables from: /Users/dhruvanand/Code/vector-io/.env
loading embeddings
RUNNING: umap-001
loading umap None
Traceback (most recent call last):
File "/Users/dhruvanand/miniforge3/bin/ls-umap", line 8, in <module>
sys.exit(main())
File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/latentscope/scripts/umapper.py", line 29, in main
umapper(args.dataset_id, args.embedding_id, args.neighbors, args.min_dist, save=args.save, init=args.init, align=args.align)
File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/latentscope/scripts/umapper.py", line 153, in umapper
initial_df = pd.read_parquet(os.path.join(umap_dir, f"{init}.parquet"))
File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/pandas/io/parquet.py", line 670, in read_parquet
return impl.read(
File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/pandas/io/parquet.py", line 265, in read
path_or_handle, handles, filesystem = _get_path_or_handle(
File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/pandas/io/parquet.py", line 139, in _get_path_or_handle
handles = get_handle(
File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/pandas/io/common.py", line 872, in get_handle
handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: './medium_articles_2/umaps/None.parquet'
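The path in that last line is the tell: the init argument arrived as None and was interpolated directly into the filename. A minimal sketch of the failure mode (the variable names mirror the traceback, but this is an illustration, not the actual umapper.py source):

```python
import os

# If the web UI does not supply an initial umap, `init` can arrive as None
# and gets formatted straight into the filename instead of being skipped.
umap_dir = "./medium_articles_2/umaps"
init = None  # no initial umap selected

path = os.path.join(umap_dir, f"{init}.parquet")
print(path)  # ends with "None.parquet", the nonexistent file in the traceback
```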
enjalot commented 8 months ago

Oh no, I think I know what the issue is here. Could you go to the Job History page on the setup page for your medium_articles_2 dataset? You should see the umap command that errored; it would be helpful to know the full command it tried to execute.

dhruv-anand-aintech commented 8 months ago
info_map job 577ae365-c579-4fe7-bacf-11f0db68fa4c
error
Running umap
ls-umap info_map undefined 25 0.1 --init=None
Loading environment variables from: /Users/dhruvanand/Code/vector-io/.env
loading embeddings
Traceback (most recent call last):
File "/Users/dhruvanand/miniforge3/bin/ls-umap", line 8, in <module>
sys.exit(main())
File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/latentscope/scripts/umapper.py", line 29, in main
umapper(args.dataset_id, args.embedding_id, args.neighbors, args.min_dist, save=args.save, init=args.init, align=args.align)
File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/latentscope/scripts/umapper.py", line 49, in umapper
with h5py.File(embedding_path, 'r') as f:
File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/h5py/_hl/files.py", line 562, in __init__
fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/h5py/_hl/files.py", line 235, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 102, in h5py.h5f.open
FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = './info_map/embeddings/undefined.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
enjalot commented 8 months ago

Ah, thank you, I see the issue. I'm working on a fix and will release a patch shortly.

enjalot commented 8 months ago

I've released 0.1.2 which fixes the issue you encountered. Do you mind upgrading and trying again?

enjalot commented 8 months ago

As a first step toward supporting a more direct integration, I added a function in embed.py that lets you create a latentscope embedding from a numpy array. I used your dataset as an example in this notebook: https://github.com/enjalot/latent-scope/blob/main/notebooks/medium-articles.ipynb

It looks like the VDF file format has all the parameters you'd need to call this function, too.
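For the Vector-io side, a rough sketch of what a caller might do; the import_embeddings call mirrors the notebook linked above (commented out here since it needs a configured latentscope data directory), and the array shape, dataset id, and model id are illustrative assumptions:

```python
import numpy as np
# import latentscope as ls  # requires latentscope >= 0.1.3

# Hypothetical VDF snapshot: 1000 documents embedded with a 1536-dim model.
embeddings = np.random.rand(1000, 1536).astype(np.float32)

# Sanity checks worth running before handing the array to latentscope:
assert embeddings.ndim == 2            # one row per document
assert not np.isnan(embeddings).any()  # no missing vectors

# The actual import, as shown in the notebook:
# ls.import_embeddings("medium_articles", embeddings,
#                      text_column="title",
#                      model_id="openai-text-embedding-3-small")
```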

dhruv-anand-aintech commented 8 months ago

When I run the notebook, I'm getting:

AttributeError                            Traceback (most recent call last)
Cell In[15], line 1
----> 1 ls.import_embeddings("medium_articles", embeddings, text_column="title", model_id="openai-text-embedding-3-small")

AttributeError: module 'latentscope' has no attribute 'import_embeddings'

even after updating to 0.1.2

dhruv-anand-aintech commented 8 months ago

When I re-run via UI, I run into https://github.com/enjalot/latent-scope/issues/30 again

enjalot commented 8 months ago

Sorry, I didn't include import_embeddings in v0.1.2, so I just pushed v0.1.3, which adds the function.

dhruv-anand-aintech commented 8 months ago

That line works in the Python notebook now.

When I try to load the file in the web UI, the UI crashes (with no error logs on the server side).

dhruv-anand-aintech commented 8 months ago

There is some confusion in my setup: I have imported files with the same name multiple times, and it seems to open the same scope for each of them. It would be good to create a new scope for each new parquet file loaded in via the UI.

Renaming the parquet file and loading it again works (the web UI loads).

It would be nice to have the existing vector columns (listed at the top) as options in the embeddings menu below.

Side note: making the panes resizable would be nice, as the plot is shown in a narrow section on my machine.

enjalot commented 8 months ago

Yeah, I'm planning to make the web UI much smarter, and checking for duplicate dataset names should be high on the list.

In order to use that Python line you need to have loaded the dataset already, either via the web UI or via the Python interface, so you shouldn't try to upload the dataset via the web UI after you've loaded the embeddings. The notebook example shows the Python ingest process. After that you can open the web UI and go to the setup page for the dataset to run UMAP and clustering.

Internally, latentscope puts everything related to a dataset in a single folder. There can be multiple embeddings for one dataset, multiple umaps for each embedding, and so on.
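Sketching the per-dataset layout implied by the paths in the tracebacks earlier in this thread (the ids here are made up; latentscope assigns its own):

```python
import os

# Illustrative folder layout only, reconstructed from the error paths above.
dataset = "medium_articles_2"
layout = [
    os.path.join(dataset, "embeddings", "embedding-001.h5"),  # one .h5 per embedding
    os.path.join(dataset, "umaps", "umap-001.parquet"),       # one .parquet per umap
]
for path in layout:
    print(path)
```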


enjalot commented 7 months ago

I'm going to close this issue: I've captured the new ideas in #34 and vector-io compatibility in #32, and we fixed the original bugs you ran into. Thank you!