biorack / blink


Every time I try "conda env create -f environment.yml", it stops at "solving environment" (see picture) and finally fails; I hit the same problem last year and could not try blink, even with all dependencies installed before running the environment.yml #7

Open YUANMENG-1 opened 8 months ago

YUANMENG-1 commented 8 months ago
Excuse me, every time I try "conda env create -f environment.yml", it stops at "solving environment" (shown in the picture below) and eventually fails to continue. I remember meeting this same problem last year and failing to try blink then, and I had installed all the dependencies before running the environment.yml.

This happens on both HPC and my Mac.

[screenshot: conda stuck at the "Solving environment" step]

Originally posted by @YUANMENG-1 in https://github.com/biorack/blink/issues/4#issuecomment-1978969929

tharwood3 commented 8 months ago

Hi @YUANMENG-1. I'm able to get the environment to solve on my machine, but it does indeed take a long time. I'll look into improvements I can make to the environment.yml to fix this issue. In the meantime, maybe try using Mamba? It is a drop-in replacement for Conda and is typically much faster at building environments. I gave it a try with blink's environment and it finished in just a few minutes:

mamba env create -f environment.yml
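If you don't already have Mamba, it can usually be installed into your base environment from conda-forge:

conda install -n base -c conda-forge mamba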

Let me know how it goes and if you need any additional help getting BLINK going.

YUANMENG-1 commented 8 months ago

Thanks for your suggestions; I was able to build the blink environment with mamba:

[screenshot: mamba environment created successfully]
YUANMENG-1 commented 8 months ago


Sorry, I've run into another problem.

When running the demo data:

python3 -m blink.blink_cli ./example/accuracy_test_data/small.mgf ./example/accuracy_test_data/medium.mgf ./blink2out.csv ./models/positive_random_forest.pickle ./models/negative_random_forest.pickle positive --min_predict 0.01 --mass_diffs 0 14.0157 12.000 15.9949 2.01565 27.9949 26.0157 18.0106 30.0106 42.0106 1.9792 17.00284 24.000 13.97925 1.00794 40.0313

The run output seems to say that using a model trained with scikit-learn version 1.0.2 and loading it with version 1.4.1.post1 might lead to unexpected behavior or invalid results.

use "conda install scikit-learn=1.0.2" can solve this:

INFO:root:Processing small.mgf
INFO:root:Processing medium.mgf
/public/home/yuanmy/Data/miniconda3/envs/spectral_lib_matcher/lib/python3.10/site-packages/sklearn/base.py:376: InconsistentVersionWarning: Trying to unpickle estimator DecisionTreeRegressor from version 1.0.2 when using version 1.4.1.post1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to: https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(
Traceback (most recent call last):
  File "/public/home/yuanmy/Data/miniconda3/envs/spectral_lib_matcher/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/public/home/yuanmy/Data/miniconda3/envs/spectral_lib_matcher/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/public/home/yuanmy/blink/blink/blink_cli.py", line 5, in <module>
    main()
  File "/public/home/yuanmy/blink/blink/blink.py", line 279, in main
    regressor = pickle.load(out)
  File "sklearn/tree/_tree.pyx", line 865, in sklearn.tree._tree.Tree.__setstate__
  File "sklearn/tree/_tree.pyx", line 1571, in sklearn.tree._tree._check_node_ndarray
ValueError: node array from the pickle has an incompatible dtype:
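A quick way to confirm the downgrade took effect, assuming the same environment is active:

python3 -c "import sklearn; print(sklearn.__version__)"

This should print 1.0.2, the version the models were trained with.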

YUANMENG-1 commented 8 months ago


Sorry again, a new problem occurred this time, which seems to be "cannot find the charge column":

python3 -m blink.blink_cli ./example/accuracy_test_data/small.mgf ./example/accuracy_test_data/medium.mgf ./blink2out.csv ./models/positive_random_forest.pickle ./models/negative_random_forest.pickle positive --min_predict 0.01 --mass_diffs 0 14.0157 12.000 15.9949 2.01565 27.9949 26.0157 18.0106 30.0106 42.0106 1.9792 17.00284 24.000 13.97925 1.00794 40.0313

INFO:root:Processing small.mgf
INFO:root:Processing medium.mgf
INFO:root:Input files read time: 2.8513012938201427 seconds, 7010 spectra
INFO:root:Discretization time: 2.101361535489559 seconds, 7010 spectra
INFO:root:Scoring time: 16.316090885549784 seconds, 6010000 comparisons
INFO:root:Prediction time: 22.110017083585262 seconds, 6010000 comparisons
Traceback (most recent call last):
  File "/public/home/yuanmy/Data/miniconda3/envs/spectral_lib_matcher/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'charge'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/public/home/yuanmy/Data/miniconda3/envs/spectral_lib_matcher/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/public/home/yuanmy/Data/miniconda3/envs/spectral_lib_matcher/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/public/home/yuanmy/blink/blink/blink_cli.py", line 5, in <module>
    main()
  File "/public/home/yuanmy/blink/blink/blink.py", line 334, in main
    output = pd.merge(output, query_df['charge'], left_on='query', right_index=True)
  File "/public/home/yuanmy/Data/miniconda3/envs/spectral_lib_matcher/lib/python3.10/site-packages/pandas/core/frame.py", line 4090, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/public/home/yuanmy/Data/miniconda3/envs/spectral_lib_matcher/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 'charge'

tharwood3 commented 8 months ago

Hi @YUANMENG-1. Sorry you are having issues with the command line implementation. The CLI is still under active development and has some experimental features that are currently outside the scope of what was published in the BLINK paper. For standard use, we recommend following the examples in the tutorial notebook.

YUANMENG-1 commented 8 months ago


Sorry again to bother you. I wanted to compare two mgf files by adapting your mzML-vs-mgf tutorial (https://github.com/biorack/blink/blob/main/tutorial/blink_tutorial.ipynb), but the results don't look right (results file uploaded):

output_test.csv

The problems in the output file are as follows:

  1. The "query" and "ref" columns (columns 3 and 4) both contain titles from the query file, while columns 6 and 10 appear to contain the titles of query and ref. Which ones are correct? How can I determine whether the two files were compared with each other, or the query was compared against itself?
  2. spectrum_query and spectrum_ref do not look right, as they seem to contain only intensity information (I understood spectrum_ref to hold both intensity and m/z?).
  3. Are the above errors because I used your protocol incorrectly? Since the upstream step usually produces a merged .mgf file, I want to compare the two mgf files by adapting your tutorial, as follows:

import sys
import blink
import pandas as pd

mgf_query = blink.open_msms_file('../SpectralEntropy-master/neg_slaw_modi7.mgf')
mgf_ref = blink.open_msms_file('../SpectralEntropy-master/all_gnps_neg.mgf')

discretized_spectra = blink.discretize_spectra(mgf_ref.spectrum.tolist(), mgf_query.spectrum.tolist(),
                                               mgf_ref.precursor_mz.tolist(), mgf_query.precursor_mz.tolist(),
                                               bin_width=0.001, tolerance=0.01, intensity_power=0.5,
                                               trim_empty=False, remove_duplicates=False, network_score=False)

%%time
S12 = blink.score_sparse_spectra(discretized_spectra)
S12['mzi']
S12['mzc']

filtered_S12 = blink.filter_hits(S12, min_matches=5, override_matches=20, min_score=0.6)
m = blink.reformat_score_matrix(filtered_S12)
df = blink.make_output_df(m)
df = df.sparse.to_dense()
df = pd.merge(df, mgf_ref.add_suffix("_ref"), left_on="ref", right_index=True)
df = pd.merge(df, mgf_query.add_suffix("_query"), left_on="query", right_index=True)

df.to_csv('output_test.csv', index=False)

tharwood3 commented 8 months ago

No problem, I'm happy to help. Your code looks good; it seems like the issue is in merging the metadata. I think what you need to do is switch the order of your mgf_query and mgf_ref in blink.discretize_spectra() and try it again. The "query" column in the output dataframe always corresponds to indices of the first set of spectra, while the "ref" column holds the indices of the second set. This is easy to mix up; I do need to improve the documentation on how this works. I recently fixed the same issue in the tutorial notebook itself, so check out the most recent version if you have an older one.
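Concretely, that would look something like this (same parameters as your snippet, just with the query set passed first so the "query" column indexes mgf_query):

discretized_spectra = blink.discretize_spectra(mgf_query.spectrum.tolist(), mgf_ref.spectrum.tolist(),
                                               mgf_query.precursor_mz.tolist(), mgf_ref.precursor_mz.tolist(),
                                               bin_width=0.001, tolerance=0.01, intensity_power=0.5,
                                               trim_empty=False, remove_duplicates=False, network_score=False)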

As far as your other question about the spectra, those look okay to me. It is less obvious when the arrays are converted to strings as a saved csv, but each spectrum is modeled as an array of two arrays. The first array is for m/z, and the second array is for their intensities. For instance, the first "spectrum_ref" entry is the following:

[[54.1031, 66.751411, 68.216377, 116.144203, 123.719292, 136.225159, 149.481293, 149.565704, 150.117798, 280.409973], [1980., 2953., 2763., 2169., 2030., 2355., 2194., 2409., 2400., 2254.]]

The first list holds the m/z values, and the second the intensities. Hopefully this helps!
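Note that within Python (before writing the csv), each entry should unpack directly:

mzs, intensities = mgf_ref.spectrum.iloc[0]  # first array: m/z values, second array: intensities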

YUANMENG-1 commented 8 months ago

I followed your suggestion to "switch the order of your mgf_query and mgf_ref in blink.discretize_spectra() and try it again".

It now appears to get stuck while writing a much larger table, and the output file is far bigger than when the blink.discretize_spectra() order had not been swapped.

Does the argument order in blink.discretize_spectra() determine the dimensions of the computed sparse matrix? If so, should the file with the larger number of spectra be placed first, whether it is the database mgf or the query mgf? Also, is there a way to filter the output file so that it does not include the m/z and intensity arrays of query and ref?

[screenshot: run stuck while writing the larger output file]
tharwood3 commented 8 months ago

My guess is that the reason your output is so much larger is that the metadata is now being associated correctly, though there could be something else going on. This appears to be a pretty big comparison, so the pd.merge adds a lot of extra content to the output dataframe. The order of the spectra in the discretize_spectra function shouldn't change the size of the score matrix. The algorithm is more efficient when the smaller set of spectra is first, but it shouldn't make a huge difference (query is typically smaller than ref).

This is more of a pandas question than a blink question; however, I can give you some suggestions. If you want to decrease the size of your outputs, you can filter the output dataframe by score or number of matches before adding metadata. If you already filtered on those, then you can choose to only merge in essential metadata instead of everything read from the mgf files. For instance:

df = pd.merge(df, mgf_ref[['pepmass', 'title']].add_suffix("_ref"), left_on="ref", right_index=True)
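If the spectrum arrays themselves aren't needed downstream, dropping them before saving also shrinks the file considerably (a small sketch using the column names from your output):

slim = df.drop(columns=['spectrum_ref', 'spectrum_query'], errors='ignore')
slim.to_csv('output_test_slim.csv', index=False)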

If I/O and file size are a concern, maybe look into using parquet files or similar, rather than csv. Good luck!
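For instance, pandas can write parquet directly, assuming pyarrow or fastparquet is installed in the environment:

df.to_parquet('output_test.parquet')  # binary columnar format; smaller and faster to read back than csv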