Open YUANMENG-1 opened 8 months ago
Hi @YUANMENG-1. I'm able to get the environment to solve on my machine, but it does indeed take a long time. I'll look into improvements I can make to the environment.yml to fix this issue. In the meantime, maybe try using Mamba? It is a drop-in replacement for Conda and is typically much faster at building environments. I gave it a try with blink's environment and it finished in just a few minutes:
mamba env create -f environment.yml
Let me know how it goes and if you need any additional help getting BLINK going.
Thanks for your suggestion. I was able to install BLINK with mamba.
Sorry, I ran into another problem when running the demo data:
python3 -m blink.blink_cli ./example/accuracy_test_data/small.mgf ./example/accuracy_test_data/medium.mgf ./blink2out.csv ./models/positive_random_forest.pickle ./models/negative_random_forest.pickle positive --min_predict 0.01 --mass_diffs 0 14.0157 12.000 15.9949 2.01565 27.9949 26.0157 18.0106 30.0106 42.0106 1.9792 17.00284 24.000 13.97925 1.00794 40.0313
The output warns that a model trained with scikit-learn version 1.0.2 is being loaded with version 1.4.1.post1, which might lead to unexpected behavior or invalid results.
Installing the matching version with "conda install scikit-learn=1.0.2" can solve this. The output was:
INFO:root:Processing small.mgf
INFO:root:Processing medium.mgf
/public/home/yuanmy/Data/miniconda3/envs/spectral_lib_matcher/lib/python3.10/site-packages/sklearn/base.py:376: InconsistentVersionWarning: Trying to unpickle estimator DecisionTreeRegressor from version 1.0.2 when using version 1.4.1.post1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
warnings.warn(
Traceback (most recent call last):
File "/public/home/yuanmy/Data/miniconda3/envs/spectral_lib_matcher/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/public/home/yuanmy/Data/miniconda3/envs/spectral_lib_matcher/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/public/home/yuanmy/blink/blink/blink_cli.py", line 5, in
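A version mismatch like the one in the warning above is easy to miss among other log lines. The following is a minimal sketch, not BLINK's own code, of one way to make such a load fail loudly: it runs any loader with Python warnings escalated to exceptions. The helper name call_strict and the stand-in noisy_loader are hypothetical, and how BLINK actually loads its model files is not shown in this thread.

```python
import warnings


def call_strict(loader, *args, **kwargs):
    """Call loader(*args, **kwargs), turning any warning it emits into an exception."""
    with warnings.catch_warnings():
        # Inside this block every warning is raised instead of printed.
        warnings.simplefilter("error")
        return loader(*args, **kwargs)


def noisy_loader():
    # Hypothetical stand-in for unpickling a model saved by another library version.
    warnings.warn("estimator was saved with a different version", UserWarning)
    return "model"


try:
    call_strict(noisy_loader)
    outcome = "loaded"
except UserWarning as exc:
    outcome = f"refused: {exc}"

print(outcome)
```

With a real model file, the loader passed in would be whatever function opens and unpickles the file. Pinning scikit-learn to the version the model was trained with, as suggested above, remains the actual fix.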
Sorry again, this time some new problems occurred, which may be that the 'charge' column cannot be found:
python3 -m blink.blink_cli ./example/accuracy_test_data/small.mgf ./example/accuracy_test_data/medium.mgf ./blink2out.csv ./models/positive_random_forest.pickle ./models/negative_random_forest.pickle positive --min_predict 0.01 --mass_diffs 0 14.0157 12.000 15.9949 2.01565 27.9949 26.0157 18.0106 30.0106 42.0106 1.9792 17.00284 24.000 13.97925 1.00794 40.0313
INFO:root:Processing small.mgf
INFO:root:Processing medium.mgf
INFO:root:Input files read time: 2.8513012938201427 seconds, 7010 spectra
INFO:root:Discretization time: 2.101361535489559 seconds, 7010 spectra
INFO:root:Scoring time: 16.316090885549784 seconds, 6010000 comparisons
INFO:root:Prediction time: 22.110017083585262 seconds, 6010000 comparisons
Traceback (most recent call last):
File "/public/home/yuanmy/Data/miniconda3/envs/spectral_lib_matcher/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
return self._engine.get_loc(casted_key)
File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'charge'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/public/home/yuanmy/Data/miniconda3/envs/spectral_lib_matcher/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/public/home/yuanmy/Data/miniconda3/envs/spectral_lib_matcher/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/public/home/yuanmy/blink/blink/blink_cli.py", line 5, in
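The KeyError above means some table was indexed with a 'charge' column it does not have. Which table that is cannot be read from the truncated traceback, so the following is only a sketch of a generic pre-check on a pandas DataFrame. The helper missing_columns and the toy table are hypothetical and are not part of BLINK.

```python
import pandas as pd


def missing_columns(df, required):
    """Return the required column names that are absent from df, in the given order."""
    return [name for name in required if name not in df.columns]


# Toy metadata table standing in for spectra read from an mgf file.
spectra = pd.DataFrame({"precursor_mz": [180.06, 255.23], "title": ["a", "b"]})

absent = missing_columns(spectra, ["precursor_mz", "charge"])
if absent:
    print(f"Input is missing metadata columns: {absent}")
```

A check like this reports the missing field by name before any scoring work is done, instead of failing after the prediction step.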
Hi @YUANMENG-1. Sorry you are having issues with the command line implementation. The CLI is still under active development and has some experimental features that are currently outside the scope of what was published in the BLINK paper. For standard use, we recommend following the examples in the tutorial notebook.
Sorry to bother you again. I tried to compare two mgf files by imitating the mzML-versus-mgf comparison in your tutorial: https://github.com/biorack/blink/blob/main/tutorial/blink_tutorial.ipynb The results don't look right (results file uploaded).
The problems in the output file are as follows:
import sys
import blink
import pandas as pd

mgf_query = blink.open_msms_file('../SpectralEntropy-master/neg_slaw_modi7.mgf')
mgf_ref = blink.open_msms_file('../SpectralEntropy-master/all_gnps_neg.mgf')

discretized_spectra = blink.discretize_spectra(mgf_ref.spectrum.tolist(), mgf_query.spectrum.tolist(),
                                               mgf_ref.precursor_mz.tolist(), mgf_query.precursor_mz.tolist(),
                                               bin_width=0.001, tolerance=0.01, intensity_power=0.5,
                                               trim_empty=False, remove_duplicates=False, network_score=False)

%%time
S12 = blink.score_sparse_spectra(discretized_spectra)
S12['mzi']
S12['mzc']

filtered_S12 = blink.filter_hits(S12, min_matches=5, override_matches=20, min_score=0.6)
m = blink.reformat_score_matrix(filtered_S12)
df = blink.make_output_df(m)
df = df.sparse.to_dense()
df = pd.merge(df, mgf_ref.add_suffix("_ref"), left_on="ref", right_index=True)
df = pd.merge(df, mgf_query.add_suffix("_query"), left_on="query", right_index=True)

df.to_csv('output_test.csv', index=False)
No problem, I'm happy to help. Your code looks good; the issue seems to be in merging in the metadata. I think what you need to do is switch the order of your mgf_query and mgf_ref in blink.discretize_spectra() and try it again. The "query" column in the output dataframe always corresponds to indices of the first set of spectra, while the "ref" column holds the indices of the second set of spectra. This is easy to mix up, and I do need to improve the documentation on how this works. I recently fixed the same issue in the tutorial notebook itself, so check out the most recent version if you have an older one.
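To see why the mix-up is silent, here is a small sketch with toy pandas tables only; it makes no BLINK calls, and the score table, its "score" column and the titles are made up for illustration. The "query" and "ref" columns are treated as row positions into two metadata tables, as described above.

```python
import pandas as pd

# Toy score table: 'query' and 'ref' are row positions in the two metadata tables.
scores = pd.DataFrame({"query": [0, 1, 1], "ref": [2, 0, 3], "score": [0.9, 0.7, 0.8]})

query_meta = pd.DataFrame({"title": ["q0", "q1"]})
ref_meta = pd.DataFrame({"title": ["r0", "r1", "r2", "r3"]})

# Correct association: each index column is merged against its own metadata table.
annotated = scores.merge(ref_meta.add_suffix("_ref"), left_on="ref", right_index=True)
annotated = annotated.merge(query_meta.add_suffix("_query"), left_on="query", right_index=True)

# Swapped association: the merge still succeeds, but the titles belong to the wrong set.
swapped = scores.merge(ref_meta.add_suffix("_query"), left_on="query", right_index=True)

print(annotated[["score", "title_query", "title_ref"]])
print(swapped[["score", "title_query"]])
```

Because positions 0 and 1 exist in both tables, the swapped merge raises no error and simply attaches reference titles to the query column, which matches the kind of plausible-looking but wrong output discussed here.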
As far as your other question about the spectra, those look okay to me. It is less obvious when the arrays are converted to strings as a saved csv, but each spectrum is modeled as an array of two arrays. The first array is for m/z, and the second array is for their intensities. For instance, the first "spectrum_ref" entry is the following:
[[54.1031, 66.751411, 68.216377, 116.144203, 123.719292, 136.225159, 149.481293, 149.565704, 150.117798, 280.409973], [1980., 2953., 2763., 2169., 2030., 2355., 2194, 2409, 2400, 2254]]
The first list holds the m/z values, and the second holds the corresponding intensities. Hopefully this helps!
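As a small illustration of that layout, the sketch below uses plain Python lists with the values from the entry quoted above, pairs each m/z value with its intensity, and looks up the most intense peak. The variable names are chosen for this example only.

```python
# One spectrum: first list is m/z, second list is the matching intensities.
spectrum = [
    [54.1031, 66.751411, 68.216377, 116.144203, 123.719292,
     136.225159, 149.481293, 149.565704, 150.117798, 280.409973],
    [1980.0, 2953.0, 2763.0, 2169.0, 2030.0,
     2355.0, 2194.0, 2409.0, 2400.0, 2254.0],
]

mz_values, intensities = spectrum

# Pair the two lists position by position to get (m/z, intensity) peaks.
peaks = list(zip(mz_values, intensities))

# The base peak is the peak with the highest intensity.
base_peak_mz, base_peak_intensity = max(peaks, key=lambda peak: peak[1])

print(len(peaks), base_peak_mz, base_peak_intensity)
```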
I followed your suggestion to "switch the order of your mgf_query and mgf_ref in blink.discretize_spectra() and try it again".
It appears to get stuck writing larger tables, and the output files are much larger than they were before the order in blink.discretize_spectra() was swapped.
Does the order in blink.discretize_spectra() determine the side lengths of the computed sparse matrix, so that, whether it is the database mgf or the query mgf, the file with more spectra should be placed first? Is there a way to write the output file without the m/z and intensity values of query and ref?
My guess is that the reason your output is so much larger is that the metadata is now being associated correctly, though there could be something else going on. This appears to be a pretty big comparison, so the pd.merge adds a lot of extra content to the output dataframe. The order of the spectra in the discretize_spectra function shouldn't change the size of the score matrix. The algorithm is more efficient when the smaller set of spectra is first, but it shouldn't make a huge difference (query is typically smaller than ref).
This is more of a pandas question than a blink question; however, I can give you some suggestions. If you want to decrease the size of your outputs, you can filter the output dataframe by score or number of matches before adding metadata. If you already filtered those, then you can choose to only associate essential metadata instead of everything read from the mgf files with the merge. For instance:
df = pd.merge(df, mgf_ref[['pepmass', 'title']].add_suffix("_ref"), left_on="ref", right_index=True)
If I/O and file size are a concern, maybe look into using parquet files or similar, rather than csv. Good luck!
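The two suggestions above, filtering before the merge and merging only selected metadata, can be combined in a few lines. This is a sketch on toy data: the helper annotate_hits, the "score" column name and the table contents are assumptions for illustration, while 'pepmass' and 'title' are the columns named in the merge shown above.

```python
import pandas as pd


def annotate_hits(scores, ref_meta, min_score, keep_columns):
    """Drop weak hits first, then attach only the chosen reference metadata columns."""
    strong = scores[scores["score"] >= min_score]
    slim_meta = ref_meta[keep_columns].add_suffix("_ref")
    return strong.merge(slim_meta, left_on="ref", right_index=True)


# Toy score table and reference metadata; 'spectrum' stands in for the bulky peak lists.
scores = pd.DataFrame({"query": [0, 0, 1], "ref": [0, 1, 1], "score": [0.95, 0.40, 0.72]})
ref_meta = pd.DataFrame({
    "pepmass": [180.06, 255.23],
    "title": ["ref A", "ref B"],
    "spectrum": ["<large peak list>", "<large peak list>"],
})

slim = annotate_hits(scores, ref_meta, min_score=0.6, keep_columns=["pepmass", "title"])
print(slim.columns.tolist())
```

Leaving the spectrum columns out of the merge is what removes the m/z and intensity arrays from the saved file, and filtering first keeps the merge itself small.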
I ran into this both on an HPC cluster and on a Mac.
Originally posted by @YUANMENG-1 in https://github.com/biorack/blink/issues/4#issuecomment-1978969929