MannLabs / alphapeptdeep

Deep learning framework for proteomics
Apache License 2.0
108 stars 20 forks source link

normalize_fragment_intensities not normalizing intensities #102

Closed yangkl96 closed 1 year ago

yangkl96 commented 1 year ago

Describe the bug On my local machine, training losses looked regular. When I used a linux server, the training losses were huge (see attached screenshot). I pinned it down to this line in peptdeep.model.ms2.normalize_fragment_intensities: frag_intensity_df.values[frag_start_idx:frag_stop_idx,:] = intens

The use of values instead of iloc did not update the frag_intensity_df. Could I recommend iloc here, along with any other place where values are updated in a pandas dataframe?

Screenshots image

Version (please complete the following information): On local machine 2023-05-21 19:09:47> Python information: 2023-05-21 19:09:47> alphabase - 1.0.2 2023-05-21 19:09:47> biopython - 1.81 2023-05-21 19:09:47> click - 8.1.3 2023-05-21 19:09:47> lxml - 4.9.1 2023-05-21 19:09:47> numba - 0.57.0 2023-05-21 19:09:47> numpy - 1.24.3 2023-05-21 19:09:47> pandas - 1.5.3 2023-05-21 19:09:47> peptdeep - 1.0.2 2023-05-21 19:09:47> psutil - 5.9.0 2023-05-21 19:09:47> python - 3.9.16 2023-05-21 19:09:47> scikit-learn - 1.2.2 2023-05-21 19:09:47> streamlit - 1.20.0 2023-05-21 19:09:47> streamlit-aggrid - 0.3.4.post3 2023-05-21 19:09:47> torch - 1.12.1 2023-05-21 19:09:47> tqdm - 4.65.0 2023-05-21 19:09:47> transformers - 4.24.0

On Linux server 2023-05-21 19:32:42> Python information: 2023-05-21 19:32:42> alphabase - 1.0.2 2023-05-21 19:32:42> biopython - 1.79 2023-05-21 19:32:42> click - 8.1.3 2023-05-21 19:32:42> lxml - 4.9.2 2023-05-21 19:32:42> numba - 0.53.0 2023-05-21 19:32:42> numpy - 1.20.3 2023-05-21 19:32:42> pandas - 2.0.1 2023-05-21 19:32:42> peptdeep - 1.0.2 2023-05-21 19:32:42> psutil - 5.9.5 2023-05-21 19:32:42> python - 3.9.12 2023-05-21 19:32:42> scikit-learn - 1.2.2 2023-05-21 19:32:42> streamlit - 1.12.0 2023-05-21 19:32:42> streamlit-aggrid - 0.3.4.post3 2023-05-21 19:32:42> torch - 1.12.1 2023-05-21 19:32:42> tqdm - 4.65.0 2023-05-21 19:32:42> transformers - 4.23.1

jalew188 commented 1 year ago

Yes, for small dataframes, .iloc is more accurate than .values. I will change it accordingly.

yangkl96 commented 1 year ago

I might have a speedup for this function. I will test and post the results later

yangkl96 commented 1 year ago

image I found this to be a good compromise. On my local desktop, the original code 0.88 sec for this function; changing values to iloc took 2.5 sec; the code in the image above took 2.5 sec. On the server, the original code did not properly update frag_intensity_df; changing values to iloc took 347 sec; the code in the image above took 12.5 sec.

I think differences in the server architecture lead to these differences. I am ok with just changing values to iloc, using the code posted above, or finding someway to ensure that frag_intensity_df.values returns a view and not a copy of the dataframe.

jalew188 commented 1 year ago

It looks good, can you make a PR with it? @yangkl96

jalew188 commented 1 year ago

We cannot ensure .values or .to_numpy() returns a view, this is the key issue. So maybe using your code above to make it more accurate while retaining the speed

yangkl96 commented 1 year ago

I have sent the PR

jalew188 commented 1 year ago

103 close this issue