write a find_spectra.py

edeutsch commented 2 years ago

Try writing a find_spectra.py program with the following command-line parameters:

--mzml_file XXXX              Name of the file to read
--precursor_mz n.n            Value of the precursor m/z
--tolerance n.n               Tolerance between a specified peak m/z and a measured peak m/z (precursor or fragment ion peak)
--required_peaks n.n,n.n,n.n  List of fragment ion peaks that must be in the spectrum

Output should be a TSV file or printed to the screen, with these columns:

MS Run name         scan number        precursor m/z        matched peaks list m/zs       matched peaks list intensities
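A minimal sketch of the command line and the required-peak check described above. Only the four flag names come from this issue. The helper names (`parse_mz_list`, `build_parser`, `match_required_peaks`), the default tolerance of 0.01, and the choice to keep the most intense peak per required m/z are assumptions for illustration. Reading the mzML file itself is left out.

```python
import argparse


def parse_mz_list(text):
    """Turn 'n.n,n.n,n.n' into a list of floats."""
    return [float(item) for item in text.split(",") if item.strip()]


def build_parser():
    parser = argparse.ArgumentParser(description="Find spectra that contain the required peaks")
    parser.add_argument("--mzml_file", required=True, help="Name of the file to read")
    parser.add_argument("--precursor_mz", type=float, help="Value of the precursor m/z")
    # The default tolerance is an assumption, not a value from the issue.
    parser.add_argument("--tolerance", type=float, default=0.01,
                        help="Tolerance between a specified and a measured peak m/z")
    parser.add_argument("--required_peaks", type=parse_mz_list, default=[],
                        help="Comma-separated fragment ion peaks that must be in the spectrum")
    return parser


def match_required_peaks(spectrum_mzs, spectrum_intensities, required_peaks, tolerance):
    """Return (matched m/zs, matched intensities), or None if any required peak is missing.

    When several measured peaks fall within tolerance of one required m/z,
    the most intense one is kept (an assumption for this sketch).
    """
    matched_mzs, matched_intensities = [], []
    for target in required_peaks:
        candidates = [
            (intensity, mz)
            for mz, intensity in zip(spectrum_mzs, spectrum_intensities)
            if abs(mz - target) <= tolerance
        ]
        if not candidates:
            return None
        intensity, mz = max(candidates)
        matched_mzs.append(mz)
        matched_intensities.append(intensity)
    return matched_mzs, matched_intensities
```

A call such as `build_parser().parse_args()` in `main()` would then supply `args.required_peaks` and `args.tolerance` to `match_required_peaks` for each spectrum.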

nathanhzh commented 2 years ago

Hi Dr. Deutsch, I had a question about what I should do for MS Run name. You mentioned it should just be the start of the file, so would that be the file name: "HFX_9850_GVA_DLD1_2_180719.mzML"? Also for the peaks lists, should I print them out as a string separated by commas or in the array format?

edeutsch commented 2 years ago

Yes, the MS Run Name should be like "HFX_9850_GVA_DLD1_2_180719", without the .mzML. For the lists, I think separating the values with tabs (just like the other data columns) is best, so that they are easily parsed as separate columns in Excel.
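One way this could look in code, as a sketch only: the function names `ms_run_name` and `format_row` and the number of decimal places are assumptions, not something agreed in this thread.

```python
from pathlib import Path


def ms_run_name(mzml_path):
    """File name without directory or the final extension (.mzML)."""
    return Path(mzml_path).stem


def format_row(run_name, scan_number, precursor_mz, matched_mzs, matched_intensities):
    """Build one TSV line; each list element becomes its own tab-separated column."""
    fields = [run_name, str(scan_number), f"{precursor_mz:.4f}"]
    fields += [f"{mz:.4f}" for mz in matched_mzs]
    fields += [f"{intensity:.1f}" for intensity in matched_intensities]
    return "\t".join(fields)
```

Because every matched m/z and intensity is a separate field, the row opens in Excel with one value per cell.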

nathanhzh commented 2 years ago

Hi Dr. Deutsch, let's say I want 2 minimum_optional_peaks and I pass in 3 optional_peaks (100, 200, and 300). If there are 2 peaks that match 100, but no peaks that match 200 or 300, would that still be a valid spectrum? Or would 2 minimum_optional_peaks mean I need 2 different peaks from optional_peaks (e.g. 100 and 200)?

edeutsch commented 2 years ago

Hi Nathan, a minimum of two peaks means at least two of 100, 200, 300. Having two very close peaks within the tolerance at 100 is an unpleasant distraction. It might be clearer to call things "required windows", "optional windows" and "minimum required optional windows"
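The window-based rule can be sketched as follows. The function and parameter names are hypothetical and simply follow the naming suggested above. Each window counts at most once, no matter how many measured peaks fall inside it.

```python
def count_matched_windows(spectrum_mzs, window_centers, tolerance):
    """Count windows that contain at least one measured peak."""
    return sum(
        1
        for center in window_centers
        if any(abs(mz - center) <= tolerance for mz in spectrum_mzs)
    )


def spectrum_passes(spectrum_mzs, required_windows, optional_windows,
                    minimum_optional_windows, tolerance):
    """All required windows must match, plus at least the minimum number of optional ones."""
    required_ok = (
        count_matched_windows(spectrum_mzs, required_windows, tolerance)
        == len(required_windows)
    )
    optional_ok = (
        count_matched_windows(spectrum_mzs, optional_windows, tolerance)
        >= minimum_optional_windows
    )
    return required_ok and optional_ok
```

With optional windows at 100, 200 and 300 and a minimum of two, a spectrum with two close peaks near 100 and nothing near 200 or 300 matches only one window, so it does not pass.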

nathanhzh commented 2 years ago

Got it, thanks! I have updated the code with what we discussed, committed it, and pushed it so we can go over it on Tuesday.

edeutsch commented 2 years ago


class MSRunPeakFinder

  properties defined in the constructor:
    - binned_array (4 M elements)
    - list of detected peaks
    - list of theoretical_ions
    - user_parameters (msrun_file_path, file_root, tolerance, bin_size)

  methods:
    - create_binned_array(msrun_filename)
    - find_peaks()
    - get_theoretical_ions()
    - identify_peaks()
    - write()
    - make_plots()

main()
  - accepts parameters from cmd line and stores them
  - create the finder = MSRunPeakFinder object
  - finder.create_binned_array(msrun_filename)
  - peaks = finder.find_peaks()
  - finder.get_theoretical_ions()
  - finder.identify_peaks()
  - finder.write()
  - finder.make_plots()

------

list of detected_peaks:

detected_peaks = [
    {
      "triggered_mz": nn.nn,
      "centroid_mz": nn.nn,
      "intensity": mm.mmm,
      "explanations": [
          {
            "mz": yy.yyy,
            "tag": "IH+H2O"
          }
       ]
    }
]
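A rough sketch of how the binned array might be filled, in pure Python for clarity. The m/z range, the way intensities are summed per bin, and the function names are assumptions. As an illustration of the size, a 400 m/z-wide range at a bin size of 0.0001 would give 4 M elements, but the outline above does not state the actual range.

```python
def create_binned_array(peaks, min_mz, max_mz, bin_size):
    """Sum the intensity of (mz, intensity) pairs into fixed-width m/z bins."""
    n_bins = int(round((max_mz - min_mz) / bin_size))
    binned = [0.0] * n_bins
    for mz, intensity in peaks:
        # Skip peaks outside the binned range.
        if not (min_mz <= mz < max_mz):
            continue
        # Clamp to guard against floating-point overshoot at the upper edge.
        index = min(int((mz - min_mz) / bin_size), n_bins - 1)
        binned[index] += intensity
    return binned


def bin_center(index, min_mz, bin_size):
    """m/z at the middle of a bin, useful as a 'triggered_mz' starting point."""
    return min_mz + (index + 0.5) * bin_size
```

Summing every spectrum of a run into one array like this makes recurring fragment m/z values stand out as peaks that `find_peaks()` can then detect.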

nathanhzh commented 2 years ago

I was able to remove all the duplicates close to each other (even when it dips under 50)! I added some pictures as examples of the changes. A question about the fork example: which one should be used? The one currently displayed, which has an m/z value closer to the identified amino acid's m/z value, or the peak with the greater intensity (the one in the middle)?

[Image: before fork]
[Image: after fork]
[Image: before, incorrectly detects the highest peaks]
[Image: after, correctly detects the highest peaks]

edeutsch commented 2 years ago

I suggest using a Gaussian fit to determine the centroid of the whole shape and just going with that one m/z.
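As a simplified illustration of the idea, the sketch below fits a Gaussian through only the apex and its two neighbours by fitting a parabola to the log intensities, which is exact for a noise-free Gaussian on an evenly spaced grid. It is a stand-in for a full least-squares fit over the whole shape, and the function name is hypothetical.

```python
import math


def gaussian_centroid(mzs, intensities):
    """Estimate the peak centre from the apex and its two neighbours.

    Assumes evenly spaced m/z values. Falls back to the apex m/z when the
    three points cannot describe a peak.
    """
    apex = max(range(len(intensities)), key=intensities.__getitem__)
    if apex == 0 or apex == len(intensities) - 1:
        return mzs[apex]
    left, middle, right = (intensities[apex + k] for k in (-1, 0, 1))
    if min(left, middle, right) <= 0:
        return mzs[apex]
    a, b, c = math.log(left), math.log(middle), math.log(right)
    curvature = a - 2 * b + c
    if curvature >= 0:  # not concave, so there is no maximum to locate
        return mzs[apex]
    spacing = (mzs[apex + 1] - mzs[apex - 1]) / 2
    # Vertex of the parabola through the three log-intensity points.
    return mzs[apex] + spacing * 0.5 * (a - c) / curvature
```

For a forked shape with two local maxima, three points around one apex are not enough, so a fit over all bins of the shape would be the closer match to the suggestion above.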

nathanhzh commented 2 years ago

Got it, thanks! I also checked why there aren't any pairs of amino acids showing up. It seems that after changing the ion type from 'a' to 'b' and the charge from 1 to 0 for the second identifier of the pair, the masses of the known ion pairs are no longer within the tolerance range of the observed peaks (intensity > 50), so they are not identified. For example, a-AA is computed as 115.5865, which is no longer close enough to 115.0865 (a peak with an intensity of 735).
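For reference, the m/z values quoted in this thread can be reproduced from standard monoisotopic masses. This is an independent sketch, not the project's get_theoretical_ions() code; the function name and the restriction to singly charged a and b ions of G, A and N are assumptions made to keep it short.

```python
# Monoisotopic residue masses (Da) for the residues mentioned in this thread.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "N": 114.04293}
PROTON_MASS = 1.007276
CO_MASS = 27.99491  # an a ion is the b ion minus CO


def singly_charged_ion_mz(ion_type, residues):
    """m/z of a singly charged a or b fragment ion for a residue string."""
    total = sum(RESIDUE_MASS[residue] for residue in residues)
    if ion_type == "b":
        return total + PROTON_MASS
    if ion_type == "a":
        return total - CO_MASS + PROTON_MASS
    raise ValueError(f"unsupported ion type: {ion_type}")
```

Under these masses a-AA comes out within 0.001 of the observed 115.0865 peak, so a computed value of 115.5865 points to a bug rather than a real mass difference.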

edeutsch commented 2 years ago

I'm not quite sure what to suggest, except that 115.5865 is definitely wrong. It seems to be off by exactly 0.5000, which is weird. Perhaps an error in some rounding code? Maybe when you round to 4 places after the decimal, the +0.5 is in the wrong place?

nathanhzh commented 2 years ago

Thank you for the suggestion. I had placed the 0.5 in the wrong spot when rounding. I fixed the code and pushed it with the updated table output, and it seems like most of the peaks have been identified. I'll continue with changing the plot outputs and adding a Gaussian fit.
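The actual rounding code is not shown in this thread, so the following is a hypothetical reconstruction of how a misplaced +0.5 produces an offset of 0.5. Both function names are made up for illustration, and the approach assumes positive values.

```python
def round_half_up_buggy(value, places=4):
    """Adds 0.5 AFTER scaling back down, shifting every result by 0.5."""
    scale = 10 ** places
    return int(value * scale) / scale + 0.5


def round_half_up(value, places=4):
    """Adds 0.5 while the value is still scaled, then truncates."""
    scale = 10 ** places
    return int(value * scale + 0.5) / scale
```

Python's built-in `round(value, 4)` avoids the manual +0.5 entirely, although it rounds exact halves to the nearest even digit.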

edeutsch commented 2 years ago

Great! Would you make one more small style change? Instead of this:

115.0502    153 [115.0502, 0.0, 'INb']  [115.0502, 0.0, 'b-GG']

would you do:

115.0502    153 [115.0502, 0.0, 'b-N']  [115.0502, 0.0, 'b-GG']

i.e. change "INb" into "b-N" (a b-type ion for a single N), which matches the pattern for two Ns, "b-NN".
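A tiny sketch of the requested label pattern, with a hypothetical helper name, so that single residues and pairs share one code path:

```python
def ion_label(ion_type, residues):
    """Build labels such as 'b-N', 'b-NN' or 'b-GG' from an ion type and residue string."""
    return f"{ion_type}-{residues}"
```

Using the same helper for one and two residues keeps the tags in the output table consistent.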

nathanhzh commented 2 years ago

Yep, I just changed the output and pushed the code

PlantProteomes / SpectrumReader

write a find_spectra.py #1