Open edeutsch opened 2 years ago
Hi Dr. Deutsch, I had a question about what I should use for the MS Run name. You mentioned it should just be the start of the file name, so would that be the file name itself: "HFX_9850_GVA_DLD1_2_180719.mzML"? Also, for the peak lists, should I print them out as a string separated by commas or in the array format?
Yes, the MS Run Name should be like "HFX_9850_GVA_DLD1_2_180719", without the .mzML. And for the lists, I think separating the values with tabs (just like the other data columns) is best, so that they are easily parsed as separate columns in Excel.
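A minimal sketch of the two points above: deriving the MS Run Name from the file name and writing peak values as extra tab-separated columns. The helper names and the scan-number column are assumptions made for illustration, not part of the existing code.

```python
from pathlib import Path

def ms_run_name(mzml_path):
    """Return the file name without its directory and without the .mzML extension."""
    return Path(mzml_path).stem

def format_row(run_name, scan_number, peak_mzs):
    """Join the fixed columns and each peak value with tabs, one peak per column."""
    # scan_number is a hypothetical column used only to make the row realistic.
    columns = [run_name, str(scan_number)] + [f"{mz:.4f}" for mz in peak_mzs]
    return "\t".join(columns)

row = format_row(ms_run_name("data/HFX_9850_GVA_DLD1_2_180719.mzML"), 1234, [100.0, 200.5])
print(row)
```

Because every peak lands in its own column, Excel splits the row the same way it splits the other data columns.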
Hi Dr. Deutsch, Let's say I want 2 minimum_optional_peaks and I pass in 3 optional_peaks (100, 200, and 300). If there are 2 peaks that match 100, but no peaks that match 200 or 300, would that still be a valid spectrum? Or would 2 minimum_optional_peaks mean I need 2 different peaks from optional_peaks (e.g. 100 and 200)?
Hi Nathan, a minimum of two peaks means at least two of 100, 200, 300. Having two very close peaks within the tolerance at 100 is an unpleasant distraction. It might be clearer to call things "required windows", "optional windows" and "minimum required optional windows"
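The "window" reading can be sketched as follows: each optional m/z is a window, and a window counts once no matter how many observed peaks fall inside it. The function names here are hypothetical and only illustrate the rule described above.

```python
def count_matched_windows(peak_mzs, window_mzs, tolerance):
    """Count how many distinct windows contain at least one observed peak."""
    matched = 0
    for window_mz in window_mzs:
        if any(abs(peak_mz - window_mz) <= tolerance for peak_mz in peak_mzs):
            matched += 1
    return matched

def has_enough_optional_windows(peak_mzs, optional_windows, minimum, tolerance):
    """True when at least `minimum` of the optional windows are matched."""
    return count_matched_windows(peak_mzs, optional_windows, tolerance) >= minimum

# Two peaks near 100 fill only one window, so a minimum of 2 is not met.
print(has_enough_optional_windows([99.999, 100.001], [100.0, 200.0, 300.0], 2, 0.01))  # False
# One peak near 100 and one near 200 fill two windows.
print(has_enough_optional_windows([100.001, 199.995], [100.0, 200.0, 300.0], 2, 0.01))  # True
```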
Got it, thanks! I have updated the code with what we discussed, committed it, and pushed it so we can go over it on Tuesday.
class MSRunPeakFinder
properties defined in the constructor:
- binned_array (4 M elements)
- list of detected peaks
- list of theoretical_ions
- user_parameters (msrun_file_path, file_root, tolerance, bin_size)
methods:
- create_binned_array(msrun_filename)
- find_peaks()
- get_theoretical_ions()
- identify_peaks()
- write_to_tsv()
- make_plots()
main()
- accepts parameters from cmd line and stores them
- create the finder = MSRunPeakFinder object
- finder.create_binned_array(msrun_filename)
- peaks = finder.find_peaks()
- finder.get_theoretical_ions()
- finder.identify_peaks()
- finder.write_to_tsv()
- finder.make_plots()
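As a rough illustration of the binned array in the outline above, here is a standalone sketch that sums peak intensities into fixed-width m/z bins. The signature, the (mz, intensity) input format, and the max_mz parameter are assumptions; the outline only says the array has about 4 M elements and that bin_size is a user parameter. For example, 0 to 400 m/z at a bin width of 0.0001 would give roughly 4 M bins, but that pairing is a guess.

```python
def create_binned_array(peaks, bin_size, max_mz):
    """Sum peak intensities into fixed-width m/z bins.

    peaks is an iterable of (mz, intensity) pairs. Peaks whose bin index
    falls outside the array are skipped.
    """
    n_bins = int(max_mz / bin_size) + 1
    binned = [0.0] * n_bins
    for mz, intensity in peaks:
        index = int(mz / bin_size)  # bin that this m/z value falls into
        if 0 <= index < n_bins:
            binned[index] += intensity
    return binned

binned = create_binned_array([(100.0004, 10.0), (100.0006, 5.0)], bin_size=0.001, max_mz=400.0)
print(binned[100000])  # both peaks land in the same bin
```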
------
list of detected_peaks:
detected_peaks = [
    {
        "triggered_mz": nn.nn,
        "centroid_mz": nn.nn,
        "intensity": mm.mmm,
        "explanations": [
            {
                "mz": yy.yyy,
                "tag": "IH+H2O"
            }
        ]
    }
]
I was able to remove all the duplicates close to each other (even when the intensity dips under 50)! I added some pictures as examples of the changes.
A question about the fork example: which peak should be used? The one currently displayed, which has an m/z value closer to the identified amino acid's m/z value, or the peak with the greater intensity (the one in the middle)?
[Images: Before Fork; After Fork; Before, incorrectly detects the highest peaks; After, correctly detects the highest peaks]
I suggest using a Gaussian fit to determine the centroid of the whole shape and just going with that one m/z.
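One way to get a Gaussian centroid without extra dependencies is sketched below. It swaps a full nonlinear fit for a simpler technique: a least-squares parabola through ln(intensity), whose vertex is the center of the Gaussian. This is only an illustration with hypothetical function names, not the code in the repository. A nonlinear fit such as scipy.optimize.curve_fit is an alternative when the peak sits on noisy data, because the log transform gives the low-intensity tails a lot of weight.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def gaussian_centroid(mzs, intensities):
    """Estimate the center of a Gaussian-shaped peak.

    Fits ln(intensity) = a + b*x + c*x^2 by least squares, where x is the
    m/z relative to the mean m/z, and returns the vertex -b / (2c) shifted
    back to the m/z scale. Points with non-positive intensity are ignored.
    """
    points = [(mz, math.log(inten)) for mz, inten in zip(mzs, intensities) if inten > 0]
    if len(points) < 3:
        raise ValueError("need at least three points with positive intensity")
    mean_mz = sum(mz for mz, _ in points) / len(points)
    xs = [mz - mean_mz for mz, _ in points]  # centering keeps the sums well scaled
    ys = [y for _, y in points]

    # Sums that make up the normal equations of the quadratic fit.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum((x ** k) * y for x, y in zip(xs, ys)) for k in range(3)]
    det = det3([[s[0], s[1], s[2]], [s[1], s[2], s[3]], [s[2], s[3], s[4]]])
    if det == 0:
        raise ValueError("points are degenerate; cannot fit a parabola")

    # Cramer's rule for the linear (b) and quadratic (c) coefficients.
    b = det3([[s[0], t[0], s[2]], [s[1], t[1], s[3]], [s[2], t[2], s[4]]]) / det
    c = det3([[s[0], s[1], t[0]], [s[1], s[2], t[1]], [s[2], s[3], t[2]]]) / det
    if c >= 0:
        raise ValueError("points do not form a peak")
    return mean_mz - b / (2.0 * c)
```

For a forked shape, passing all the points of the whole shape gives a single m/z for it, which is the behavior suggested above.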
Got it, thanks! I also checked why no pairs of amino acids are showing up. It seems that after changing the ion type from 'a' to 'b' and the charge from 1 to 0 for the second identifier of the pair, the masses of the known ion pairs are no longer within the tolerance of the observed peaks (intensity > 50), so they are not identified. For example, a-AA is 115.5865, which is no longer close enough to 115.0865 (a peak with an intensity of 735).
I'm not quite sure what to suggest, except that 115.5865 is definitely wrong. It seems to be off by exactly 0.5000, which is weird. Perhaps an error in some rounding code? Maybe when you round to 4 places after the decimal, the +0.5 is in the wrong place?
Thank you for the suggestion. I had placed the 0.5 in the wrong spot when rounding. I fixed the code and pushed it with the updated table output, and it seems like most of the peaks have now been identified. I'll continue with changing the plot outputs and adding a Gaussian fit.
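For reference, here is a sketch of how a misplaced +0.5 can shift a value by exactly 0.5000. The actual rounding code is not shown in this thread, so both functions are hypothetical reconstructions of one way the bug could arise.

```python
def round_to_places(value, places=4):
    """Round a non-negative value half up; the +0.5 is added after scaling."""
    scale = 10 ** places
    return int(value * scale + 0.5) / scale

def round_with_misplaced_half(value, places=4):
    """Hypothetical buggy variant: the +0.5 is added before scaling."""
    scale = 10 ** places
    return int((value + 0.5) * scale) / scale

print(round_to_places(115.08651))            # 115.0865
print(round_with_misplaced_half(115.08651))  # 115.5865
```

Python's built-in round(value, 4) avoids the manual +0.5 altogether.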
Great! Would you make one more small style change? Instead of this:
115.0502 153 [115.0502, 0.0, 'INb'] [115.0502, 0.0, 'b-GG']
would you do:
115.0502 153 [115.0502, 0.0, 'b-N'] [115.0502, 0.0, 'b-GG']
i.e. change "INb" (a b-type ion for a single N) into "b-N", which matches the pattern for two Ns, "b-NN".
Yep, I just changed the output and pushed the code
Try writing a find_spectra.py program with the following command-line parameters:
--mzml_file XXXX    Name of the file to read
--precursor_mz n.n    Value of the precursor m/z
--tolerance n.n    Tolerance between a specified peak m/z and a measured peak m/z (precursor or fragment ion peak)
--required_peaks n.n,n.n,n.n    List of fragment ion peaks that must be in the spectrum
Output should be written to a TSV file or printed to the screen.
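A possible starting point for find_spectra.py, using argparse for the four parameters listed above. Reading the mzML file is left out, and spectrum_matches works on values already in memory. The helper names are assumptions made for this sketch.

```python
import argparse

def parse_float_list(text):
    """Turn '100.0,200.5' into [100.0, 200.5]."""
    return [float(item) for item in text.split(",") if item.strip()]

def build_parser():
    parser = argparse.ArgumentParser(description="Find spectra that contain the required peaks")
    parser.add_argument("--mzml_file", required=True, help="Name of the file to read")
    parser.add_argument("--precursor_mz", type=float, required=True,
                        help="Value of the precursor m/z")
    parser.add_argument("--tolerance", type=float, required=True,
                        help="Tolerance between a specified and a measured peak m/z")
    parser.add_argument("--required_peaks", type=parse_float_list, default=[],
                        help="Comma-separated fragment ion peaks that must be in the spectrum")
    return parser

def spectrum_matches(precursor_mz, peak_mzs, args):
    """Check one spectrum against the precursor m/z and every required peak."""
    if abs(precursor_mz - args.precursor_mz) > args.tolerance:
        return False
    return all(
        any(abs(mz - required) <= args.tolerance for mz in peak_mzs)
        for required in args.required_peaks
    )

# Example with an explicit argument list instead of the real command line.
args = build_parser().parse_args([
    "--mzml_file", "HFX_9850_GVA_DLD1_2_180719.mzML",
    "--precursor_mz", "500.25",
    "--tolerance", "0.01",
    "--required_peaks", "100.0,200.0",
])
print(spectrum_matches(500.255, [100.004, 150.0, 199.998], args))  # True
```

The same tolerance is applied to the precursor and to the fragment ion peaks, as the parameter description states.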