Chris7 / pyquant

Platform independent command line tool for analysis of mass spectrometry data.
https://chris7.github.io/pyquant/
MIT License

mva query #33

Closed rpm78 closed 4 years ago

rpm78 commented 5 years ago

@Chris7 I am running into difficulties using pyQuant and would be grateful for any advice.

The samples are NeuCode (duplex). The RAW files were processed with MaxQuant (1.6.5), which incorporates a NeuCode search mode that post-dates the release of pyQuant (not sure if that matters). When the following command is entered without the --mva flag, data is loaded normally.

docker run -v /home/rpm/data/winserver/RMtest-output/RMtest:/RMtest/ chrismit7/pyquant:v0.2.2 -p 8 --neucode --overlapping-labels --label-scheme /RMtest/neucode.tsv --min-resolution 400000 --out /RMtest/evidence-trimmed-twice-out --tsv /RMtest/evidence-trimmed-twice.txt --peptide-col Sequence --rt "Retention time" --mz Mass --scan-col "Scan number" --charge Charge --source "Raw file"
msparser not found, Mascot DAT files unable to be parsed
Loading Scans:
.
Scans loaded.
Beginning quantification.
Processing /RMtest/ts_171018_RM143_Run1_2.mzML
................................................................................ts_171018_RM143_Run1_2 processed and placed into queue.

When the --mva flag is added, the following error is reported.

msparser not found, Mascot DAT files unable to be parsed
Loading Scans:
.
Scans loaded.
Beginning quantification.
Processing /RMtest/ts_171018_RM143_Run1_2.mzML.
..............................
Traceback (most recent call last):
File "/usr/local/bin/pyQuant", line 11, in <module>
load_entry_point('pyquant-ms==0.2.2', 'console_scripts', 'pyQuant')()
File "/usr/local/lib/python2.7/dist-packages/pyquant/command_line.py", line 597, in run_pyquant
ions = [i['id_scan'].get('theor_mass', i['id_scan']['mass']) for i in raw_scans] if args.mva else raw_scans['ions']
TypeError: string indices must be integers

Any advice on best ways to proceed much appreciated.
Thanks for making pyQuant!

Chris7 commented 5 years ago

Hi @rpm78,

There is absolutely a bug in the text-file based code, I'll handle fixing that. But a few other questions/info:

rpm78 commented 5 years ago

@Chris7

Thanks for telling me and for the quick answer. I know it is a drag when changes in ancillary programs are likely causing the issue.

I did try the --maxquant flag initially, but received an error, and so retreated to the column-mapping. I initially assumed that the extra neucode-related columns might be wrecking the input.

Your question prompted me to try the flag again. The error I initially encountered was probably just due to a change in the capitalization of one column label ("MS/MS Scan Number" is now "MS/MS Scan number" in the evidence file). Here is the output in case you wanted to see it. So changing that helped.

docker run -v /home/rpm/data/winserver/RMtest-output/RMtest:/RMtest/ chrismit7/pyquant:v0.2.2 -p 8 --maxquant --neucode --overlapping-labels --label-scheme /RMtest/neucode.tsv --min-resolution 400000 --out /RMtest/RM143_Run1_2_only_i_out --tsv /RMtest/evidence.txt --scan-file-dir /RMtest/
msparser not found, Mascot DAT files unable to be parsed
Loading Scans:
sys:1: DtypeWarning: Columns (60) have mixed types. Specify dtype option on import or set low_memory=False.
.Traceback (most recent call last):
  File "/usr/local/bin/pyQuant", line 11, in <module>
    load_entry_point('pyquant-ms==0.2.2', 'console_scripts', 'pyQuant')()
  File "/usr/local/lib/python2.7/dist-packages/pyquant/command_line.py", line 208, in run_pyquant
    specId = str(i[scan_col])
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/series.py", line 623, in __getitem__
    result = self.index.get_value(self, key)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexes/base.py", line 2574, in get_value
    raise e1
KeyError: u'MS/MS Scan Number'
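
The KeyError above is what an exact-match lookup produces when a header's capitalization changes. As a minimal sketch (plain Python; `resolve_column` is a hypothetical helper for illustration, not part of pyQuant), a lookup that ignores case would tolerate this kind of rename:

```python
def resolve_column(columns, wanted):
    """Return the real column name matching `wanted`, ignoring case and outer spaces."""
    lookup = {}
    for name in columns:
        # Keep the first spelling seen for each normalized name.
        lookup.setdefault(name.strip().lower(), name)
    key = wanted.strip().lower()
    if key not in lookup:
        raise KeyError("no column matching %r; available: %s" % (wanted, sorted(columns)))
    return lookup[key]


# Header as written by the newer MaxQuant evidence file mentioned above.
header = ["Sequence", "Charge", "MS/MS Scan number", "Raw file"]

# Name expected by the older parser.
scan_col = resolve_column(header, "MS/MS Scan Number")
print(scan_col)
```

Whether silently accepting a differently cased header is desirable is a design choice; failing loudly with the list of available columns, as the sketch does for a true miss, at least makes the mismatch easy to spot.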

Regarding missing value analysis, yes, that is what I was trying to do, probably incorrectly. To improve on that, I tried to feed in the peptides unique to one of two technical replicate scans using the --peptide-file flag, so they would be searched for in the second replicate scan file, and received output similar to the first error I described. The scan file directory contained only the second replicate mzML and RAW files.

docker run -v /home/rpm/data/winserver/RMtest-output/RMtest:/RMtest/ chrismit7/pyquant:v0.2.2 -p 8 --maxquant --neucode --overlapping-labels --label-scheme /RMtest/neucode.tsv --min-resolution 400000 --out /RMtest/mva-out --peptide-file /RMtest/RM143_Run1_only_evidence.txt --scan-file-dir /RMtest/mva-mzMLs --mva
msparser not found, Mascot DAT files unable to be parsed
Loading Scans:

Scans loaded.
Beginning quantification.
Processing /RMtest/mva-mzMLs/ts_171018_RM143_Run1_2.mzML.
.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................Traceback (most recent call last):
  File "/usr/local/bin/pyQuant", line 11, in <module>
    load_entry_point('pyquant-ms==0.2.2', 'console_scripts', 'pyQuant')()
  File "/usr/local/lib/python2.7/dist-packages/pyquant/command_line.py", line 597, in run_pyquant
    ions = [i['id_scan'].get('theor_mass', i['id_scan']['mass']) for i in raw_scans] if args.mva else raw_scans['ions']
TypeError: 'int' object has no attribute '__getitem__'
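
For reference, both TypeError messages in this thread are what Python reports when the loop variable in that comprehension is a plain string or integer rather than a scan record. The sketch below reproduces that class of error under an assumption: `raw_scans` is modelled here as a dict keyed by scan id, which is a guess for illustration and may not match pyQuant's real data structure. `collect_masses` is likewise a hypothetical helper.

```python
# Hypothetical stand-in for raw_scans: a dict keyed by scan id (an assumption).
raw_scans = {
    101: {"id_scan": {"mass": 1456.05906}},
    102: {"id_scan": {"mass": 728.4, "theor_mass": 728.39}},
}


def collect_masses(scans):
    """Return theor_mass (falling back to mass) for each scan record."""
    # Iterating a dict yields its keys, so take the values explicitly.
    records = scans.values() if isinstance(scans, dict) else scans
    return [r["id_scan"].get("theor_mass", r["id_scan"]["mass"]) for r in records]


try:
    # Mirrors the failing comprehension: `i` is a key (an int here), not a record.
    [i["id_scan"] for i in raw_scans]
except TypeError as exc:
    print("reproduced:", exc)

print(collect_masses(raw_scans))
```

On Python 3 the wording of the message differs from the Python 2.7 traceback above, but the cause is the same: indexing an int or str with a string key.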

Hoping that helps. Thanks again for taking a look at this.

Chris7 commented 5 years ago

Thanks for the update. Not having a way to get the version for MQ is a bit of a pain. I just made a new release, could you try with the docker image chrismit7/pyquant:v0.2.3-rc1? I am not 100% sure if it will fix this problem, as it involved refactoring the code so both the delimited-file parsing and mass-spec result formats use the same function. I think this should just work, but don't have any files on hand atm to test with.

rpm78 commented 5 years ago

I couldn't pull chrismit7/pyquant:v0.2.3-rc1 and received Error response from daemon: manifest for chrismit7/pyquant:v0.2.3-rc1 not found

Will look forward to trying once it is up. Would it be of use for me to upload files to a dropbox folder if the fix doesn't work--or for future testing, even if it does?

Chris7 commented 5 years ago

I just pushed and confirmed it is on dockerhub, guess my upload failed last time. Having the files would be great to help debug or test with as well, so having them on dropbox to play with sounds great.

rpm78 commented 5 years ago

I tried out rc1 and received the same result (below).

Great about the files. If it's ok I will send a link to the email listed in your MCP paper.

docker run -v /home/rpm/data/winserver/RMtest-output/RMtest:/RMtest/ chrismit7/pyquant:v0.2.3-rc1 -p 8 --maxquant --neucode --overlapping-labels --label-scheme /RMtest/neucode.tsv --min-resolution 400000 --out /RMtest/mva-out-0223rc1 --peptide-file /RMtest/RM143_Run1_only_evidence.txt --scan-file-dir /RMtest/mva-mzMLs --mva
msparser not found, Mascot DAT files unable to be parsed
Loading Scans:

Scans loaded.
Beginning quantification.
Processing /RMtest/mva-mzMLs/ts_171018_RM143_Run1_2.mzML.
..........................Traceback (most recent call last):
  File "/usr/local/bin/pyQuant", line 11, in <module>
    load_entry_point('pyquant-ms==0.2.3', 'console_scripts', 'pyQuant')()
  File "/usr/local/lib/python2.7/dist-packages/pyquant/command_line.py", line 617, in run_pyquant
    ions = [i['id_scan'].get('theor_mass', i['id_scan']['mass']) for i in raw_scans] if args.mva else raw_scans['ions']
TypeError: 'int' object has no attribute '__getitem__'

Chris7 commented 5 years ago

Hey @rpm78,

I finally managed to take a look at this and am working on a branch to fix the issue. I have been a bit swamped recently so it might take a little bit.

rpm78 commented 5 years ago

Hey @Chris7, Great!

Chris7 commented 5 years ago

Sorry for the delay. I took some time to look at the actual files themselves, could you give me a bit more information about the experimental setup? Is there no unlabeled peptide?

Also, I'm trying to find the precursor ions for some of the scans and they aren't at the indicated m/z. How were the searches executed in MaxQuant? I know X-Tandem and Comet were used for NeuCode searches by having really wide precursor tolerances, which may result in a mis-reporting of the precursor ion. Also, do you have an uncentroided mzML file? Maybe the centroiding algorithm in ProteoWizard is not handling the NeuCode overlap well.

rpm78 commented 5 years ago

The experiment is duplex neucode SILAC. One condition is drug-stimulated, one is unstimulated. Neurons in different dishes were transferred into medium containing either 13C6-15N2-labelled lysine or d8-lysine shortly after plating. During purification, the samples were combined. So, there is no unlabeled condition. That said, there are likely unlabeled proteins (eg histones) present.

I believe the evidence file I included contained data from three pairs of raw files, of which I uploaded two mzML files (two technical replicates). In MaxQuant (1.6.5.0) the samples were run using the neucode functionality with the duplex-asterisk condition chosen. To make things simpler, I am doing a run of just two technical replicates and will send a link when they are done, including the mqpar file and the raws.

You are probably right to worry about centroiding of the mzMLs. Any differences in the Thermo Fusion Lumos format might also pose difficulty for pwiz. For MaxQuant, I think the second-pass search uses a tolerance of 4.5 ppm for MS1.

I will also include mzML files that are centroided with the "use manufacturer's centroiding" setting in pwiz as well as mzML files that are not centroided.

Chris7 commented 5 years ago

Looking through the file, the relation between the m/z and MS/MS-mz doesn't make sense to me. For instance: AAEPDQNPTAVEGLGTEPDNLVITWKPLNGFQSNGPGLQYK has an m/z of 1456.05906, an MS/MS-mz of 1462.06982421875, and a charge of 3. The m/z is supposed to be where an unlabeled variant of the peptide is located. However, 1462.06982421875 - 8.0142*2/3 does not come out to 1456.05906.

It might just be because I'm tired on a Sunday, but can you explain this? The reason it's important is PyQuant uses the m/z column and then adds the indicated labels (K602/K080) and looks where those ions are predicted to be. Given the math above not adding up, I can't help but think MQ changed something since I last used it.
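
A rough sketch of the prediction described here, assuming K602 denotes 13C6-15N2-lysine and K080 denotes d8-lysine (the usual NeuCode naming, not confirmed by this thread) and using standard isotope mass differences. `labeled_mz` is a hypothetical helper, not PyQuant's actual code:

```python
# Heavy-minus-light isotope mass differences, in Da.
D_13C = 1.0033548
D_15N = 0.9970349
D_2H = 1.0062767

# Assumed label definitions (hypothetical mapping for illustration).
LABEL_SHIFT = {
    "K602": 6 * D_13C + 2 * D_15N,  # 13C6 15N2 lysine, about 8.0142 Da
    "K080": 8 * D_2H,               # d8 lysine, about 8.0502 Da
}


def labeled_mz(unlabeled_mz, charge, n_lysines, label):
    """Predict where a labeled ion sits, given the unlabeled m/z."""
    return unlabeled_mz + n_lysines * LABEL_SHIFT[label] / charge


# The peptide quoted above: charge 3, two lysines.
for label in ("K602", "K080"):
    print(label, round(labeled_mz(1456.05906, 3, 2, label), 5))
```

With these assumed shifts the two labeled ions land only about 0.024 m/z apart at charge 3, which is why the m/z column has to be the true unlabeled position for the lookup to work.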

rpm78 commented 5 years ago

I am going to run the two files through MaxQuant LFQ and see if the m/z and ms/ms-mz come out differently in comparison to the neucode-mode. At least that will be one more point of information.

Would there be a good set of parameters to use in order to run the files through Comet or XTandem in preparation for pyQuant?



Chris7 commented 5 years ago

Thanks, looking back at my old maxquant data, the m/z value definitely seems to correspond to the unlabeled variant of the peptide. They've also seemed to drop the "Labeling State" column since.

For Comet/XTandem, I can see two things working:

When I last ran neucode data (I've moved to industry since and we don't use neucode atm), I used XTandem with 3 state labels (unlabeled was included) and searched for the unlabeled variants. This was for development data so the loss of PSMs from labeled-only peptides didn't really matter as much. I'm surprised the state of searching for neucode data is still like it is.

rpm78 commented 5 years ago

Thanks for the Comet/XTandem suggestions, it is a big help to hear how someone else went about it. I haven't tried bullet two and will give that a shot.

Your comment about the mass made me worry, but I think it's ok. Here is what I found.

The entry in the tables.pdf file that MQ puts out in the txt folder describes the "MS/MS m/z" column in the evidence.txt file as "The m/z used for fragmentation (not necessarily the mono-isotopic m/z)."

In response to a similar discussion on ResearchGate, Malcolm Anderson of Waters provides a helpful answer:

I would expect that the precursor m/z is the peptide's accurate mass, 
whereas the ms/ms m/z is derived from the isotope that was selected for fragmentation. 
For larger, multiply-charged peptides, often it's not the lowest isotope that's most abundant - 
but the most abundant isotope will usually be selected for fragmentation. 
From the values you've quoted, it could be a triply-charged peptide, 
and the second isotope peak was chosen for fragmentation.

This describes our case well.
Following the ResearchGate discussion by Anderson and Sigismondo above, I will write it out for reference and to improve my own arithmetic.

The "MS/MS m/z" entry (minus the contribution of the heavy SILAC lysines) is not equal to the "m/z" entry.

The extra mass of 13C6-15N2-lysine is 8.014199. This value divided by the charge (+3) of the fragmented ion is 2.6713997. Since there are two lysines (one missed cleavage), we multiply by 2 to obtain an extra m/z of 5.3427993. We subtract this from the "MS/MS m/z" (1462.06982421875) and get 1456.7270249, which differs from the "m/z" table value.

How much are we off? 1456.7270249 - 1456.05906 = 0.6679649. But this is m/z and we'd like to know how much mass we are off by: 0.6679649 x 3 = 2.00389. As Anderson points out above, this is consistent with an isotopic peak (+2 neutrons here) having been selected for fragmentation.
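
The same arithmetic as a short Python sketch, for anyone who wants to repeat it on other rows. `isotope_offset` is a hypothetical helper, and the 13C spacing constant is the standard 13C minus 12C mass difference rather than anything taken from MaxQuant:

```python
LYS8_SHIFT = 8.014199    # extra mass of 13C6-15N2-lysine, as quoted above (Da)
C13_SPACING = 1.0033548  # 13C minus 12C mass difference (Da)


def isotope_offset(msms_mz, table_mz, charge, n_labels, label_shift=LYS8_SHIFT):
    """Estimate which isotope peak was fragmented.

    Returns (label-free m/z, mass difference in Da, nearest whole isotope count).
    """
    # Remove the heavy-lysine contribution from the fragmented m/z.
    light_mz = msms_mz - n_labels * label_shift / charge
    # Convert the remaining m/z gap back to a mass difference.
    delta_mass = (light_mz - table_mz) * charge
    n_isotopes = int(round(delta_mass / C13_SPACING))
    return light_mz, delta_mass, n_isotopes


light_mz, delta_mass, n_isotopes = isotope_offset(
    msms_mz=1462.06982421875, table_mz=1456.05906, charge=3, n_labels=2
)
print(round(light_mz, 5), round(delta_mass, 5), n_isotopes)
```

For this peptide the mass difference comes out close to 2 Da, so the nearest isotope count is 2.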


Anyway, I will send the link for the mzMLs later tonight.

Chris7 commented 5 years ago

Thanks for that information. It makes sense to me; I just wish there were some indication that the m/z column is actually the output of MQ's m/z correction process.

The problem is that MaxQuant doesn't give us a way to get the monoisotopic mass, as there is no table of modifications and their mass values like we get with a Proteome Discoverer MSF file or XTandem/Comet output. The alternative is encoding MaxQuant-specific logic in pyquant (mapping Acetyl (Protein N-term) | Oxidation (M) to their respective chemical compositions), which I didn't really want to do b/c I would have to keep up with their change cycle and, similarly, predict every modification that someone may ever want to try. Do you know if MQ outputs this information anywhere nowadays? I recall a combined folder that had a bunch of misc. information.
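
To make the trade-off concrete, here is a minimal sketch of what such MaxQuant-specific logic might look like. Everything in it is an assumption for illustration: the `MOD_MASS` table and `modification_shift` helper are hypothetical, the mass shifts are the standard monoisotopic values for acetylation and oxidation, and the field format (comma-separated, optional leading count, "Unmodified" for none) is how the evidence file's modifications column is assumed to be written.

```python
import re

# Hypothetical lookup: MaxQuant modification label -> monoisotopic mass shift (Da).
MOD_MASS = {
    "Acetyl (Protein N-term)": 42.010565,  # adds C2H2O
    "Oxidation (M)": 15.994915,            # adds O
}


def modification_shift(mod_field):
    """Sum the mass shifts named in a modifications field.

    Assumed format: 'Unmodified', or comma-separated entries such as
    'Acetyl (Protein N-term),2 Oxidation (M)'.
    """
    field = mod_field.strip()
    if field in ("", "Unmodified"):
        return 0.0
    total = 0.0
    for entry in field.split(","):
        entry = entry.strip()
        # An optional leading integer gives the number of occurrences.
        match = re.match(r"^(\d+)\s+(.*)$", entry)
        count, name = (int(match.group(1)), match.group(2)) if match else (1, entry)
        if name not in MOD_MASS:
            # This is the maintenance burden: every new label needs an entry.
            raise KeyError("unknown modification: %r" % name)
        total += count * MOD_MASS[name]
    return total


print(modification_shift("Acetyl (Protein N-term),2 Oxidation (M)"))
```

The `KeyError` branch is the crux of the concern raised above: any modification a user configures that is missing from the table stops the run until someone adds it.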

rpm78 commented 5 years ago

I took a look in the tables.pdf file and the entry for m/z in evidence.txt is

The recalibrated mass-over-charge value of the precursor ion.

So that's good.

Yes, the txt folder is a subdirectory of combined, which also contains a "modified peptides" file.
I will include a link to the tables.pdf file, which has a description for every column in every txt file output.

rpm78 commented 4 years ago

Closing this as out of date. Thank you for putting in all the work to keep it current.

Chris7 commented 4 years ago

Thanks for your feedback as well