compomics / moFF

A modest Feature Finder (moFF) to extract MS1 intensities from Thermo raw file
Apache License 2.0

PeptideShaker moFF interaction via CLI #2

Closed bgruening closed 3 months ago

bgruening commented 8 years ago

As mentioned in https://github.com/compomics/searchgui/issues/107, it would be great if we could pass PeptideShaker outputs to moFF via the CLI. We could either pass the zip file, or a special PS output could be used to pass only the tabular file.

Thanks, Bjoern

Maux82 commented 8 years ago

Hi Bjoern, at the moment the integration of moFF with the PeptideShaker output is supported only in the GUI version, where we generate the tab file for moFF from the cpsx file. The transformation is done in Java inside the GUI. If we could get the tab file from the PeptideShaker CLI, that would be great.

Another point for this integration in the CLI version is how the tab file names are associated with the correct raw files. Currently the raw file must have the same name as the input tab file. In the GUI version we manage this mapping with a mapping file generated by the user.
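
A minimal sketch of the name-based matching described above, assuming "same name" means the same base name with a different extension. `pair_by_stem` is a hypothetical helper for illustration and is not part of moFF.

```python
from pathlib import Path


def pair_by_stem(tab_files, raw_files):
    """Pair each tab file with the raw file that shares its base name."""
    raw_by_stem = {Path(raw).stem: raw for raw in raw_files}
    pairs, missing = [], []
    for tab in tab_files:
        raw = raw_by_stem.get(Path(tab).stem)
        if raw is None:
            missing.append(tab)
        else:
            pairs.append((tab, raw))
    if missing:
        # Fail loudly instead of silently skipping an input file.
        raise ValueError(f"no raw file with a matching name for: {missing}")
    return pairs


# The order of the raw files does not matter, only the names do.
pairs = pair_by_stem(
    ["f1_folder/input_file1.txt", "f1_folder/input_file2.txt"],
    ["f1_folder/input_file2.raw", "f1_folder/input_file1.raw"],
)
```

The drawback raised in the next comment is visible here: any renaming of either file breaks the pairing.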

bgruening commented 8 years ago

@Maux82 who do I need to peeve to get this into PS ;) @mvaudel?

What about adding two options, --input_tab and --input_raw, with nargs, and relying on the order of the files? This makes it more flexible than matching purely by name.
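
A sketch of what this proposal could look like with Python's `argparse`. The option names are the ones suggested in this comment; `parse_pairs` is a hypothetical function and is not moFF's actual command line.

```python
import argparse


def parse_pairs(argv):
    """Parse two file lists and pair them by position."""
    parser = argparse.ArgumentParser(description="Pair identification and raw files by order")
    parser.add_argument("--input_tab", nargs="+", required=True, help="tab-delimited PSM files")
    parser.add_argument("--input_raw", nargs="+", required=True, help="raw files, in the same order")
    args = parser.parse_args(argv)
    if len(args.input_tab) != len(args.input_raw):
        parser.error("--input_tab and --input_raw need the same number of files")
    # The i-th tab file belongs to the i-th raw file; names are free to differ.
    return list(zip(args.input_tab, args.input_raw))


pairs = parse_pairs(
    ["--input_tab", "a.txt", "b.txt", "--input_raw", "run1.raw", "run2.raw"]
)
```

Checking the list lengths up front matters, because `zip` would otherwise silently drop the unpaired files.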

mvaudel commented 8 years ago

Well in the interest of the simplicity of command lines and maintenance effort it sounds better to make the conversion code GUI independent, and have a command line taking the cpsx (or even better mzIdentML) as input?

Agree with @bgruening that it would be best to have --input_mzid | --input_raw for each doublet id/raw :)

bgruening commented 8 years ago

mzIdentML!!!! please! :) You guys rock!

kverhegg commented 8 years ago

Hello !

I've looked into this and I think the easiest (short-term) solution would be to write a wrapper around the Pladipus Step inside MoFF GUI. It basically (for now) handles the very simple conversion of a default peptideshaker report (Extended_PSM_report) into the format MoFF needs. This would be a command line version of the components leading to the MoFF process.

Long term... we might be able to provide mzIdentML support, but we would have to write a small parser to convert it into the MoFF format. We can potentially do this with the ms-data-core-api that Yasset Perez-Riverol made, I can recommend it!

Let me know your opinion before I get started ;)

Cheers,

Kenneth

bgruening commented 8 years ago

@kverhegg what are the dependencies of the first approach? Can this functionality be stripped out, or do we need Pladipus completely? I think a self-contained tool to convert PS to MoFF is a nice idea to start with.

kverhegg commented 8 years ago

You wouldn't need the entire Pladipus Framework for sure, it would just use the command launching code. I guess I can strip everything out as well if that's better... But then isn't it easier to make a custom PS report tailored for MoFF ? As that is in essence what this step is doing...

EDIT

In fact, the simplest way is to use the PeptideShaker command line option -reports 8 (should be the Extended_PSM_Report, if I'm not mistaken) and feed those exports into MoFF, as I just heard that Andrea provided an option in the command line to use lists of identification / raw files rather than a mapping file! :) Meanwhile I'll check if we can make an mzID converter/adapter (should be fairly fast to do).

hbarsnes commented 8 years ago

Hi all,

@bgruening I've discussed a bit with Kenneth and it seems like you should be able to link PeptideShaker and moFF by using the default PSM report from PeptideShaker. This should generate the tsv file required (you may have to change the extension from txt to tsv though), i.e. the file called "MS2 identified peptide information" detailed here: https://github.com/compomics/moFF#input-data.

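Note that at the moment you will manually have to rename some of the column headers to match the headers required by moFF:

Sequence > peptide
Protein(s) > prot
RT > rt (if the case matters, not sure?)
m/z > mz
Theoretical Mass > mass
Identification Charge > charge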

But we will look into if it's possible to update the mappings used in moFF to also support the column headers used in the PeptideShaker output. @Maux82 This should not be too difficult I hope?

Regarding the mzid support, Kenneth is working on a converter to convert mzid files to the required tsv file. I'll leave it to him to give you a status update for this part. :)

Also, note that none of the above is actually tested yet, so looking forward to your testing. ;)

Best regards, Harald

mvaudel commented 8 years ago

Hi guys,

You might want to try the compomics-utilities parser of mzID, can get you this piece of information fairly easily, might be quicker than trying to accommodate the PSM text export (which is not a standard). On the other hand rt is not necessarily part of the export of ID software. I strongly encourage you to use a unique spectrum identifier and map back to the rt and mz in the raw file instead of relying on exports of id tools.

Good luck :)

Marc

hbarsnes commented 8 years ago

Hi Marc,

You might want to try the compomics-utilities parser of mzID, can get you this piece of information fairly easily, might be quicker than trying to accommodate the PSM text export (which is not a standard).

As far as I understood the problem with this approach is that it would require adding an additional step in between the ID tool's mzid export and the reading into moFF, as moFF is written in Python while our mzid parsing code is in Java. (And the same for the EBI mzid parser.)

So while I agree that supporting mzid directly on the moFF command line would be the optimal solution I'm not sure if this is really feasible at this point.

I strongly encourage you to use a unique spectrum identifier and map back to the rt and mz in the raw file instead of relying on exports of id tools.

I agree with this one as well. Also, maybe some sort of score for the PSMs should be part of the input to moFF? As you can have both good and bad scoring PSMs? And at the moment all PSMs seem to be treated as equally good? Not sure how easy it would be to do this in a generic way though?

Best regards, Harald

Maux82 commented 8 years ago

Hi All,

Note that at the moment you will manually have to rename some of the column headers to match the headers required by moFF:

Sequence > peptide
Protein(s) > prot
RT > rt (if the case matters, not sure?)
m/z > mz
Theoretical Mass > mass
Identification Charge > charge

But we will look into if it's possible to update the mappings used in moFF to also support the column headers used in the PeptideShaker output

Well, moFF is mainly based on pandas DataFrames, and the field names are case sensitive. I am going to use a mapping function that changes the PS field names into the ones required by moFF. I can use the --input_mzid | --input_raw input just for PS output files. The other input options will stay for all the other manually curated output. What do you think, guys?
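
A minimal sketch of such a mapping function, using the header pairs listed earlier in this thread. It is not moFF's implementation: `PS_TO_MOFF`, `rename_header` and `convert` are hypothetical names, the sample row is made up, and the standard `csv` module is used so the example runs without dependencies.

```python
import csv
import io

# PeptideShaker header -> moFF header, as listed in this thread (case sensitive).
PS_TO_MOFF = {
    "Sequence": "peptide",
    "Protein(s)": "prot",
    "RT": "rt",
    "m/z": "mz",
    "Theoretical Mass": "mass",
    "Identification Charge": "charge",
}


def rename_header(header):
    """Rename known PeptideShaker columns and leave all others untouched."""
    return [PS_TO_MOFF.get(name, name) for name in header]


def convert(ps_export, moff_input):
    """Copy a tab-delimited export, rewriting only the header row."""
    reader = csv.reader(ps_export, delimiter="\t")
    writer = csv.writer(moff_input, delimiter="\t", lineterminator="\n")
    writer.writerow(rename_header(next(reader)))
    writer.writerows(reader)


# Made-up example row, only to show the header rewrite.
source = io.StringIO("Sequence\tProtein(s)\tRT\tm/z\nPEPTIDEK\tP12345\t1234.5\t450.73\n")
target = io.StringIO()
convert(source, target)
```

With a pandas DataFrame, the same dictionary could be passed to `DataFrame.rename(columns=PS_TO_MOFF)`.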

I strongly encourage you to use a unique spectrum identifier and map back to the rt and mz in the raw file instead of relying on exports of id tools.

I also agree that retrieving the RT and the m/z from the raw file is one of the best ways to get precise RT values in moFF. We are currently testing a new (cross-platform) Thermo library that should do this very quickly. Implementing the same thing with the unthermo library (txic and txic.exe are compiled against the unthermo library) requires more effort. For now, I propose to leave this feature as future work.

I agree with this one as well. Also, maybe some sort of score for the PSMs should be part of the input to moFF? As you can have both good and bad scoring PSMs? And at the moment all PSMs seem to be treated as equally good? Not sure how easy it would be to do this in a generic way though?

Yes, the PSM score can be an input of moFF. If you need all this kind of post-processing on the PS output, maybe we have to think about a moFF CLI version just for PS.

mvaudel commented 8 years ago

This sounds all great! Indeed having a PSM cut-off is important. If you support mzId there will be a field indicating whether the PSM is validated or not, that would make it generic and remove the need for PS command line :)
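
A sketch of reading that validation flag from mzIdentML with the standard library, assuming the field meant here is the `passThreshold` attribute of `SpectrumIdentificationItem` in mzIdentML 1.1. The embedded document is a made-up, heavily trimmed fragment, and `validated_psms` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://psidev.info/psi/pi/mzIdentML/1.1"}

# Trimmed, made-up fragment; a real file has many more required elements.
MZID = """<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <SequenceCollection>
    <Peptide id="pep1"><PeptideSequence>PEPTIDEK</PeptideSequence></Peptide>
    <Peptide id="pep2"><PeptideSequence>SAMPLER</PeptideSequence></Peptide>
  </SequenceCollection>
  <DataCollection><AnalysisData><SpectrumIdentificationList id="sil1">
    <SpectrumIdentificationResult id="r1" spectrumID="index=0" spectraData_ref="sd1">
      <SpectrumIdentificationItem id="i1" peptide_ref="pep1" chargeState="2"
        experimentalMassToCharge="464.74" rank="1" passThreshold="true"/>
    </SpectrumIdentificationResult>
    <SpectrumIdentificationResult id="r2" spectrumID="index=1" spectraData_ref="sd1">
      <SpectrumIdentificationItem id="i2" peptide_ref="pep2" chargeState="3"
        experimentalMassToCharge="301.12" rank="1" passThreshold="false"/>
    </SpectrumIdentificationResult>
  </SpectrumIdentificationList></AnalysisData></DataCollection>
</MzIdentML>"""


def validated_psms(xml_text):
    """Return the PSMs whose passThreshold flag is set."""
    root = ET.fromstring(xml_text)
    sequences = {
        peptide.get("id"): peptide.findtext("m:PeptideSequence", namespaces=NS)
        for peptide in root.iterfind(".//m:Peptide", NS)
    }
    psms = []
    for result in root.iterfind(".//m:SpectrumIdentificationResult", NS):
        for item in result.iterfind("m:SpectrumIdentificationItem", NS):
            if item.get("passThreshold") not in ("true", "1"):
                continue  # skip PSMs that were not validated
            psms.append({
                "spectrum_id": result.get("spectrumID"),
                "peptide": sequences[item.get("peptide_ref")],
                "charge": int(item.get("chargeState")),
                "mz": float(item.get("experimentalMassToCharge")),
            })
    return psms


psms = validated_psms(MZID)
```

Note that the retention time is not among these attributes, which matches the earlier remark that RT is not necessarily part of the export of ID software.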

hbarsnes commented 8 years ago

Well, moFF is mainly based on pandas DataFrames, and the field names are case sensitive. I am going to use a mapping function that changes the PS field names into the ones required by moFF. I can use the --input_mzid | --input_raw input just for PS output files. The other input options will stay for all the other manually curated output. What do you think, guys?

If that means that we don't have to change our PeptideShaker tsv export then that sounds like a good solution until mzid can be supported. But I'm not sure how you would use the "--input_mzid" option for the PeptideShaker tsv export? I thought that supporting mzid parsing directly in moFF was problematic?

Yes, the PSM score can be an input of moFF. If you need all this kind of post-processing on the PS output, maybe we have to think about a moFF CLI version just for PS.

If possible I would try to make this generic and not specific to PeptideShaker. As I'm pretty sure the scores will be relevant for most use cases and input options?
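
One way a generic cut-off could look, sketched under the assumption that the caller names the score column and says whether higher or lower scores are better. `filter_by_score` is hypothetical, and the rows and the "E-value" column are made up for illustration.

```python
def filter_by_score(rows, score_column, threshold, higher_is_better=True):
    """Keep the rows whose score passes the threshold.

    Nothing here is specific to one identification tool: the column name
    and the direction of the score are supplied by the caller.
    """
    kept = []
    for row in rows:
        score = float(row[score_column])
        passes = score >= threshold if higher_is_better else score <= threshold
        if passes:
            kept.append(row)
    return kept


# Made-up rows; values are strings, as they would be when read from a text export.
psms = [
    {"peptide": "PEPTIDEK", "Confidence [%]": "99.2", "E-value": "0.0004"},
    {"peptide": "SAMPLER", "Confidence [%]": "41.0", "E-value": "0.31"},
]
by_confidence = filter_by_score(psms, "Confidence [%]", 95.0)
by_evalue = filter_by_score(psms, "E-value", 0.01, higher_is_better=False)
```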

bgruening commented 8 years ago

I must admit that I'm a little bit lost with all the details, but I really appreciate all your effort and I'm looking forward to testing things and integrating them more deeply into Galaxy :)

One additional note/question from our side. Currently we try to convert the raw file as soon as possible into some open standard format (using msconvert) and never touch the raw file again. I had the impression that dealing with raw files is complicated and non-free. Does this world view now belong to the past? Is it really necessary to pass the raw file through the entire analysis?

Thanks!

Thys3Potgieter commented 8 years ago

R has a nice mzid parser library (MSnID) that should be easy to wrap in a Python function using rpy2... If I have some time I'll see if I can get the PSM table into pandas and let you guys know... Thanks for the new tool!! :)

bgruening commented 8 years ago

@Thys3Potgieter is this one not good enough? https://pypi.python.org/pypi/pyteomics

Thys3Potgieter commented 8 years ago

@bgruening Thanks! I will check it out! You just saved me from some work it looks like :)

kverhegg commented 8 years ago

Just a small update, I've started on writing a converter for the mzid format to the MoFF input format. I didn't test it fully yet, but it might help you ! :) The class has a main, so you could just lift the code if necessary ;)

Pladipus - mzid - converter class

Stortebecker commented 7 years ago

@kverhegg @Thys3Potgieter Just wanted to ask if the conversion tool from mzid to MoFF has been tested now. Is it ready to use?

bgruening commented 7 years ago

First of all, congrats on the publication! I would like to get this ready for our Galaxy users now. Is there any progress regarding this issue? Is there a recommended way to pass data from one tool to the other?

Maux82 commented 7 years ago

Hi @bgruening ,

Sorry for my late reply.

From my side, this is the situation: in moff_all.py I have added two options that let the user specify the list of input files and the corresponding list of raw files:

python moff_all.py --inputtsv f1_folder/input_file1.txt f1_folder/input_file2.txt --inputraw f1_folder/input_file1.raw f1_folder/input_file2.raw --output_folder output_moff

In moff.py, I have added an option to run the apex module by just pointing to the input file and its raw file:

python moff.py --inputtsv f1_folder/20080311_CPTAC6_07_6A005.txt --inputraw f1_folder/20080311_CPTAC6_07_6A005.raw --tol 10 --output_folder output_moff
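
For a wrapper (for example a workflow tool), a call like the `moff_all.py` one above can be assembled from a list of file pairs. This is only a sketch: `moff_all_command` is a hypothetical helper, and it uses nothing beyond the flags shown in this comment.

```python
import shlex


def moff_all_command(pairs, output_folder, script="moff_all.py"):
    """Build the argument list for a moff_all.py call from (tsv, raw) pairs."""
    tsv_files = [tsv for tsv, _ in pairs]
    raw_files = [raw for _, raw in pairs]
    # Both lists keep the order of the pairs, so the i-th tsv matches the i-th raw.
    return [
        "python", script,
        "--inputtsv", *tsv_files,
        "--inputraw", *raw_files,
        "--output_folder", output_folder,
    ]


command = moff_all_command(
    [
        ("f1_folder/input_file1.txt", "f1_folder/input_file1.raw"),
        ("f1_folder/input_file2.txt", "f1_folder/input_file2.raw"),
    ],
    "output_moff",
)
command_line = shlex.join(command)
```

The list form can be handed to `subprocess.run(command, check=True)` without any shell quoting issues.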

About passing data from PeptideShaker to moFF: I propose to export the result using one of the default exports (at the moment I do not remember the name by heart), and then I can change the field names so that they work in moFF. This preprocessing should be done inside moFF without using any extra software.

What do you think, @mvaudel and @hbarsnes?

I also have another question: is the RT exported and used in PeptideShaker always in seconds or not?

mvaudel commented 7 years ago

Hi,

Great to see this moving forward!

I also have another question: is the RT exported and used in PeptideShaker always in seconds or not?

By default there is no RT in PeptideShaker, unless provided in the mgf file. Then the unit is the one provided in the mgf file. I would not rely on this metric, but rather do the following:

1- Use the MS2 spectrum identifier to find it back in the raw file
2- Use the coordinates of the corresponding precursor to find the feature.
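
The two steps can be sketched as a lookup from spectrum identifier to precursor coordinates. No raw file is read here: `Ms2Scan` is a hypothetical stand-in for whatever a raw-file reader would return, the scan number is assumed to be the spectrum identifier, and all values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ms2Scan:
    """Hypothetical record for one MS2 scan taken from the raw file."""
    scan_number: int
    rt_seconds: float
    precursor_mz: float


def locate_precursors(psms, scans):
    """Attach the rt and mz of the matching MS2 scan to each PSM."""
    by_number = {scan.scan_number: scan for scan in scans}
    located = []
    for psm in psms:
        scan = by_number.get(psm["scan"])
        if scan is None:
            raise KeyError(f"scan {psm['scan']} not found in the raw file index")
        # Step 1: the identifier finds the scan. Step 2: its precursor
        # coordinates are what a feature finder would start from.
        located.append({**psm, "rt": scan.rt_seconds, "mz": scan.precursor_mz})
    return located


scans = [Ms2Scan(1501, 1810.2, 500.10), Ms2Scan(1502, 1811.0, 464.74)]
located = locate_precursors([{"peptide": "PEPTIDEK", "scan": 1502}], scans)
```

Because the coordinates come from the raw file itself, the unit of the retention time no longer depends on what the identification tool exported.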

Does it make sense?

Hope this helps!

Marc

bgruening commented 7 years ago

@Maux82 sounds great to me! Thanks for working on this!!!!

hbarsnes commented 7 years ago

About passing data from PeptideShaker to moFF: I propose to export the result using one of the default exports (at the moment I do not remember the name by heart), and then I can change the field names so that they work in moFF. This preprocessing should be done inside moFF without using any extra software.

Sounds good to me. :)

Maux82 commented 7 years ago

The latest version of moFF can take PS exported files as input, using the default PSM export. I have tested this new function on Windows but not completely on Linux, though I do not expect many problems.

@mvaudel, @bgruening: I assume that the default PSM export in PS contains these fields: ['row index','Protein(s)','Sequence','Variable Modifications','Fixed Modifications','Spectrum File','Spectrum Title','Spectrum Scan Number','RT','m/z','Measured Charge','Identification Charge','Theoretical Mass','Isotope Number','Precursor m/z, Error [ppm]','Localization Confidence','Probabilistic PTM score','D-score','Confidence [%]','Validation']

Is that correct? The row index in the export is a default option, right?
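
Since the import relies on an assumed set of headers, a cheap safeguard is to check the header row before processing. This is a sketch, not moFF code: `REQUIRED_PS_FIELDS` is an assumption built from the six columns that the renaming list earlier in this thread maps to moFF names, and `missing_fields` is a hypothetical helper.

```python
# Assumption: only these six export columns are needed by moFF,
# based on the renaming list earlier in this thread.
REQUIRED_PS_FIELDS = [
    "Sequence",
    "Protein(s)",
    "RT",
    "m/z",
    "Theoretical Mass",
    "Identification Charge",
]


def missing_fields(header, required=REQUIRED_PS_FIELDS):
    """Return the required column names that are absent from a header row."""
    present = set(header)  # exact, case-sensitive comparison
    return [name for name in required if name not in present]


complete = missing_fields(
    ["Protein(s)", "Sequence", "RT", "m/z", "Identification Charge", "Theoretical Mass", "Validation"]
)
incomplete = missing_fields(["Sequence", "rt", "m/z"])
```

The second call reports "RT" as missing even though "rt" is present, which is the case-sensitivity issue mentioned earlier.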

bgruening commented 7 years ago

@Maux82 great news. Do you plan any further updates in the next week? If not, I would proceed and create conda packages and Docker containers for it.

Maux82 commented 7 years ago

The next big update planned is the integration with the new Thermo multi-platform library for the raw file. I expect that this will happen in two or three weeks. Maybe we can wait a bit in case @mvaudel or @hbarsnes have comments or want to test the integration with PS more extensively.

mvaudel commented 7 years ago

@Maux82 Great, thanks, that was fast! The default PSM export is something static so the one you used should stay the same. Alternatively, we can create a dedicated MOFF export and even make a MOFF command line option, just tell us what fields you need :)

Maux82 commented 7 years ago

@mvaudel: The standard PSM export is fine so far. In case something changes, I have stored the field names in the properties file, so it is easy to update them in the future.
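
A sketch of keeping the field names in a configuration file rather than in code. The section name and layout below are assumptions for illustration and do not reproduce moFF's actual properties file; `load_field_map` is a hypothetical helper.

```python
import configparser

# Hypothetical layout: moFF name = PeptideShaker export header.
PROPERTIES = """
[ps_default_export]
peptide = Sequence
prot = Protein(s)
rt = RT
mz = m/z
mass = Theoretical Mass
charge = Identification Charge
"""


def load_field_map(text, section="ps_default_export"):
    """Read the mapping and return it as {PeptideShaker header: moFF name}."""
    # interpolation=None keeps headers containing '%' from being misread.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # do not lowercase option names
    parser.read_string(text)
    return {ps_name: moff_name for moff_name, ps_name in parser[section].items()}


field_map = load_field_map(PROPERTIES)
```

If the export headers change, only the file needs editing. Putting the moFF names on the left keeps characters such as ":" or "=" in an export header from being mistaken for the key delimiter.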

Maux82 commented 7 years ago

@bgruening The current version on the master branch is now also multi-threaded. I have tested it a lot and it seems pretty stable. On the multipr_rawfile branch there is a version of moFF based on the new rawfilereader library from Thermo. This is still multi-platform, but I still have some issues to fix. Note that this version needs Mono to run on Linux!