MzIdentML Converter Modifications

sureshhewabi commented 1 month ago

We need this library to expose each piece of functionality through command-line options, including:

1. Validation of crosslinking mzIdentML (mzID) files. https://github.com/PRIDE-Archive/xi-mzidentml-converter/issues/78
2. Command-line support. https://github.com/PRIDE-Archive/xi-mzidentml-converter/issues/80
3. Data generation for PDBDev reports. https://github.com/PRIDE-Archive/xi-mzidentml-converter/issues/79

colin-combe commented 1 month ago

Hi @sureshhewabi, @aozalevsky, @ypriverol

We already have command-line support in https://github.com/PRIDE-Archive/xi-mzidentml-converter/blob/python3/parser/process_dataset.py, using argparse from the Python standard library.

Perhaps click is better; its documentation has a section explaining why it is not based on argparse.

Anyway, there are multiple solutions for making the command line interface. We can use click if it seems the best.
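For reference, here is a minimal sketch of what one subcommand per task could look like with argparse. The program name, subcommand names and options are all hypothetical; this is not the interface in process_dataset.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI with one subcommand per task (all names are hypothetical)."""
    parser = argparse.ArgumentParser(prog="mzid-tool")  # hypothetical program name
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Subcommand for checking a file without writing anything anywhere.
    validate = subparsers.add_parser("validate", help="check a crosslinking mzIdentML file")
    validate.add_argument("mzid_file", help="path to the mzIdentML file")

    # Subcommand for producing the data needed by a report.
    report = subparsers.add_parser("report", help="emit sequences and residue pairs")
    report.add_argument("mzid_file", help="path to the mzIdentML file")
    report.add_argument("--output", default="-", help="output path, '-' for stdout")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["validate", "example.mzid"])
print(args.command, args.mzid_file)
```

click would express the same layout with decorated functions instead of subparsers, so the choice is mostly about style and dependencies.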

I think validation will consist mainly of running the parser and seeing if it works or not. But it will need to be modified so it doesn't try to write stuff anywhere. Also, we can improve its error messages so we know why it failed.
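A rough sketch of that idea, assuming the parser can be handed a writer object: validation runs the parser against a writer that discards everything and turns any exception into a readable message. `NullWriter`, `ValidationResult` and the `parse(path, writer)` signature are hypothetical, not the converter's current API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class NullWriter:
    """Hypothetical writer that discards everything, so validation has no side effects."""

    def write(self, table: str, rows: list) -> None:
        pass


@dataclass
class ValidationResult:
    ok: bool
    errors: List[str] = field(default_factory=list)


def validate(path: str, parse: Callable[[str, NullWriter], None]) -> ValidationResult:
    """Run `parse` with a writer that stores nothing and report why it failed, if it did."""
    try:
        parse(path, NullWriter())
    except Exception as exc:  # broad on purpose: any parser failure means "invalid"
        return ValidationResult(ok=False, errors=[f"{type(exc).__name__}: {exc}"])
    return ValidationResult(ok=True)
```

Keeping the exception type in the message is a cheap first step towards the clearer error messages mentioned above.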

Let's think more about 'Data Generation for PDBDev reports':

What information do you want to get back, and in what format?
How do you want to call it? Is this a case of using it like a library, i.e. as a dependency of the code that generates the reports, rather than calling it on the command line? (Is the code that generates the reports in Python?)

cheers, C

colin-combe commented 1 month ago

Also, IMP may have a need to extract crosslinking data from mzIdentML files? This might be related?

FYI, our converter is based on the pyteomics library. It adds a way of getting crosslink info from what's returned by pyteomics; it's not a 'from scratch' implementation of mzIdentML parsing.
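To illustrate the layering (this is not the converter's actual code), here is a sketch that groups the identification items pyteomics returns by their crosslink identifier. It assumes the mzIdentML 1.2 CV term "cross-link spectrum identification item" shows up as a key of each item dict; the helper names are hypothetical.

```python
from collections import defaultdict

# Name of the PSI-MS CV term that mzIdentML 1.2 uses to tie together the
# identification items of one crosslink; assumed to appear as a dict key.
XL_KEY = "cross-link spectrum identification item"


def pair_crosslinked_items(results):
    """Group SpectrumIdentificationItem dicts that share a crosslink identifier."""
    groups = defaultdict(list)
    for result in results:
        for item in result.get("SpectrumIdentificationItem", []):
            if XL_KEY in item:  # linear (non-crosslinked) matches are skipped
                groups[item[XL_KEY]].append(item)
    return dict(groups)


def read_crosslinks(path):
    """Read a file with pyteomics and group its crosslinked items."""
    from pyteomics import mzid  # imported here so the helper above has no dependency

    with mzid.read(path) as reader:
        return pair_crosslinked_items(reader)
```

Because the grouping helper only sees plain dicts, it can be tested without an mzIdentML file.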

aozalevsky commented 1 month ago

@colin-combe Ideally, I'd like to get an output similar to the current API output. Basically, we need sequences (some ID + sequence) + residue pairs. Keeping the JSON-formatted output would be nice, too.

Calling (import + call) as a library would be ideal, but making a subprocess CLI call is also acceptable.
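Something along these lines would be enough on our side. The field names are hypothetical and are not meant to match the current API output exactly; the sketch only shows the two pieces of data (sequences and residue pairs) serialised as JSON.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Sequence:
    id: str        # some identifier for the protein or peptide sequence
    sequence: str  # the residues as a one-letter string


@dataclass
class ResiduePair:
    seq1_id: str  # id of the sequence holding the first linked residue
    pos1: int     # position of that residue in the sequence
    seq2_id: str
    pos2: int


def to_json(sequences: List[Sequence], pairs: List[ResiduePair]) -> str:
    """Serialise sequences and residue pairs into one JSON document."""
    payload = {
        "sequences": [asdict(s) for s in sequences],
        "residue_pairs": [asdict(p) for p in pairs],
    }
    return json.dumps(payload, indent=2)
```

With a library call we would use the objects directly; with a subprocess call we would read the same JSON from stdout.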

colin-combe commented 1 month ago

> Calling (import + call) as a library would be ideal

Yes, I think that's better. And you were totally right with what you said in the meeting about there being several benefits to it being like this (not just a way of addressing the private submission questions). It was never deliberately not a library.

Anyway, I'll take a look at this next week, cheers, C

ypriverol commented 1 month ago

Validation is the priority, followed by the data structure and the JSON for PDBDev reports. We have to test the validation on the command line and create some documentation for users who want to start testing their dataset files. @sureshhewabi it would probably be good to have a separate issue for this and link it to this one.

sureshhewabi commented 1 month ago

Thanks everyone. As we discussed in the meeting yesterday, let's create separate issues for the separate tasks and then divide the tasks among us. We can also keep this as the main issue that links to the other tasks, so we can track the progress.

aozalevsky commented 1 month ago

> Also, IMP may have a need to extract crosslinking data from mzIdentML files? This might be related?

I had a chat with Ben, the main IMP developer in our lab. He agreed it would be a neat addition to the current functionality (dealing with csv/xls lists).

colin-combe commented 1 month ago

I updated https://github.com/PRIDE-Archive/xi-mzidentml-converter/issues/79 and https://github.com/PRIDE-Archive/xi-mzidentml-converter/issues/78 to reflect the status of the version in PR https://github.com/PRIDE-Archive/xi-mzidentml-converter/pull/84

Any comments on how to better organise/structure the main process_dataset.py file are welcome (or just general Python style stuff).

MzIdentML Converter Modifications #77