Jhsmit / PyHDX

Derive ΔG for single residues from HDX-MS data
http://pyhdx.readthedocs.io
MIT License

HDExaminer data support #349

Open Jhsmit opened 2 weeks ago

Jhsmit commented 2 weeks ago

PyHDX currently only directly accepts data formatted as 'state data' output from DynamX.

This issue continues the discussion opened by @tuttlelm in #348:

Related to coming from HDExaminer data (and I can open a separate issue for that topic if that would be more appropriate), pyHDX currently does not allow duplicate measurements when creating the HDXMeasurement object. As far as I can tell, having replicates isn't an issue for any of the downstream calculations, but I wondered if you had thoughts on that. I was able to make some simple modifications to models.py so that I can leave replicates in my data and not have to average the replicates first (basically just data.reset_index() in the __init__() function, and adding "index" as a column wherever you sort or pivot on the columns).
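As a rough illustration of why duplicate measurements get in the way, and why exposing the row number as an "index" column helps, here is a minimal pandas sketch. It is not the actual change to models.py; the table and its column names (start, end, exposure, uptake) are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical peptide table: one peptide, two exposures, two replicates each.
# Column names are illustrative and not taken from PyHDX.
data = pd.DataFrame({
    "start": [1, 1, 1, 1],
    "end": [10, 10, 10, 10],
    "exposure": [10.0, 10.0, 60.0, 60.0],
    "uptake": [1.0, 1.2, 2.0, 2.4],
})

# Pivoting on peptide and exposure alone fails, because each
# (start, end) / exposure combination occurs twice.
try:
    data.pivot(index=["start", "end"], columns="exposure", values="uptake")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# reset_index() exposes the row number as an "index" column. Including it
# in the pivot makes every row unique, so the replicates are kept.
indexed = data.reset_index()
wide = indexed.pivot(
    index=["index", "start", "end"], columns="exposure", values="uptake"
)
```

In this sketch every replicate ends up on its own row of the pivoted table, with NaN for the exposures it does not cover.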

It would be great to add support for other file formats such as HDExaminer data.

A couple of questions: Why would you prefer to leave the replicates in the data rather than average them before creating the HDXMeasurement object? Do you want to perform downstream calculations on each replicate individually?

In the latter case, would it make sense to make one HDXMeasurement object per replicate?

Perhaps you could share your input script or make a pull request with your changes to models.py? To be honest I think that the current HDXMeasurement object has become a bit of a clumsy thing to work with at the moment. I'm planning to change it in the future (probably in the form of a different project altogether).

There is also the hdxms-datasets package, which is still in a beta phase. Maybe you can also share your thoughts on this. The idea there is a dataset format with a .yaml specification containing all required metadata, such that downstream packages like PyHDX can load data from there directly. Ultimately, it would be nice to add support there for 1) cluster data (replicates) 2) HDExaminer output 3) other formats.

Again, only DynamX state data is currently supported there as well, simply because that's the only example data I have at the moment. Do you have any example datasets of HDExaminer data you can share and/or example scripts of how you load the data?

Arthanis58 commented 2 weeks ago

Hello, I would also welcome HDExaminer support. However, as I am not good with Python, I am keeping to the web GUI, and it would be very helpful if I could input the exposure times in seconds. Is there any way to do that now with the batch .yaml file definitions?

tuttlelm commented 2 weeks ago

I have created a pull request that includes the models.py changes and an additional script convert_data.py for the HDExaminer conversion.

One main reason for keeping replicates is that we use that type of data for other HDX-MS statistical analysis packages, so it is nice to be able to work with the same original data file for different applications. Leaving the data as replicates does tend to inflate the coverage plots, but I appreciate being able to see any replicate-to-replicate variability there (obviously there are other ways to do this as well). My preference is to keep the replicates within a single HDXMeasurement object. The per-residue calculations take care of the replicate averaging.

Is the hdxms-datasets project for the raw data or just the analysis outputs? I'd be very interested in something that can translate between different raw data formats and metadata specifications. I have the opposite problem from you in my pyHXExpress project, in that I have access only to HDExaminer outputs and not so much to DynamX-type data and outputs.

Currently all of the HDExaminer outputs I am working with are for unpublished projects, but I'll see if I can track down something I am able to share.

Jhsmit commented 1 week ago

With respect to hdxms-datasets: at the moment the scope is output in the form of peptide d-uptake tables. The current format averages replicates together, but preferably the format would support keeping the replicates and let downstream software decide how to treat them. This way, statistical testing can still be done on the datasets.
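To illustrate what "let downstream software decide" could look like, here is a minimal pandas sketch that takes a table with replicates kept and reduces it to a mean, standard deviation and replicate count per peptide and exposure. The column names are hypothetical and this is not the hdxms-datasets format or API.

```python
import pandas as pd

# Hypothetical d-uptake table with replicates kept (illustrative columns).
replicates = pd.DataFrame({
    "start": [1, 1, 1, 1],
    "end": [10, 10, 10, 10],
    "exposure": [10.0, 10.0, 60.0, 60.0],
    "uptake": [1.0, 1.2, 2.0, 2.4],
})

# One possible downstream choice: average the replicates, but keep the
# spread and the replicate count so statistical tests remain possible.
summary = (
    replicates
    .groupby(["start", "end", "exposure"])["uptake"]
    .agg(["mean", "std", "count"])
    .reset_index()
)
```

A package that only needs averaged values could read the mean column, while a statistics package could work from the unreduced table instead.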

The file format doesn't have to be the same for every dataset: there can be DynamX-formatted peptide output files or HDExaminer-formatted output files, as long as the metadata specifies which format it is. A reader function can then take that metadata and read the tables according to the format used.

Ideally, users should also agree on the field names of the returned dataframes: e.g. is it 'time', 'exposure' or 'exposure_time' (and in which units); 'd-uptake' or 'uptake'; should there be an 'm0' field, etc.
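The reader idea could be sketched as a small dispatch table keyed on a format entry in the metadata. Everything here is an assumption for illustration: the "format" key, the function names and the placeholder readers are hypothetical and do not describe the real DynamX or HDExaminer file layouts or any existing hdxms-datasets function.

```python
import io

import pandas as pd


def read_dynamx(source):
    """Placeholder reader; a real one would handle the DynamX column layout."""
    return pd.read_csv(source)


def read_hdexaminer(source):
    """Placeholder reader; a real one would handle the HDExaminer column layout."""
    return pd.read_csv(source)


# Hypothetical registry mapping the metadata's format name to a reader.
READERS = {"DynamX": read_dynamx, "HDExaminer": read_hdexaminer}


def read_peptide_table(metadata, source):
    """Pick a reader based on metadata["format"] and return a dataframe."""
    fmt = metadata["format"]
    try:
        reader = READERS[fmt]
    except KeyError:
        raise ValueError(f"Unsupported format: {fmt!r}") from None
    return reader(source)


# Usage with an in-memory stand-in for a data file.
metadata = {"format": "HDExaminer"}
table = read_peptide_table(metadata, io.StringIO("start,end,uptake\n1,10,1.5\n"))
```

Supporting a new format would then mean adding one reader function and one registry entry, without touching the code that consumes the dataframes.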
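One way such an agreement could be applied in practice is a small normalization step after reading. In the sketch below, the choice of 'exposure' and 'uptake' as canonical names and of seconds as the canonical unit is purely an assumption for the example; the thread does not settle on any of these.

```python
import pandas as pd

# Hypothetical alias table: alternative column names mapped to one agreed name.
ALIASES = {"time": "exposure", "exposure_time": "exposure", "d-uptake": "uptake"}

# Conversion factors from the input time unit to seconds.
TO_SECONDS = {"s": 1.0, "min": 60.0, "h": 3600.0}


def normalize(df, time_unit):
    """Rename known aliases and convert exposure times to seconds."""
    out = df.rename(columns=ALIASES)  # rename returns a new dataframe
    out["exposure"] = out["exposure"] * TO_SECONDS[time_unit]
    return out


# Usage: a table with exposure times in minutes and a 'd-uptake' column.
raw = pd.DataFrame({"exposure_time": [0.5, 1.0], "d-uptake": [1.0, 2.0]})
tidy = normalize(raw, time_unit="min")
```

Keeping the unit as explicit metadata, rather than guessing it from the values, would also cover the case of exposure times supplied in seconds.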