Roestlab / dia-pasef

Scripts for DIA-Pasef
5 stars 12 forks source link

Feature/targeed data extraction #35

Closed singjc closed 2 years ago

singjc commented 2 years ago

Changes

Example

diapysef --help
Found Bruker sdk. Access to the raw data is possible. 

Usage: diapysef [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...

  Mobi-DIK (Ion Mobility DIA Tool-Kit) is a package for analysis of DIA data
  coupled to ion mobility.

  Visit http://openswath.org/en/latest/docs/mobi-dik.html for usage
  instructions and help

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  converttdftomzml     Conversion program to convert a Bruker TIMS .d...
  export               Export a reduced targeted mzML file to a tsv file
  prepare-coordinates  Generate peptide coordinates for targeted...
  report               Generate a report for a specfific type of plot
  targeted-extraction  Extract from the raw data given a set of target...

Conversion of TDF to mzML

Conversion from TDF to mzML is added to the click CLI, functionality remains the same. ``` diapysef converttdftomzml --help Found Bruker sdk. Access to the raw data is possible. Usage: diapysef converttdftomzml [OPTIONS] Conversion program to convert a Bruker TIMS .d data file to mzML format Options: --in PATH The location of the directory containing raw data (usually .d). [required] --out TEXT The name of the output file (mzML). [required] --merge INTEGER Number of consecutive frames to sum up (squash). This is useful to boost S/N if exactly repeated frames are measured. [default: -1] --keep_frames / --no-keep_frames Whether to store frames exactly as measured or split them into individual spectra by precursor isolation window (default is to split them - this is almost always what you want). [default: no-keep_frames] --verbose INTEGER Verbosity. [default: -1] --overlap INTEGER How many overlapping windows were recorded for the same m/z window - will split the output into N output files. [default: -1] --framerange TEXT The minimum and maximum Frames to convert. Useful to only convert a part of a file. [default: [-1, -1]] --help Show this message and exit. ```

Targeted Data Extraction

Generating peptide coordinates for targeted raw data extraction

**NOTE**: We specify run_id if we pass a merged osw, so that we get coordinates for a specific run with targeted RT and IM identification coordinates ``` python -m diapysef.main prepare-coordinates --in merged.osw --out peptides_coord_ex.pkl --run_id 5500589384113116496 --target_peptides '["T(UniMod:21)ELISVSEVHPSR", "TELIS(UniMod:21)VSEVHPSR"]' Bruker sdk not found. Some functionalities that need access to raw data will not be available. To activate that functionality place libtimsdata.so (Linux) or timsdata.dll in the src folder. [2022-09-06 11:16:26] INFO: Generating coordinates... [2022-09-06 11:16:26] INFO: Finished generating coordinates! ``` If you want to manually generate a peptide coordinate dictionary, it should look something like the following: ``` peptides = { 'T(UniMod:21)ELISVSEVHPSR_2': { 'peptide': 'T(UniMod:21)ELISVSEVHPSR', 'precursor_mz': 767.3691, 'charge': 2, 'rt_apex': 1730.08, 'im_apex': 1.026132868499893, 'qvalue': 0.0, 'product_mz': [496.2627, 811.4057, 910.4741, 997.5061, 1110.5902, 1223.6743], 'product_charge': [1, 1, 1, 1, 1, 1], 'product_annotation': ['y4^1', 'y7^1', 'y8^1', 'y9^1', 'y10^1', 'y11^1'], 'product_detecting': [1, 1, 1, 1, 1, 1], 'rt_boundaries': [1718.036865234375, 1751.983642578125]}, 'TELIS(UniMod:21)VSEVHPSR_2': { 'peptide': 'TELIS(UniMod:21)VSEVHPSR', 'precursor_mz': 767.3691, 'charge': 2, 'rt_apex': 1905.32, 'im_apex': 1.018710764387254, 'qvalue': 5.231105591576423e-08, 'product_mz': [344.1816, 359.2037, 724.3737, 811.4057, 910.4741, 1077.4725], 'product_charge': [1, 1, 1, 1, 1, 1], 'product_annotation': ['b3^1', 'y3^1', 'y6^1', 'y7^1', 'y8^1', 'y9^1'], 'product_detecting': [1, 1, 1, 1, 1, 1], 'rt_boundaries': [1889.531494140625, 1918.104248046875]}, 'TELIS(UniMod:21)VSEVHPSR_3': { 'peptide': 'TELIS(UniMod:21)VSEVHPSR', 'precursor_mz': 511.9151, 'charge': 3, 'rt_apex': 1932.09, 'im_apex': 0.819074213225268, 'qvalue': 0.013292880776271469, 'product_mz': [359.2037, 496.2627, 595.3311, 811.4057, 1077.4725, 1303.6406], 'product_charge': [1, 1, 1, 1, 1, 1], 'product_annotation': ['y3^1', 'y4^1', 'y5^1', 'y7^1', 'y9^1', 'y11^1'], 'product_detecting': [1, 1, 1, 1, 1, 1], 'rt_boundaries': [1917.89404296875, 1953.622314453125]} } ``` **Note:** we separate peptide precursors Not all the values are necessary, however, the ones that are needed are: **peptide**, **precursor_mz**, **charge**, **rt_apex**, **im_apex**, **product_mz**. You can save the dictionary to a pickle file using the following code: ``` import pickle with open(f"peptides.pkl", "wb") as output_file: pickle.dump(peptides, file=output_file, ) ```

Targeted Extraction of the Raw diaPASEF mzML data

To reduce the raw diaPASEF data, for visualization or for preliminary algorithm development, you can perform a targeted extraction of the data to reduce the data given target coordinates ``` python -m diapysef.main targeted-extraction --in IPP_M10_DIA-PaSEF_60min_Bruker10_400nL_1ul-inj-redo2_Slot2-25_1_2151.mzML --coords peptides_coord_ex.pkl --verbose 0 --mslevel [1,2] --mz_tol 20 --rt_window 50 Bruker sdk not found. Some functionalities that need access to raw data will not be available. To activate that functionality place libtimsdata.so (Linux) or timsdata.dll in the src folder. [2022-09-07 12:07:19] INFO: Loading data... [2022-09-07 12:07:46] INFO: Reducing spectra using targeted coordinates... INFO: Processing..TELIS(UniMod:21)VSEVHPSR_3: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:13<00:00, 24.35s/it] [2022-09-07 12:09:00] INFO: Finished extracting targeted spectra! ```

Exporting reduced targeted mzML for easier data manipulation and plotting

We can export the reduced mzML to a tsv file with m/z, retention time, ion mobility and intensity data as along tsv file ``` python -m diapysef.main export --in targed_data_extraction.mzML --out extracted_data.tsv --mslevel [1,2] --verbose 10 --log_file export.log Bruker sdk not found. Some functionalities that need access to raw data will not be available. To activate that functionality place libtimsdata.so (Linux) or timsdata.dll in the src folder. [2022-09-07 12:09:34] INFO: Loading data... 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5460/5460 [00:00<00:00, 6816.65it/s] [2022-09-07 12:09:36] INFO: Finished exporting data! ```

Generating a report of RT and IM Heatmap plots

We can generate a 2D heatmap of the data using the report module. The current implementation is a quick plotting implementation in matplotlib. You could use other plotting tools libraries to make nice plots if you want, using the extracted_data.tsv file. ``` python -m diapysef.main report --in extracted_data.tsv --out diapasef_report_rt_im.pdf Bruker sdk not found. Some functionalities that need access to raw data will not be available. To activate that functionality place libtimsdata.so (Linux) or timsdata.dll in the src folder. [2022-09-07 12:13:14] INFO: Generating a report of plots for a Retention Time and Ion Mobility Heatmaps... INFO: Plotting..TELIS(UniMod:21)VSEVHPSR_3: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.16it/s] [2022-09-07 12:13:16] INFO: Finished generating report! ``` ![image](https://user-images.githubusercontent.com/32938975/189409072-4b3e8c90-55b5-4a86-a8e0-11f7d76e6719.png)

Docker image available

I have a docker image available on docker hub singjust/modibik

Using the image

``` $ docker run -it --rm -v $PWD:/data/ singjust/mobidik:latest diapysef --help Found Bruker sdk. Access to the raw data is possible. Usage: diapysef [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]... Mobi-DIK (Ion Mobility DIA Tool-Kit) is a package for analysis of DIA data coupled to ion mobility. Visit http://openswath.org/en/latest/docs/mobi-dik.html for usage instructions and help Options: --version Show the version and exit. --help Show this message and exit. Commands: converttdftomzml Conversion program to convert a Bruker TIMS .d... export Export a reduced targeted mzML file to a tsv file prepare-coordinates Generate peptide coordinates for targeted... report Generate a report for a specfific type of plot targeted-extraction Extract from the raw data given a set of target... ```

Singularity

I tried converting the docker image to a singularity image, which generates a singularity image successfully. However, there is some downstream/incompatible issue where the singularity image fails to find pyOpenMS libraries. ``` singularity build mobidik.simg docker://singjust/mobidik:latest Docker image path: index.docker.io/singjust/mobidik:latest Cache folder set to /home/justincsing/.singularity/docker [8/8] |===================================| 100.0% Importing: base Singularity environment Exploding layer: sha256:1339eaac5b67d16d6d9f41fb7a7b96f7cebf3ba4beab36cbb60935aa772af583.tar.gz Exploding layer: sha256:4c78fa1b97999d08408734a61040475ade5bd7e33e91c0d5170dba2c7c7a92fd.tar.gz Exploding layer: sha256:14f0d2bd524377dc42d072443c0e5e7cafa14f5df609d39bb1f717f43817a2cd.tar.gz Exploding layer: sha256:76e5964a957d206950c8c0de99f3c491ecec78887ebe4df0ac5ab9ceb536a4d5.tar.gz Exploding layer: sha256:cc4bb1a04a94a9015f79b0d36ee942b63bd486da0ef79689d4326398b561fa3a.tar.gz Exploding layer: sha256:3dee34a94cb0bff32985a098419d9017f8eb56b546004d176ae6fa8d247615ea.tar.gz Exploding layer: sha256:f8a5b50326b34ceb71d95c269c0e7fb60eee4b83f8c3ffb57305d903f5779a37.tar.gz Exploding layer: sha256:b175f117d5f818879403986af84146710f0f9a7127e3aba399fed6377d87970a.tar.gz Exploding layer: sha256:360ea8bc76bec7e00e450f397ef44e29ed1128cbb76ecb152bab39a48aa52f36.tar.gz Exploding layer: sha256:6b10c3510b977304b24bcec7eb6aaa8d118de9090da6052914b9785886ec0d24.tar.gz Exploding layer: sha256:6a68964267807195ee705bc494af39ecb0e13c2db2bfb433cded056ebaf087e6.tar.gz Exploding layer: sha256:139ad97f2110cc434c5f5e90fb1a4e9db8555ab04169be2b76362d547d3b9bc0.tar.gz Exploding layer: sha256:c8c8546786800e4efdb0586595135930b6454baf9547e6b7bdbcff83267ddcd1.tar.gz Exploding layer: sha256:f46e0f4c3a97c95c12236beaceb94be8135e74a2eb517ef26065637d29319dcc.tar.gz Exploding layer: sha256:139869ec766b147c8ffd246b5f4c85c0f8751b24533de55d64da9abced0666a2.tar.gz Exploding layer: sha256:502356b82cee79aaacbf999cb24ed5987f153f1f5afd1df7583ad241cda95819.tar.gz Exploding layer: sha256:bf9d320aa0550086e300f3f368e192836fa09325d879aae5d406b7385c3c8c14.tar.gz Exploding layer: sha256:f1fb06b60008d81507bd6e099fecbceef2c9245a5a0303f5ae5567513772a59c.tar.gz WARNING: Building container as an unprivileged user. If you run this container as root WARNING: it may be missing some functionality. Building Singularity image... Singularity container built: mobidik.simg Cleaning up... ```

Some issues I ran into using pyopenms during implementation

Error pickling MSSpectrum

If we try to implement paralleization, we need to fix the pickling issue below: ``` >>> spec >>> try: ... with open("test_spec.pkl", "wb") as file: pickle.dump(spec, file) ... except: ... traceback.print_exc(file=sys.stdout) ... Traceback (most recent call last): File "", line 2, in File "stringsource", line 2, in pyopenms.pyopenms_1.MSSpectrum.__reduce_cython__ TypeError: self.inst cannot be converted to a Python object for pickling ``` There actually seems to be a helper for pickle, but then why is it still failing? ``` >>> help(spec.__reduce_ex__ ... ) Help on built-in function __reduce_ex__: __reduce_ex__(...) method of pyopenms.pyopenms_1.MSSpectrum instance helper for pickle ```

Error setting Floating Data Arrays

There seems to be an issue when setting the Floating Data Arrays. The resulting set FloatingDataArray does not match the original input floating data array for a few values ``` >>> fda = po.FloatDataArray() >>> filtered_im_np = np.array(filtered_im).astype(np.float32) >>> fda.set_data(filtered_im_np) >>> fda.setName("Ion Mobility") >>> # TODO: There currently is an issue when setting float data array. Getting Float Data Array does not match input >>> spec.setFloatDataArrays([fda]) >>> fda.get_data() array([1.0196674 , 1.0264589 , 1.0287174 , 0.9981177 , 0.9992461 , 1.0003752 , 1.0015049 , 1.0026352 , 1.0037968 , 1.0049285 , 1.0071937 , 1.0094614 , 1.0105963 , 1.0117317 , 1.0140047 , 1.0151113 , 1.0162494 , 1.0173881 , 1.0185274 , 1.0230603 , 1.0298631 , 1.0389278 , 0.99698985, 1.0060607 , 1.0083272 , 1.0128679 , 1.025315 , 1.0355394 , 1.0378081 ], dtype=float32) >>> spec.getFloatDataArrays()[0].get_data() array([-2.3517542e-12, 4.5780421e-41, -3.5266161e-01, 3.0775317e-41, 9.9924612e-01, 1.0003752e+00, 1.0015049e+00, 1.0026352e+00, 1.0037968e+00, 1.0049285e+00, 1.0071937e+00, 1.0094614e+00, 1.0105963e+00, 1.0117317e+00, 1.0140047e+00, 1.0151113e+00, 1.0162494e+00, 1.0173881e+00, 1.0185274e+00, 1.0230603e+00, 1.0298631e+00, 1.0389278e+00, 9.9698985e-01, 1.0060607e+00, 1.0083272e+00, 1.0128679e+00, 1.0253150e+00, 1.0355394e+00, 1.0378081e+00], dtype=float32) ``` **Problem Solved** See [Issue 5862](https://github.com/OpenMS/OpenMS/issues/5862)
singjc commented 2 years ago

@jcharkow @alhigaylan can you guys try and test this out

jcharkow commented 2 years ago

I fixed version dependencies for me @singjc please let me know if this messes anything up on your end.

singjc commented 2 years ago

@jcharkow Great thanks for the fix! Seems to still work on my end in a new environment (python 3.9). I do run into issues using a clean environment using python 3.10, specifically with pyopenms installation version issues though. Which is a separate issue.

Thanks for the singularity definition!

singjc commented 2 years ago

I added the option of processing caches mzMLs that uses SpectrumAccessOpenMSCached OSSpectrum, but it actually seems slower (100 fold) than the OnDiskExperiment MSSpectrum. :thinking:

image

I compared the OnDiskExperiment to the SpectrumAccessOpenMSCached on only MS1 level data, MS1 filtered mzML for the former and an MS1 cached mzML for the latter. I tested using 54 precursors, and extract 1208 spectra. Each point in the plot represents the time in milliseconds (measured by pythons time.time()) it takes to extract the m/z, intensity and ion mobility arrays from the corresponding Spectrum object for a single spectrum.

However, the OnDiskExperiment does take longer to load the data and meta-data initially, 24.7840 sec, while the SpectrumAccessOpenMSCached only takes 0.2491 sec to load the data and meta-data.

Overall, they are pretty close in execution time for reducing the spectra for 54 precursors, 92.5028 sec for OnDisk and 102.7164 sec for Cache.

singjc commented 2 years ago

Cached spectra processing is now 10 fold faster than ondisk, using hroest/OpenMS/tree/feature/drift_time_os_spec_2

image

It also now only takes 21.0601 sec to reduce the spectra from cache, vs 74.5139 sec to reduce the spectra from ondisk

On Disk Experiment

diapysef targeted-extraction --in /media/justincsing/ExtraDrive1/Documents2/Roest_Lab/Github/PTMs_Project/synthetic_pool_timstoff/data/raw/IPP_M10_DIA-PaSEF_60min_Bruker10_400nL_1ul-inj-redo2_Slot2-25_1_2151_MS1.mzML --coords peptides.pkl --readOptions ondisk --verbose 1 --mslevel [1] --mz_tol 20 --rt_window 40 --im_window 0.08 Found Bruker sdk. Access to the raw data is possible.

[2022-09-30 13:18:20] INFO: Loading data... [2022-09-30 13:20:34] INFO: Reducing spectra using targeted coordinates... INFO: Processing..YVC(UniMod:4)EGPSHGGLPGAS(UniMod:21)SEK_3: 100%|███████████████████████████████████████████████████████████| 54/54 [01:14<00:00, 1.38s/it] [2022-09-30 13:21:48] INFO: Finished extracting targeted spectra!

Cached

diapysef targeted-extraction --in /media/justincsing/ExtraDrive1/Documents2/Roest_Lab/Github/PTMs_Project/synthetic_pool_timstoff/data/raw/cached/20220928_171403_179508ef404e_1_1_ms1.mzML --coords peptides.pkl --readOptions cached --verbose 1 --mslevel [1] --mz_tol 20 --rt_window 40 --im_window 0.08 Found Bruker sdk. Access to the raw data is possible.

[2022-09-30 13:41:11] INFO: Loading data... [2022-09-30 13:41:11] INFO: Reducing spectra using targeted coordinates... INFO: Processing..YVC(UniMod:4)EGPSHGGLPGAS(UniMod:21)SEK_3: 100%|███████████████████████████| 54/54 [00:21<00:00, 2.57it/s] [2022-09-30 13:41:32] INFO: Finished extracting targeted spectra!

singjc commented 2 years ago

@jcharkow I tested on windows, and everything works. Can we merge before changes get too large?

jcharkow commented 2 years ago

It looks like some of my comments may have included a bigger code block than intended but if you have any questions let me know

singjc commented 2 years ago

Great thanks for the comments, I addressed the comments and made the changes.

jcharkow commented 2 years ago

Looks good I think its ready to merge!