[Discussion] Hard coded mzml paths

jspaezp commented 1 year ago

Hey there!

Me again ... If I am understanding the code correctly, when reading the DIA-NN report, it tries to match the data in the experimental design and changes the extension to mzML. Since I am working on bypassing the conversion to mzML, that leads to a:

  Traceback (most recent call last):
    File "/usr/local/lib/python3.10/site-packages/multiqc/multiqc.py", line 654, in run
      output = mod()
    File "/usr/local/lib/python3.10/site-packages/pmultiqc/modules/quantms/quantms.py", line 164, in __init__
      self.parse_diann_report()
    File "/usr/local/lib/python3.10/site-packages/pmultiqc/modules/quantms/quantms.py", line 1705, in parse_diann_report
      self.cal_num_table_data[i] = {'sample_name': self.exp_design_table[i]['Sample'],
  KeyError: 'MYFILE_1_DIA.mzML'

What do you think about the idea of attempting to generate concensus names? (find common prefix if names dont match perfectly).

This would be the chunk of code: ... of many ... where the matching happens

https://github.com/bigbio/pmultiqc/blob/a25b2b9f26b1ba3c33d02cba6464a2dc7e02ed4f/pmultiqc/modules/quantms/quantms.py#L148-L149

I would voluteer to add that if there was data for the tests in the current state. I only see data for the dda workflow.

ypriverol commented 1 year ago

Great idea, what do you think @WangHong007 @daichengxin

WangHong007 commented 1 year ago

Hi @jspaezp. Do you just want the prefix and then add the right suffix in the corresponding place? We actually already have that prefixes in assay name column from SDRF file. BTW, what is the cause of your error message here? Is your file prefix or suffix wrong?

jspaezp commented 1 year ago

Hello @WangHong007 ! thanks for the reply! I look forward to collaborating with you!

I am working on a fork of the pipeline that bypasses the .mzml conversion to run DIANN, so the file format that DIANN recieves is a .d, making it such that MYFILE.mzML does not exist, because the table contains the file MYFILE.d. On the other hand, some QC does happen in a converted .mzML. THEREFORE, I am suggesting that if not all names match 1:1 between tables, there will be an attempt to find a common prefix between them. (thus matching "MYFILE" regardless of the extension)

Regarding your comment, I don't totally understand what you mean, If I go by the example in this page: https://github.com/bigbio/proteomics-sample-metadata/tree/master/sdrf-proteomics the assay name seems like an alias unrelated to the file name.

https://github.com/nf-core/quantms/issues/64 https://github.com/bigbio/quantms/pull/275

ps: I would love to implement the feature myself but since there is not currently testing data for the DIA use case, I am scared of breaking backward compatibility. If you could add testing files for that I would love to make a draft PR that we could discuss further.

jpfeuffer commented 1 year ago

I agree with @jspaezp I think "assay name" is not necessarily the prefix. In theory people can put there whatever they want to describe that file. It just happens what we always picked the filename in our SDRFs.

jspaezp commented 1 year ago

Just as a nudge, let me know if you could provide some data for testing and I would be glad to add the feature!

ypriverol commented 1 year ago

@jspaezp go ahead and implement this. What data do you need? We have a lot of tests in the quantms repo.

jspaezp commented 1 year ago

https://github.com/bigbio/pmultiqc/tree/main/tests/resources/LFQ/pipeline_results Hi there! working on this PR I noticed that the files in here are actually soft links to a path, would you mind updating them?

$ tree
.
├── __init__.py
├── resources
│   ├── LFQ
│   │   ├── experimental_design.tsv
│   │   ├── mzMLs
│   │   │   ├── SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.mzML
│   │   │   └── SF_200217_pPeptideLibrary_pool1_HCDOT_rep2.mzML
│   │   ├── pipeline_results
│   │   │   ├── out.mzTab
│   │   │   ├── software_versions_mqc.yml -> /home/chengxin/newPR/quantms/work/39/5e77e1237a8605bb240b297a222f41/software_versions_mqc.yml
│   │   │   └── workflow_summary_mqc.yaml -> /home/chengxin/newPR/quantms/work/tmp/8d/425512a013a60d1e13521a94b94c01/workflow_summary_mqc.yaml
│   │   └── raw_ids
│   │       ├── SF_200217_pPeptideLibrary_pool1_HCDOT_rep1_comet_feat_perc.idXML
│   │       ├── SF_200217_pPeptideLibrary_pool1_HCDOT_rep1_msgf_feat_perc.idXML
│   │       ├── SF_200217_pPeptideLibrary_pool1_HCDOT_rep2_comet_feat_perc.idXML
│   │       └── SF_200217_pPeptideLibrary_pool1_HCDOT_rep2_msgf_feat_perc.idXML
│   └── PXD005942-Sample-25-out.mzTab
└── test_proteomicslfq.py

6 directories, 13 files

ypriverol commented 1 year ago

This has been implemented already.

jpfeuffer commented 1 year ago

While this might work now, I still feel that pmultiqc/quantms is too mzML focussed. Variables shouldn't be called mzml_xyz etc. because the info could in theory come from other types of raw files (especially in the future).

bigbio / pmultiqc

[Discussion] Hard coded mzml paths #80