Closed jspaezp closed 1 year ago
Great idea, what do you think @WangHong007 @daichengxin
Hi @jspaezp. Do you just want the prefix and then add the right suffix in the corresponding place? We actually already have that prefixes in assay name
column from SDRF file. BTW, what is the cause of your error message here? Is your file prefix or suffix wrong?
Hello @WangHong007 ! thanks for the reply! I look forward to collaborating with you!
I am working on a fork of the pipeline that bypasses the .mzml conversion to run DIANN, so the file format that DIANN recieves is a .d, making it such that MYFILE.mzML
does not exist, because the table contains the file MYFILE.d
. On the other hand, some QC does happen in a converted .mzML. THEREFORE, I am suggesting that if not all names match 1:1 between tables, there will be an attempt to find a common prefix between them. (thus matching "MYFILE" regardless of the extension)
Regarding your comment, I don't totally understand what you mean, If I go by the example in this page: https://github.com/bigbio/proteomics-sample-metadata/tree/master/sdrf-proteomics the assay name
seems like an alias unrelated to the file name.
https://github.com/nf-core/quantms/issues/64 https://github.com/bigbio/quantms/pull/275
ps: I would love to implement the feature myself but since there is not currently testing data for the DIA use case, I am scared of breaking backward compatibility. If you could add testing files for that I would love to make a draft PR that we could discuss further.
I agree with @jspaezp I think "assay name" is not necessarily the prefix. In theory people can put there whatever they want to describe that file. It just happens what we always picked the filename in our SDRFs.
Just as a nudge, let me know if you could provide some data for testing and I would be glad to add the feature!
@jspaezp go ahead and implement this. What data do you need? We have a lot of tests in the quantms repo.
https://github.com/bigbio/pmultiqc/tree/main/tests/resources/LFQ/pipeline_results Hi there! working on this PR I noticed that the files in here are actually soft links to a path, would you mind updating them?
$ tree
.
├── __init__.py
├── resources
│ ├── LFQ
│ │ ├── experimental_design.tsv
│ │ ├── mzMLs
│ │ │ ├── SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.mzML
│ │ │ └── SF_200217_pPeptideLibrary_pool1_HCDOT_rep2.mzML
│ │ ├── pipeline_results
│ │ │ ├── out.mzTab
│ │ │ ├── software_versions_mqc.yml -> /home/chengxin/newPR/quantms/work/39/5e77e1237a8605bb240b297a222f41/software_versions_mqc.yml
│ │ │ └── workflow_summary_mqc.yaml -> /home/chengxin/newPR/quantms/work/tmp/8d/425512a013a60d1e13521a94b94c01/workflow_summary_mqc.yaml
│ │ └── raw_ids
│ │ ├── SF_200217_pPeptideLibrary_pool1_HCDOT_rep1_comet_feat_perc.idXML
│ │ ├── SF_200217_pPeptideLibrary_pool1_HCDOT_rep1_msgf_feat_perc.idXML
│ │ ├── SF_200217_pPeptideLibrary_pool1_HCDOT_rep2_comet_feat_perc.idXML
│ │ └── SF_200217_pPeptideLibrary_pool1_HCDOT_rep2_msgf_feat_perc.idXML
│ └── PXD005942-Sample-25-out.mzTab
└── test_proteomicslfq.py
6 directories, 13 files
This has been implemented already.
While this might work now, I still feel that pmultiqc/quantms is too mzML focussed. Variables shouldn't be called mzml_xyz etc. because the info could in theory come from other types of raw files (especially in the future).
Hey there!
Me again ... If I am understanding the code correctly, when reading the DIA-NN report, it tries to match the data in the experimental design and changes the extension to mzML. Since I am working on bypassing the conversion to mzML, that leads to a:
What do you think about the idea of attempting to generate concensus names? (find common prefix if names dont match perfectly).
This would be the chunk of code: ... of many ... where the matching happens
https://github.com/bigbio/pmultiqc/blob/a25b2b9f26b1ba3c33d02cba6464a2dc7e02ed4f/pmultiqc/modules/quantms/quantms.py#L148-L149
I would voluteer to add that if there was data for the tests in the current state. I only see data for the dda workflow.