bigbio / sdrf-pipelines

A repository to convert SDRF proteomics files into pipelines config files
Apache License 2.0
16 stars 21 forks source link

[Pitch] Partial keep raw in OpenMS #145

Closed jspaezp closed 1 year ago

jspaezp commented 1 year ago

Hello,

I am working on a project where I would like to convert an sdrf to an openms workflow where I would like to keep the extension as is, only for some files.

(I am also implementing the workflow... https://github.com/nf-core/quantms/issues/64#issuecomment-1666645298)

Ideas:

  1. parse_sdrf convert-openms {other options/args} --raw {.d/.raw/all/none)
    # --raw .d would keep the raw extenion if it is .d, make it .mzML otherwise
    # --raw all would keep all as raw
    # --raw none would convert all to mzML
  2. 
    parse_sdrf convert-openms {other options/args} --extension_convert "raw:mzML,mzml:MZML,mzML:mzML,d:d"

I am not that creative with names ...

if --extension_alias is set, it will try to replace with the passed options

and error out if a file does not match

In this example

specs.raw -> specs.mzML

specs.mzml -> specs.mzML

specs.mzML -> specs.mzML

specs.d -> specs.d

specs.docx -> # NOTHINGS, ERROR raised



let me know what you think and if the feature would be something you want me to make a PR for

Places where changes would go:
https://github.com/bigbio/sdrf-pipelines/blob/adf9279a5c1c03d578575c4a89dfd535af8715c2/sdrf_pipelines/openms/openms.py#L582-L586
(there is another place in the same file)

https://github.com/bigbio/sdrf-pipelines/blob/adf9279a5c1c03d578575c4a89dfd535af8715c2/sdrf_pipelines/parse_sdrf.py#L37C1-L37C1
ypriverol commented 1 year ago

@jspaezp I like option 2. What do you think @jpfeuffer @daichengxin @fabianegli ?

jpfeuffer commented 1 year ago

Yes maybe we should do "2.", just in case we allow mixed designs at some point.

Name of the option could also be "extension_map(ping)"

ypriverol commented 1 year ago

I have finished the first iteration of this implementation. @jpfeuffer Can I ask you something, will the name only changed in the openms.tsv but in the experimentl_design.tsv they are keep with the original name?

jpfeuffer commented 1 year ago

So, to be honest, after thinking about it a bit more, I think the safest way would be to always keep the original filenames everywhere.

BUT this means that you need to pass the design along the workflow for a bit longer, until the actual conversion step. The actual conversion step will then precisely know what it is converting and which filenames to replace in the design.

By the way, has anyone tried with having the "raw" names in the design first? I thought @timosachsenberg implemented a file basename check only in the affected OpenMS tools.

jpfeuffer commented 1 year ago

I'm saying this because in n theory, this is all a hack based on the idea "we know that it most likely needs to be converted at some point in the future to mzML", which is not very nice.

ypriverol commented 1 year ago

@jpfeuffer I guess, this will need some refactoring in quantms and also OpenMS if we don't do conversions?

jpfeuffer commented 1 year ago

No, as I said this should be implemented.

ypriverol commented 1 year ago

Can we continue with the following approach and then do the major refactoring? In any case, this PR only include the possibilityin sdrf-pipelines to perform conversions in names. For example, if the keep names are allowed, then conversion will not be performed.

jpfeuffer commented 1 year ago

I think no changes are necessary but someone has to try.

ypriverol commented 1 year ago

I actually removed now the option keep raw in the sdrf-pipelines. That would be the default option. If we want to convert between formats we will need to use option --extension_convert including raw -> mzML and others.

jspaezp commented 1 year ago

Hello there!

This fix broke how my data was handled, If I had files that ended with .zip it would fail but finishing with exit status 0 and generating an empty experimental design.

I am working over here on fixing this, I also noticed that the testing suite always passed because "ERROR" was looked for in the output, but click errored out with the "Error" word, thus it never failed.

Right now this test fails on me, let me know if that is the intended behavior.

def test_validate_srdf(change_test_dir):
     """
     :return:
     """
     runner = CliRunner()
     result = runner.invoke(cli, ["validate-sdrf", "--sdrf_file", f"{TESTDATA_PATH}/PXD000288.sdrf.tsv", "--check_ms"])
     assert result.exit_code == 0, result.output
     assert "ERROR" not in result.output.upper()
     print("validate sdrf " + result.output)
     """
     # Currently fails with this error
     E       AssertionError: The following columns are mandatory and not present in the SDRF: comment[technical replicate] -- ERROR
     E         The column comment[technical replicate] is not present in the SDRF -- ERROR
     E         The following columns are mandatory and not present in the SDRF: comment[technical replicate] -- ERROR
     E         The column comment[technical replicate] is not present in the SDRF -- ERROR
     E         There were validation errors.
     E         
     E       assert True == 0
     E        +  where True = <Result SystemExit(True)>.exit_code
     """
ypriverol commented 1 year ago

You mean in your experimental design you have data in the way .zip and you want to convert .zip -> .d in the OpenMS experimental design?

jspaezp commented 1 year ago

Let me go back a bit ... just to make sure we are all in the same page.

If I wanted to set up a workflow with the file https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/05/PXD037164/3817_TIMS2_2col-80m_37_1_Slot1-46_1_4768.d.zip; then I would build a .sdrf where that URL goes in the comment[file uri] column; therefore it would make a lot of sense to me that the comment[data file] column contains the filename from that URI (3817_TIMS2_2col-80m_37_1_Slot1-46_1_4768.d.zip).

nontheless, since we need the name in the experimental design (experimental_design, column Spectra_Filepath) to match the file name that the search engine will recieve, we would need to convert that .zip to .d.

'why do it in this stage?': Since there is now a flag in quantms that allow forcing the conversion of .d files to .mzML, makes more sense to me to allow the user to have the same starting .sdrf file, with any extension supported by the pipeline (d.gz, d.tar.gz, .d.tar, .d ...) and have the pipeline translate the extension, instead of having the user create an .sdrf with the "expected extension" of the output.

HAVING SAID THAT! even if that is not something you want, I will split the work on the testing side of things and submit a PR on it.

ypriverol commented 1 year ago

Ok, the idea is that we allow users to convert:

I actually in the logic validated that we have only some extensions supported like .d -> .d or .raw -> .mzML, etc. I agree with you that we should move to a more generic appraoch when we allow any type of conversion from any file type to any file type. I will do a PR with the new changes for you to review.