galaxyproteomics / tools-galaxyp

Galaxy Tool Shed repositories maintained and developed by the GalaxyP community
MIT License
34 stars 58 forks source link

thermo raw file converter changes names of files in the collection #706

Closed hechth closed 1 year ago

hechth commented 1 year ago

While the thermo raw file converter leaves the display labels of samples intact, it renames the files in a collection using the tool name - this makes downstream analysis which relies on filenames (which I know is not best, but common, practice) impossible and breaks quite some downstream tools. Is it possible to keep the filenames intact and not change those? msconvert actually leaves those intact - this behaviour should be adopted.

bgruening commented 1 year ago

Yes I think this is possible, however the names in collections needs to be unique and dataset "names" are not guarenteed to be unique. Do you want to work on it?

bernt-matthias commented 1 year ago

While the thermo raw file converter leaves the display labels of samples intact, it renames the files in a collection using the tool name

This should not be the case. If you apply the converter to a collection it will be mapped over and the collection elements of the output collection should be exactly (and in any case) identical the the identifiers of the input collection.

If not, then could you show us / detail how you apply the converter to a collection.

The only point that might be problematic is that the tool tries to save the file on disk using the element name (but replaces some characters that can not be used in file names): https://github.com/galaxyproteomics/tools-galaxyp/blob/ad72f82951c17d9b9e200efc8ac7b00bf3a72cc0/tools/ThermoRawFileParser/thermo_converter.xml#L17 .. this might be stored in the output (still this should not affect output collection identifiers) .. which might lead to problems.

We could try to allow more characters here. Currently everything that is not a alphanumeric char, -, ., or _ is forbidden and replaced by _. For instance allowing " " (space) could increase usability. But (quite) likely create downstream problems because handling files with spaces is annoying (I guess because only people from the windows world use such things :) )...

I have the feeling that it would be best to solve this by documenting which characters should be used for sample names.

hechth commented 1 year ago

@bernt-matthias the diplay label is the same as the file indentifier but the dataset name isn't - see this history: https://umsa.cerit-sc.cz/u/hechth/h/20230425-thermo-convert-bug

The problems come in downstream tools, like XCMS takes the collection of files and outputs a table based on the file identifiers - if those change in between, the names in the sample metadata table then don't match anymore.

bernt-matthias commented 1 year ago

Collection 4 and 5 are fine, right? Everything has the same identifier (but maybe not the name .. but this should not be used).

Problem is Dataset 14? But this is not a collection, but a single dataset. I see that this would not work, but you could just document that users should always use collections?

Do you have a link to a specific xcms tool that creates problems.

hechth commented 1 year ago

@bernt-matthias I still struggle to understand what is the difference between the name and identifier - why are the both of them there and what are they used for? If name should not be used, then why is it there?

Well, basically the whole workflow uses the dataset identifier in all outputs, so peak and intensity tables etc. - see the latest outputs in the history: https://umsa.cerit-sc.cz/u/hechth/h/20230425-thermo-convert-bug

bernt-matthias commented 1 year ago

As far as I understand it:

Maybe a little more background:

If you think of a multi sample (e.g. sample A, sample B) analysis involving many steps, e.g. tool 1, tool 2, ..., tool X (where maybe some of the tools can run in parallel, i.e. in an unpredictable order). Then we have to options to name tool outputs if we use just datasets:

1) by sample names: then we would have X times the output named sample A, and X times the output sample B, ... and it would be hard to find out which output was generated by which tool (of course you could in the job info .. but we want to have this more obvious). 2) use the default Galaxy naming scheme Tool X on dataset I: ..., obviously this is also not helpful for a multi sample analysis.

So both naming schemes are no applicable. Here collections came in handy because they allow to organize datasets as elements of a collection and:

1) use sample names as collection (element) identifiers, i.e. collection elements will have the same name throughout the analysis .. the sample name 2) the collection itself has a name which will/should follow the Galaxy output convention (Tool X on dataset I).

Maybe this docs are helpful: https://planemo.readthedocs.io/en/latest/writing_advanced.html?highlight=element_identifier#processing-identifiers

My abstraction of collections is that they are something like a folder containing symlinks to the actual datasets (which really exist as hidden element in the history !! ) .. The element identifier is then the name of the symlink and the name the name of the actual dataset.

Long story short: I think the problem is actually that xcms uses the name, but should use the identifier:

https://github.com/workflow4metabolomics/tools-metabolomics/blob/d9b3849751af6f74e3371db0c3525dcd08728723/tools/xcms_macro/macros_xcms.xml#LL20C72-L20C72

My suggestion would be to replace $single_file.name by $single_file.element_identifier.. (there may be more places where this is needed).

hechth commented 1 year ago

this sounds like a good update to XCMS as well, never the less, I think we should do the same here since if the thermo raw file converter is run on a single dataset the file_identifier will also be overwritten, so I still think that this change is necessary.

bernt-matthias commented 1 year ago

if the thermo raw file converter is run on a single dataset the file_identifier will also be overwritten

I understand the problem, but I do not see that leads to a usable history. All datasets would have the same name (except that in the linked PR .. the file extension would be used in the dataset name .. which might work for some downstream tools but likely not all)

If one really wants to analyze a single datasets: would it be a possibility to put it in a collection?

Also: are there use cases where one analyses single datasets? There should always be replicates and likely more than one sample anyway....

hechth commented 1 year ago

All datasets would have the same name (except that in the linked PR .. the file extension would be used in the dataset name .. which might work for some downstream tools but likely not all)

Why would all datasets have the same name?

I mean we have two options - we make this change here or we change all of XCMS - because in this way, thermo raw file converter is not compatible with XCMS because every item name has to be renamed. So either we change this one tool here or we change all XCMS tools.

Also, msconvert actually uses exactly the same, so this change would only make things consistent ...

https://github.com/galaxyproteomics/tools-galaxyp/blob/cdf3e9652b10c7a0b179202129a797e32fd95909/tools/msconvert/msconvert_macros.xml#LL526C78-L526C82

Please feel free to post a better solution here or if you disagree with the changes to close the PR.

bernt-matthias commented 1 year ago

Luckily xcms uses a macro: https://github.com/workflow4metabolomics/tools-metabolomics/pull/238

bernt-matthias commented 1 year ago

Hey @hechth . I think I'm finished with https://github.com/workflow4metabolomics/tools-metabolomics/pull/238 .. hope this helps. Maybe you can ping somebody for a review.

Anything else to do here?

hechth commented 1 year ago

Thanks a lot! I think all people responsible for the review are already in the conversation.