OpenMS / OpenMS

The codebase of the OpenMS project
https://www.openms.de

SeedListGenerator output by prefix for consensusXML input #4404

Open bernt-matthias opened 4 years ago

bernt-matthias commented 4 years ago

For consensusXML input the number of outputs depends on the contents of the input.

To automate this step, it would be great to be able to just specify a prefix for the outputs.

For instance, one could have an optional -out FILE argument that must be specified if the input is not consensusXML, and an -out_prefix PREFIX argument that must be specified if it is.
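The rule proposed above can be written down as a small check. This is only a sketch of the idea, not OpenMS code: the function name `check_output_args` is hypothetical, and detecting consensusXML by file extension is an assumption made to keep the example short.

```python
from typing import Optional


def check_output_args(in_file: str, out: Optional[str], out_prefix: Optional[str]) -> str:
    """Return which output option applies, or raise ValueError if it is missing."""
    # Assumption: the input type is recognised by its file extension.
    is_consensus = in_file.lower().endswith(".consensusxml")
    if is_consensus:
        # The number of outputs depends on the input content, so a prefix is needed.
        if not out_prefix:
            raise ValueError("-out_prefix is required for consensusXML input")
        return "out_prefix"
    # Every other input type produces exactly one output file.
    if not out:
        raise ValueError("-out is required for non-consensusXML input")
    return "out"
```

With this rule, a wrapper generator would only need to know that -out_prefix names a family of files rather than a single file.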

Background: For the CTD -> galaxy tool the automatic conversion of the consensusXML input case seems impossible. I'm wondering if there is a workaround.

jpfeuffer commented 4 years ago

True. This would be much better, but it also needs to be reflected in the CTD for other wrappers. We do not yet support CTD's "output prefix capabilities", but adding support is probably not too much work (it is basically just another tag).

bernt-matthias commented 4 years ago

There already seem to be some other places where directories are represented as simple strings. A quick grep in the CTD files gave me this list:

  MzMLSplitter -out
  PepNovoAdapter -dir
  IDRipper -out
  MascotAdapter -mascot_directory temp_data_directory
  InspectAdapter -inspect_directory temp_data_directory
  PrecursorIonSelector -tmp_dir
  IDFileConverter -in
  OpenSwathFileSplitter -outputDirectory

Maybe adding a simple string option would be a good first step.

bernt-matthias commented 4 years ago

For the case of IDRipper there is -out_path and -out:

  -out <file>        The path to this file is used as the output directory. (valid formats: 'idXML')
  -out_path <file>   Directory for the output files after ripping according to 'file_origin'. If
                     'out_path' is set, 'out' is ignored.

With -out_path DIR, the files end up in the current working directory. Only with -out_path DIR/FILE (where FILE does not even need to exist) are the generated files placed in DIR.
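The observed behaviour is what one would see if the tool took the parent directory of whatever path it is given. That is an assumption, not something confirmed from the IDRipper source. The sketch below uses a hypothetical helper `parent_dir` to show the effect.

```python
from pathlib import PurePosixPath


def parent_dir(out_path: str) -> str:
    """Directory component of out_path; '.' stands for the current working directory."""
    return str(PurePosixPath(out_path).parent)


print(parent_dir("DIR"))       # .
print(parent_dir("DIR/FILE"))  # DIR
```

Under that assumption, a bare DIR is treated as a file name in the current directory, which would explain why a dummy FILE component is needed.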

bernt-matthias commented 4 years ago

DTAExtractor also seems to be an example.

cbielow commented 4 years ago

> Background: For the CTD -> galaxy tool the automatic conversion of the consensusXML input case seems impossible. I'm wondering if there is a workaround.

So how would this PREFIX help Galaxy? Would you assume all files with this prefix in the target directory were generated by the tool?

bernt-matthias commented 4 years ago

Exactly. The prefix would be an existing directory in Galaxy. Galaxy has means to take all files from a directory (optionally matching a regexp).

Galaxy does not know how many outputs there are or how they are named, so the easiest approach seems to be to use a prefix. An alternative would be to implement some additional logic in the Galaxy tools, but that logic would need to be specific to each tool, which seems difficult to automate.

For cases like IDRipper -out_path or MzMLSplitter -out the new parameter type simplifies the conversion.
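To make the "take all files from a directory, optionally matching a regexp" idea concrete, here is a minimal sketch. It is not Galaxy's implementation; the function name `collect_outputs` and the default pattern are illustrative assumptions.

```python
import os
import re
from typing import List


def collect_outputs(directory: str, pattern: str = r".*") -> List[str]:
    """Return sorted paths of regular files in directory whose names fully match pattern."""
    regex = re.compile(pattern)
    return sorted(
        entry.path
        for entry in os.scandir(directory)
        # Subdirectories are skipped; only the file name is matched, not the full path.
        if entry.is_file() and regex.fullmatch(entry.name)
    )
```

A call such as `collect_outputs("outdir", r".*\.featureXML")` would then pick up however many files the tool happened to write, without the wrapper knowing their names in advance.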

cbielow commented 4 years ago

OK, but then the safest thing to do is to create a new temp directory, have the tool create all its output files in there, and just grab all the files. The prefix solution is not very safe, since there might be other files in the same directory from previous runs or who knows what else.

bernt-matthias commented 4 years ago

This is exactly what Galaxy does, since each job (i.e. every single call to an OpenMS binary) has its own working dir.
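The isolation argument can be illustrated with a short sketch: each run gets a fresh, empty directory, so everything found there afterwards must come from that run. The helper `run_in_fresh_dir` is hypothetical, and `tool` is a stand-in callable for invoking the real binary with its outputs directed into the working directory.

```python
import os
import tempfile
from typing import Callable, Dict


def run_in_fresh_dir(tool: Callable[[str], None]) -> Dict[str, bytes]:
    """Run tool(workdir) in a new empty directory and return every file it wrote."""
    with tempfile.TemporaryDirectory() as workdir:
        tool(workdir)  # stand-in for the actual tool invocation
        outputs = {}
        for name in sorted(os.listdir(workdir)):
            path = os.path.join(workdir, name)
            if os.path.isfile(path):
                # Read the content now; the directory is deleted when the block exits.
                with open(path, "rb") as handle:
                    outputs[name] = handle.read()
        return outputs
```

Because the directory starts out empty, leftovers from previous runs cannot be mistaken for outputs.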

I guess the parameter could implement a check for whether there is a file matching PREFIX.*

jpfeuffer commented 4 years ago

I would hope the prefix can include subfolders. If not, please add this feature. E.g. if the prefix is tmp/foo it gathers all files matching a certain mimetype (if specified): i.e. $pwd/tmp/foo*.ext

bernt-matthias commented 4 years ago

The prefix can be anything. But so far the folders need to exist already (similar to normal output files which could also refer to filenames in subfolders).

The code currently checks if the prefix is writable:

https://github.com/OpenMS/OpenMS/blob/1a60d1707a58804a748557c0097c88600b7252e7/src/openms/source/APPLICATIONS/TOPPBase.cpp#L1307

Is there already some code in OpenMS to list files in a glob-like manner (e.g. PREFIX*.ext)? Or are there any suggestions on how to implement this?
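As one possible shape for such a listing, here is a sketch that avoids glob syntax altogether: it splits the prefix into a directory and a file-name stem and compares names as plain strings. The function name `list_by_prefix` is hypothetical and this is not existing OpenMS code.

```python
import os
from typing import List


def list_by_prefix(prefix: str, ext: str = "") -> List[str]:
    """List files named <prefix>*<ext>; the prefix may contain subfolders, e.g. 'tmp/foo'."""
    directory, stem = os.path.split(prefix)
    directory = directory or "."  # a bare prefix refers to the current working directory
    if not os.path.isdir(directory):
        return []
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.startswith(stem)
        and name.endswith(ext)
        and os.path.isfile(os.path.join(directory, name))
    )
```

Plain string comparison means characters such as `*` or `[` in the prefix are not interpreted as wildcards. A non-empty result before the tool runs could also serve as the "file matching PREFIX.* exists" check mentioned earlier. In C++17 the same loop could be written with `std::filesystem::directory_iterator`.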