fmalmeida / bacannot

Generic but comprehensive pipeline for prokaryotic genome annotation and interrogation with interactive reports and shiny app.
https://bacannot.readthedocs.io/en/latest/
GNU General Public License v3.0
98 stars · 9 forks

diminish pipeline complexity: pass sample always with YAML #35

Closed fmalmeida closed 2 years ago

fmalmeida commented 2 years ago

This issue is somewhat related to what is discussed in other issues and may affect them.

What is the idea of this issue? It is to remove complexity from the pipeline. Instead of having one workflow for single samples with the --genome parameter and another for multiple samples with --in_yaml, I want to remove the parameters related to "single sample" analysis and have inputs always be given via the YAML samplesheet, whether for 1 or 1000 samples.

Therefore, since our focus is users with less familiarity with bioinformatics, it would also make sense to rename the parameters to make them more intuitive.

abhi18av commented 2 years ago

Hi there :)

Is there any specific reason why you'd like to use YAML instead of CSV/TSV for capturing sample information?

fmalmeida commented 2 years ago

Hi!

Well, the only reason was that YAML made it easier for me to implement optional parameters that could be passed (or not) inside the samplesheet without affecting the file structure.

For instance, if a CSV expected the following:

sample_id,resfinder_species,nanopore_reads,fast5

If users had a sample without a ResFinder species, they would have to set:

sample_1,NA,nano.fastq,fast5_pass

They would have to make sure to respect the expected columns, whereas with YAML that would not be necessary.

That was the main reason. If you have an idea of how a TSV/CSV could be supported seamlessly without being too restrictive, we could try to implement the possibility of giving either a YAML or a CSV/TSV.
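To illustrate the point, here is a hypothetical YAML samplesheet (the key names `id`, `resfinder`, `nanopore`, and `fast5` are made up for this sketch): an optional field can simply be omitted without breaking the structure, while the CSV above needs an explicit NA placeholder in that column.

```yaml
samplesheet:
  - id: sample_1
    nanopore: nano.fastq
    fast5: fast5_pass
    # resfinder key omitted entirely; no NA placeholder needed
  - id: sample_2
    nanopore: nano2.fastq
    resfinder: "Escherichia coli"
```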

abhi18av commented 2 years ago

Hey @fmalmeida ,

The reason I recommend CSV/TSV is simply the overall user experience. YAML looks clean for sure, but it is really hard to detect a malformed YAML file and warn the user about it, and from what I've seen, the typical end users of the pipeline (biologists) aren't really aware of this format and its indentation-based pitfalls.

On the other hand, CSV is pretty much the standard used for managing samplesheets.

Plus there is a native [splitCsv](https://www.nextflow.io/docs/latest/operator.html#splitcsv) operator with some inbuilt goodies to parse and manage the CSV fields.

For the case of a missing value in a field, perhaps we can introduce a placeholder value like NA or NULL. It is better to be explicit in the samplesheet. What are your thoughts?
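A minimal Nextflow sketch of this approach (column names are assumptions, matching the hypothetical header discussed above): `splitCsv` reads the header, and a `map` step normalizes empty fields to an explicit NA placeholder.

```nextflow
// Hypothetical sketch: parse a samplesheet CSV and normalize
// missing values to the explicit 'NA' placeholder.
// The column names (sample, resfinder, nanopore, fast5) are assumptions.
Channel
    .fromPath(params.input)
    .splitCsv(header: true)
    .map { row ->
        def resfinder = row.resfinder ?: 'NA'  // empty or absent field becomes 'NA'
        tuple(row.sample, resfinder, file(row.nanopore), file(row.fast5))
    }
    .set { samples_ch }
```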

fmalmeida commented 2 years ago

Hi @abhi18av,

This is quite feasible. In fact, the first step of the pipeline is already to convert the YAML into a CSV so that it can be parsed with the .splitCsv function.

Therefore, I guess it would not be too complicated ... And actually, based on what you said, it would be nice to accept both formats.

So, I think we could have something like this:

Today, the first steps of the pipeline are:

  1. Parse the YAML and create a CSV
  2. Parse a CSV to create the channels
  3. Enter the pipeline

Thus, I believe we could have a parameter that loads the CSV directly at the second step, adjusting the function in that step to understand both the CSV produced from the parsed YAML and a CSV given by the user, which may or may not have some fields.

What do you think?

If we establish something nice here, I can also make this option available in my other pipeline, which follows the same design.

abhi18av commented 2 years ago

> ... CSV given by the user that may or may not have some fields.

Agreed!

This way we reuse the existing code almost entirely. Though, considering the UX, perhaps we can have just one --input and, based on the suffix of this file (csv/tsv/yaml/yml, etc.), trigger the appropriate logic.

My personal preference is to really keep the pipeline parameters to a minimum. What do you think?

> If we establish something nice here, I can also make this option available in my other pipeline as well that follows the same.

Sure, I agree. However, for the time being, let's discuss all the (design/tower) changes in this pipeline first, test them, and then port them over in batch fashion.

fmalmeida commented 2 years ago

> Though, considering the UX, perhaps we can just have one --input

Yes, that is ideal. I discussed it a bit when opening the issue: I want to reduce the complexity and the number of parameters.

For that, I will actually remove the possibility of passing these "file-related" parameters (the ones used in the samplesheet) from the command line, and have the pipeline accept inputs only through the samplesheet, via a --input parameter that checks the suffix.
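The suffix check could look something like the sketch below (PARSE_YAML is a hypothetical process name standing in for the existing YAML-to-CSV conversion step; only the dispatch logic is the point here):

```nextflow
// Hypothetical sketch: a single --input parameter, dispatched on its suffix.
def input = file(params.input)

switch (input.extension) {
    case ['yml', 'yaml']:
        // step 1: convert YAML to CSV, as the pipeline already does today
        samplesheet_csv_ch = PARSE_YAML(input)
        break
    case ['csv', 'tsv']:
        // a user-supplied CSV/TSV skips step 1 and goes straight to parsing
        samplesheet_csv_ch = Channel.fromPath(input)
        break
    default:
        error "Samplesheet must end in .yml/.yaml or .csv/.tsv: ${params.input}"
}
```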

> changes in this pipeline first, test and then port them over in a batch fashion.

Yes, of course.

😉