bio-raum / FooDMe2

A nextflow pipeline for the identification of species from mixed samples based on mitochondrial amplicons
https://bio-raum.github.io/FooDMe2/
GNU General Public License v3.0

💡 [REQUEST] - Sample sheet generation #66

Closed: gregdenay closed this issue 1 month ago

gregdenay commented 1 month ago

Is your feature request related to a problem?

Generating a sample sheet with the file paths currently relies on third-party solutions, either an external script or a user-generated table (made with Excel or bash). This is quite impractical, as it forces users to invest time in creating tables, or to look for third-party software if they do not have the necessary skills to write scripts.

Describe the solution you'd like

Ideally, users could provide either a custom sample sheet or a path (or list of paths) to folders. If paths are provided, a script should collect the .fastq files in these folders, pair them where applicable (auto-detected or via a '--pe' argument) and automatically generate the sample list for the rest of the workflow. It should even be possible to let users provide a regex as an argument for custom file naming.

Description

The BfR script create_sample_sheet from the ABC pipelines would be a good place to start.

Additional context

No response

marchoeppner commented 1 month ago

Agreed, although since running the pipeline does not require downloading the code base, users would need to download the samplesheet generator separately. Which is doable, but not super-duper-integrated or convenient.

Another option would be to allow an alternative input method instead of a TSV file:

reads = Channel.fromFilePairs(params.path_to_reads)

Where path_to_reads could be a wildcard like

--path_to_reads '/path/to/reads/*_R{1,2}_001.fastq.gz'

which would read all fastq files, and group them based on the section matching the '*'.
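
To illustrate with hypothetical file names, a pair like sample1_S1_R1_001.fastq.gz / sample1_S1_R2_001.fastq.gz would be grouped under the key matched by the '*':

// sketch only; file names are made up for illustration
Channel
    .fromFilePairs('/path/to/reads/*_R{1,2}_001.fastq.gz')
    .view()
// emits: [sample1_S1, [/path/to/reads/sample1_S1_R1_001.fastq.gz, /path/to/reads/sample1_S1_R2_001.fastq.gz]]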

This option has some drawbacks when it comes to multi-lane sequencing setups (i.e. we currently use the sample_id to group all reads belonging to the same sample, which this solution could not account for). But just to throw it into the ring as an option.

gregdenay commented 1 month ago

If we decide to go for it, it should be packaged as a module so that it stays transparent for users. Ideally, the nextflow call would look something like this:

nextflow run bio-raum/FooDMe2 \
  -profile myprofile \
  -r main \
  --input /path/to/raw/data/ \
  --run_name pipeline-test \
  --primer_set amniotes_dobrovolny

The module would find (and pair) the files in the folder, extract the names and create the sample channels. The information on whether to look for Illumina or other naming schemes is already contained in other arguments, and the module should also be able to handle TSV input for the special cases.

Do you think this is doable? Or do you have ideas for a better implementation?

marchoeppner commented 1 month ago

I don't think that would work, since --input is already used for the samplesheet. I cannot think of a neat way to have the pipeline decide internally what it is seeing behind that parameter and how to proceed.

In nf-core, the typical way was the one I described, so something like:

--input samples.tsv and --reads /path/to/reads/*_R{1,2}_001.fastq.gz

where these two options (--reads, --input) are mutually exclusive. For the --reads option, we would also need a small function to map the reads into a meta dictionary (easy enough).
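
For the meta dictionary, a minimal sketch could look like the one below (the meta keys are illustrative, not the final implementation):

// sketch only: wrap the grouping key from fromFilePairs into a meta map
ch_reads = Channel
    .fromFilePairs(params.reads)
    .map { sample_id, reads ->
        def meta = [ sample_id: sample_id ]
        tuple(meta, reads)
    }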

gregdenay commented 1 month ago

Ah OK, that's what you meant. That looks easy enough.

marchoeppner commented 1 month ago

Initial implementation in commit 01574eb

gregdenay commented 1 month ago

That looks great! Just a note: the lane info is missing from the regex pattern. I would update the docs to provide a list of regexes for common file naming patterns.

marchoeppner commented 1 month ago

True, although lanes cannot be considered in this approach, so if someone has libraries spread across lanes, the whole thing goes "boom". I suppose one could, with added work, collect all the reads first, and then see if they contain data from multiple lanes. But the logic for that would need to be internal to the pipeline, which invites all kinds of problems depending on what naming scheme people use.
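
Just to sketch what that internal logic might look like (purely hypothetical, assuming Illumina-style names with an _L00X lane token; any other naming scheme would break it):

// rough sketch only, not part of the pipeline
Channel
    .fromFilePairs(params.reads, size: -1)
    .map { id, reads ->
        // strip a trailing lane token (_L001 etc.) from the grouping key
        def sample_id = id.replaceAll(/_L\d{3}$/, '')
        tuple(sample_id, reads)
    }
    .groupTuple()   // one entry per sample, holding a list of per-lane read sets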

But good idea to add typical regexp patterns!
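
For the docs, examples along these lines could work (strictly glob patterns rather than regexes, assuming the --reads option discussed above; they would need checking against real data):

# standard Illumina naming, e.g. Sample_S1_L001_R1_001.fastq.gz (lane stays in the grouping key)
--reads '/path/to/reads/*_R{1,2}_001.fastq.gz'
# simple naming, e.g. Sample_1.fastq.gz / Sample_2.fastq.gz
--reads '/path/to/reads/*_{1,2}.fastq.gz'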

gregdenay commented 1 month ago

As I understand it, there is no way to reduce SampleName_SXX_LYYY_RZ_001.fastq.gz to just SampleName with this approach? And it can only deal with paired-end data, so it won't work with other technologies. I'm not sure I am entirely satisfied with this.

What don't you like about adding the create_sample_sheet script from BfR (or a modified version) as a module for data collection?

marchoeppner commented 1 month ago

It should be able to deal with single-end data, since I set up the channel factory with the size: -1 option, which is agnostic to the number of files. The next bit then checks whether this is single-end or paired-end based on how many files matched the pattern.
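
Roughly like this (illustrative sketch, not the exact code from the commit):

Channel
    .fromFilePairs(params.reads, size: -1)   // size: -1 accepts any number of files per grouping key
    .map { sample_id, reads ->
        def meta = [ sample_id: sample_id, single_end: reads.size() == 1 ]
        tuple(meta, reads)
    }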

Again, if the pattern is even slightly off, the input will be fairly nonsensical.

The samplesheet script is fine, I am just pointing out that it would need to be separate from the pipeline - or we build a third workflow under workflows/ called "create_sample_sheet" so it is usable without a separate download?

Failing that, linking a Python script from the FooDMe2 docs is also fine, of course.

gregdenay commented 1 month ago

I don't get why it needs to be external. Isn't it possible, with a path as input, to run a module that creates the channels based on the result of some script?

If it can handle single-end data too, it's not so bad then; it is just a bit ugly that the sample and lane info stays part of the file name, but that's not critical. Let's leave it like this and see how it gets used.

marchoeppner commented 1 month ago

I see what you mean - like generating a samplesheet internally and passing that to the input as an alternative to specifying the input from the command line. I think the main issue I have with that is that people do not get to see their samplesheet prior to processing. Which is always a bit... uhm. Too much "magic" ;) But in principle, yes, that could probably be done by restructuring the whole input logic a little bit.