Generate amplicon mapping (prep) files with metapool/scripts/seqpro.py

ElDeveloper commented 3 years ago

Currently we generate the mapping files for an amplicon run with the amplicon-pooling.ipynb notebook; however, this should be changed to work the same way as with metagenomics runs, namely by using the sequencing run folder as input in combination with the sample sheet.

For example, for metagenomics, we run:

seqpro /sequencing/ucsd_2/complete_runs/210903_A00953_0391_AHKJ52DSX2/ sample-sheet.csv output-folder

In the metagenomics case, seqpro writes one preparation (mapping) file per project and per lane into output-folder.
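To illustrate the "per project and per lane" split, here is a minimal sketch that groups sample sheet rows by project and lane. It assumes the rows come from the [Data] section of an Illumina-style sample sheet with Sample_ID, Sample_Project and Lane columns; the function name is hypothetical and this is not how seqpro itself is implemented.

```python
from collections import defaultdict


def group_by_project_and_lane(rows):
    """Group sample IDs by (project, lane).

    `rows` is a list of dicts with the Sample_ID, Sample_Project and Lane
    keys (assumed column names from the sample sheet's [Data] section).
    """
    groups = defaultdict(list)
    for row in rows:
        key = (row["Sample_Project"], str(row["Lane"]))
        groups[key].append(row["Sample_ID"])
    return dict(groups)


rows = [
    {"Sample_ID": "s1", "Sample_Project": "ProjectA", "Lane": "1"},
    {"Sample_ID": "s2", "Sample_Project": "ProjectA", "Lane": "2"},
    {"Sample_ID": "s3", "Sample_Project": "ProjectB", "Lane": "1"},
]
groups = group_by_project_and_lane(rows)
# One prep file would be written per key in `groups`.
```

For amplicon runs, the same idea would apply with the project alone as the key.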

Therefore, we want mg-scripts (CC @charles-cowart) to call seqpro for amplicon runs too, so that we get one file per project. It is important that we generate the mapping (prep) file after the run is completed so we can populate fields such as run_prefix, runid, run_date, instrument model, etc., based on the run output. We already have code to handle this in metapool/prep.py, but it currently only supports metagenomic/metatranscriptomic data.
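As a rough illustration of populating fields from the run output, the sketch below pulls runid and run_date out of the run folder name. It assumes the usual Illumina folder layout of date_instrument_runnumber_flowcell with a six-digit YYMMDD date; parse_run_folder and the instrument_id, run_number and flowcell keys are hypothetical names, and the existing code in metapool/prep.py may obtain these values differently.

```python
from datetime import datetime
from pathlib import PurePosixPath


def parse_run_folder(run_dir):
    """Split a run folder name into run-level fields (hypothetical helper)."""
    # PurePosixPath ignores a trailing slash, so .name is the folder itself.
    runid = PurePosixPath(run_dir).name
    parts = runid.split("_")
    if len(parts) != 4:
        raise ValueError(f"Unexpected run folder name: {runid!r}")
    date_str, instrument_id, run_number, flowcell = parts
    # Assumes a six-digit YYMMDD date at the start of the folder name.
    run_date = datetime.strptime(date_str, "%y%m%d").date().isoformat()
    return {
        "runid": runid,
        "run_date": run_date,
        "instrument_id": instrument_id,
        "run_number": run_number,
        "flowcell": flowcell,
    }


fields = parse_run_folder(
    "/sequencing/ucsd_2/complete_runs/210903_A00953_0391_AHKJ52DSX2/"
)
```

Mapping the instrument identifier to an instrument model would need a lookup that is not shown here.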

If we do this, we'll be able to unify the way in which sequence data is processed.

charles-cowart commented 3 years ago

@ElDeveloper Gotcha. The methods we would need to extract the fields we need for Amplicon are fairly analogous to what we already have for metagenomics, right?

ElDeveloper commented 3 years ago

Yep, exactly. The biggest difference is probably that, for amplicon runs, run_prefix is a single value shared by all samples, whereas in metagenomics each sample is assigned its own run_prefix.
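A minimal sketch of that difference, under stated assumptions: assign_run_prefixes is a hypothetical helper, the shared amplicon prefix is passed in by the caller, and the per-sample pattern (sample name, sample number, lane) is only an illustrative naming scheme, not necessarily what metapool produces.

```python
def assign_run_prefixes(sample_names, assay, shared_prefix=None, lane=1):
    """Return a {sample_name: run_prefix} mapping (hypothetical helper)."""
    if assay == "amplicon":
        # Amplicon: every sample in the run points at the same run_prefix.
        if shared_prefix is None:
            raise ValueError("amplicon runs need a shared run_prefix")
        return {name: shared_prefix for name in sample_names}
    # Metagenomics: each sample gets its own run_prefix.
    # The naming pattern below is an assumption made for illustration.
    return {
        name: f"{name}_S{i}_L{lane:03d}"
        for i, name in enumerate(sample_names, start=1)
    }


amplicon = assign_run_prefixes(["a", "b"], "amplicon", shared_prefix="run_16S")
metagenomic = assign_run_prefixes(["a", "b"], "metagenomic")
```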


charles-cowart commented 2 years ago

@ElDeveloper Just to confirm, generate_qiita_prep_file() (https://github.com/biocore/metagenomics_pooling_notebook/blob/f9f8438877fea0e8584ab619160c7b3d18e1479a/metapool/prep.py#L445) takes a plate map as a parameter, while seqpro expects a run directory and a sample sheet. It seems like I should be able to write an inverse operation that turns a sample sheet into a prep file, perhaps with 1-2 additional inputs. Does that sound in line with what you were thinking?

ElDeveloper commented 2 years ago

Not sure about writing an inverse function; I'll leave that up to your judgement. In the end, the goal should be to change preparations_for_run so it can understand when and how to process a 16S run. You can definitely use code from generate_qiita_prep_file. Does that make sense? @callaband wrote the code in generate_qiita_prep_file, so definitely touch base with her if you have any questions.

The final CLI (from seqpro's point of view) should look basically the same for metagenomic and for amplicon runs.
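One way to keep the CLI identical would be to decide the run type from the sample sheet itself. The sketch below is only an illustration: it assumes the sample sheet's [Header] section carries an Assay field, and the read_assay and choose_prep_kind helpers, as well as the "Amplicon" value, are hypothetical rather than part of metapool.

```python
import csv
import io

# Assumption: Assay values that mark an amplicon (e.g. 16S) run.
AMPLICON_ASSAYS = {"Amplicon"}


def read_assay(sample_sheet_text):
    """Return the Assay value from the [Header] section, or None."""
    in_header = False
    for row in csv.reader(io.StringIO(sample_sheet_text)):
        if not row or not row[0].strip():
            continue  # skip blank or empty rows
        key = row[0].strip()
        if key.startswith("["):
            in_header = key == "[Header]"
            continue
        if in_header and key == "Assay":
            return row[1].strip() if len(row) > 1 else None
    return None


def choose_prep_kind(assay):
    """Pick which prep-file code path to use for a run."""
    return "amplicon" if assay in AMPLICON_ASSAYS else "metagenomic"


sheet = "[Header]\nAssay,Metagenomics\n\n[Data]\nSample_ID,Lane\ns1,1\n"
kind = choose_prep_kind(read_assay(sheet))
```

With something like this inside preparations_for_run, the seqpro invocation would take the same three arguments for both run types.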

Generate amplicon mapping (prep) files with metapool/scripts/seqpro.py #40