biocore / metagenomics_pooling_notebook

Jupyter notebooks to assist with sample processing
MIT License

Collect sequence counts during prep file generation #35

Closed ElDeveloper closed 2 years ago

ElDeveloper commented 2 years ago

Running seqpro in a run folder with the latest pipeline will add three columns to the preparation file: bcl_counts, fastp_counts, and minimap2_counts. Each column describes how many records were found per sample at each stage. If one of the steps is missing, NA will be used instead (for example, if some samples are not human-filtered, the minimap2_counts column will be full of NA values).

The collection is based entirely on log files; if the log files are not present, there won't be sequence counts in the table.
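
To illustrate the behavior described above, here is a minimal sketch of filling the three columns with NA when a stage's counts are unavailable. It is not the code from this PR: the function name `add_sequence_counts`, the `sample_name` key, and the shape of the inputs are all assumptions made for illustration, and the per-sample NA fallback is likewise an assumption.

```python
# Hypothetical sketch, not the implementation in this pull request.
STAGES = ("bcl_counts", "fastp_counts", "minimap2_counts")


def add_sequence_counts(prep_rows, counts_by_stage, sample_key="sample_name"):
    """Return copies of prep_rows with one count column per stage.

    counts_by_stage maps a column name to {sample: count}, or to None
    when no log files were found for that stage.
    """
    merged = []
    for row in prep_rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for stage in STAGES:
            # A missing stage (None or absent) becomes an empty lookup,
            # so every sample gets "NA" in that column.
            stage_counts = counts_by_stage.get(stage) or {}
            # Assumption: a sample absent from a stage also gets "NA".
            new_row[stage] = stage_counts.get(row[sample_key], "NA")
        merged.append(new_row)
    return merged


prep = [{"sample_name": "sample_1"}, {"sample_name": "sample_2"}]
counts = {
    "bcl_counts": {"sample_1": 1000, "sample_2": 2000},
    "fastp_counts": {"sample_1": 900, "sample_2": 1800},
    "minimap2_counts": None,  # e.g. no human-filtering logs in the run folder
}
result = add_sequence_counts(prep, counts)
```

In this sketch the minimap2_counts column ends up as NA for every sample, while the other two columns carry the per-sample counts.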

@charles-cowart @antgonza for this PR it would be nice to have 2 reviewers.

ElDeveloper commented 2 years ago

Thanks both.

@antgonza (1) Yes, the code looks at all the outputs from mg-scripts, and I think it would make sense to have this there. However, when "seqpro" (the script that creates prep files) came to be, we didn't have a fully fledged mg-scripts structure/repo. Happy to move it over, but I don't think that's critical (for now). (2) This is not intended to be run in a notebook; it is intended to be run at the end of mg-scripts, when the prep file is generated. The parsing of the log files should be fairly quick. For example, in a run you would:

Down the road, we should create the preparation based on the sequence data and the prep, but that involves some changes in the klp plugin.

On Sep 24, 2021, at 6:06 AM, Antonio Gonzalez @.***> wrote:

@antgonza approved this pull request.

Looks good, thank you. Some of the code seems similar to what's being done in mg-scripts, right? Since it basically checks its outputs, (1) do you think it should live there, or what's the plan to integrate? Also, (2) the other concern is how long running these changes/code will take in a real run, and would running in a notebook actually work?
