biocore / mg-scripts

Knight Lab internal Metagenomic processing scripts for demultiplexing, QC and host removal
BSD 3-Clause "New" or "Revised" License

Save adapter-filtered-only data #118

Closed wasade closed 8 months ago

wasade commented 9 months ago

I believe the failures here are unrelated to this PR

antgonza commented 9 months ago

> I believe the failures here are unrelated to this PR

I agree, these seem to be errors from recent merges in some of the deps; @charles-cowart, could you pull @wasade's changes and issue a new PR with fixes?

coveralls commented 9 months ago

Pull Request Test Coverage Report for Build 7825003641


Totals:
Change from base Build 7697099897: 0.0%
Covered Lines: 2056
Relevant Lines: 2370

wasade commented 9 months ago

The intention was that someone familiar with SPP would resolve those pieces.

antgonza commented 9 months ago

@wasade, right, sorry, that question was for @charles-cowart ...

charles-cowart commented 9 months ago

I reviewed the newly generated files and found that the sequence counts in the 'adapter-removal-only' files were equal to the values for 'quality_filtered_reads' found in the fastp-produced JSON files. A quick zcat of the filtered files confirmed that their counts are smaller than the numbers that metapool, and thus SPP, report for quality_filtered_reads_r1r2.

I implemented and tested a solution for this in qiita-rc: the existing quality_filtered_reads_r1r2 column is renamed appropriately, and the new values for quality_filtered_reads_r1r2 are generated by counting lines with zcat. This worked well, but I later realized it was more appropriate to implement the solution in metapool.
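For anyone following along, here is a rough Python equivalent of the zcat line count described above. It is only a sketch, not the code that was tested in qiita-rc: the function names `count_fastq_reads` and `count_r1r2` are hypothetical, and it assumes plain four-line FASTQ records in gzip-compressed files.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a gzipped FASTQ file (hypothetical helper).

    Assumes every record spans exactly four lines, which is the same
    assumption made when dividing a `zcat file | wc -l` result by four.
    """
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def count_r1r2(r1_path, r2_path):
    """Sum the R1 and R2 record counts for one sample (hypothetical helper)."""
    return count_fastq_reads(r1_path) + count_fastq_reads(r2_path)
```

Counting lines streams the file once and never holds it in memory, so it should behave much like the zcat approach on large files.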

In a nutshell, metapool assumes that quality_filtered_reads_r1r2 can be determined correctly from the fastp-generated JSON files found in a run directory, and seqpro is what generates the quality_filtered_reads_r1r2 column in the prep-info files.
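To illustrate the assumption described above, here is a minimal sketch of pulling a post-filtering read count out of a fastp JSON report. This is not metapool's actual code: the function name `reads_after_filtering` is hypothetical, and the `summary` / `after_filtering` / `total_reads` key path is my assumption about the report layout, so check it against a real report before relying on it.

```python
import json


def reads_after_filtering(json_path):
    """Return the post-filtering read count from a fastp JSON report.

    Hypothetical helper. The key path below is an assumption about the
    report layout, not necessarily the field that metapool reads.
    """
    with open(json_path) as handle:
        report = json.load(handle)
    return report["summary"]["after_filtering"]["total_reads"]
```

A value obtained this way describes whatever fastp wrote out for that run, so it only matches a given set of fastq files if those files are the ones fastp produced the report for.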

I won't add my changes to mg-scripts here. However, Antonio and I both think it's useful to keep generating these fastq files, even if we don't need them to generate the counts.

Below is a link to the changes to metapool. Fixes for the tests will appear before Monday. https://github.com/biocore/metagenomics_pooling_notebook/pull/177

charles-cowart commented 8 months ago

I removed the updates to make the tests work properly with the updated metapool, since they are now merged into the master branch. We'd still like to have the mods from this PR to generate the adapter-trimmed files. I will go ahead and merge since Antonio has already approved.