artic-network / fieldbioinformatics

The ARTIC field bioinformatics pipeline
MIT License

Support for Guppy-based demultiplexing #5

Closed sagrudd closed 4 years ago

sagrudd commented 4 years ago

Dear Nick, this PR aims to provide some basic support for running the ARTIC workflow on sequence data that has already been demultiplexed by Guppy. Following feedback from users in China, there is a case for removing the requirement for the sequencing summary file for the medaka workflow. To make this minimally intrusive to existing code, I have created a target called guppyplex in pipeline.py, which aims to provide functionality equivalent to both the gather and demultiplex steps. I also have a Snakefile for automating the workflow; would you like that committed here too? I would welcome your thoughts and am very pleased to discuss further offline. Cheers - S

nickloman commented 4 years ago

Thanks for this, I'll take a look in a bit.

nickloman commented 4 years ago

Dear @sagrudd I merged this in haste:

Did you mean: if get_read_mean_quality(rec) < args.quality: ??

nickloman commented 4 years ago

Also, what is the purpose of: r = random() followed by if r >= args.sample: continue

sagrudd commented 4 years ago

(1) - get_read_mean_quality(rec) < args.quality - is a filter for mean read quality (as calculated by Biopython) - this is working towards reduction of the total number of reads in cases where read-count is high (I know that the minion step can also subsample) - I intended '<=' here since if I ask for a read filter at Q=7.5 I would like reads with Q of 7.5 included - the actual impact of using '<' will be negligible

(2) - I have included a parameter for sub-sampling large sequence collections (args.sample), again to reduce the data if there is a silly excess of reads. random() returns a float in the range [0, 1) and sample should also be a float. If I specify a subsample of 0.25, we skip any read where the random value is >= 0.25, which should keep roughly a quarter of the reads and reduce the data accordingly.

This value is 1 by default, so no skipping is performed.

Pleased to chat offline if this would be more productive?
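
The two filters described in this comment can be sketched roughly as follows. This is a minimal illustration, not the code in pipeline.py: the names mean_quality and filter_reads, the (name, phred_scores) read representation, and the way the mean quality is computed are all assumptions made for the example, whereas the actual code works on Biopython records via get_read_mean_quality.

```python
import random
from math import log10


def mean_quality(phred_scores):
    """Hypothetical stand-in for get_read_mean_quality().

    Averages the per-base error probabilities and converts the result
    back to a Phred score. The real helper may compute this differently.
    """
    probs = [10 ** (-q / 10) for q in phred_scores]
    return -10 * log10(sum(probs) / len(probs))


def filter_reads(reads, min_quality=7.0, sample=1.0, rng=None):
    """Yield (name, phred_scores) pairs that pass both filters.

    reads       -- iterable of (name, list of Phred scores); assumed format
    min_quality -- reads with a mean quality below this value are dropped
    sample      -- approximate fraction of reads to keep (1.0 keeps all)
    rng         -- optional random.Random instance, for reproducibility
    """
    rng = rng or random.Random()
    for name, phred in reads:
        # Quality filter: skip reads BELOW the threshold. Writing '>' here
        # would invert the logic and drop the good reads instead.
        if mean_quality(phred) < min_quality:
            continue
        # Sub-sampling: random() is in [0, 1), so with sample=1.0 the
        # condition is never true and no read is skipped.
        if rng.random() >= sample:
            continue
        yield name, phred
```

For example, list(filter_reads(reads, min_quality=7.0, sample=0.25)) would keep roughly a quarter of the reads that pass the quality threshold.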

nickloman commented 4 years ago

1) yeah but the logic was inverted in the original code, i.e. it was dropping anything with Q>7

2) gotcha!

sagrudd commented 4 years ago

d'oh! would you like me to submit a merge request for (1) or can you change? Apologies for the muppetness here ...