Statistics generation

donkirkby commented 3 years ago

From @dmacmillan's summary e-mail.

We want a script to count the following:

a) total samples
b) number of QC "passed" samples
c) number of samples that failed due to not existing ("no_sequence" error)
d) number of samples that failed due to not being HIV ("non_hiv" error)
e) number of samples that failed due to any primer error ("no_primer" error)
f) number of samples that failed due to low internal coverage ("low_internal_cov" error)
g) number of samples that failed despite being an HIV sequence ("hiv_but_failed" error)

We want to do this on a per-run basis, on a per-participant-ID basis, and on an overall basis for proviral runs. In the dev branch of the proviral pipeline, I have removed the old logic, because the numbers should now be computed from the outcome summary file. I left a skeletal structure to make things easier.

[x] Summarize counts by run, participant, and grand total.
[x] Launch HIVSeqinR for all runs, before summarizing.
[x] Decide how to handle other error types, possibly shortening error codes.
[x] Document all error types.
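
As a rough illustration of the counting described above, here is a minimal sketch that tallies samples per run, per participant, and overall. It is not the pipeline's actual code: the column names (`run`, `participant_id`, `sample`, `error`), the convention that an empty `error` field means the sample passed, and the `summarize` function are all assumptions made for the example, and the real outcome summary file may be laid out differently.

```python
import csv
import io
from collections import Counter, defaultdict


def summarize(rows):
    """Count total, passed, and per-error samples by run, participant, and overall."""
    by_run = defaultdict(Counter)
    by_participant = defaultdict(Counter)
    overall = Counter()
    for row in rows:
        # Assumption: an empty error field means the sample passed QC.
        outcome = row['error'] or 'passed'
        for counter in (by_run[row['run']],
                        by_participant[row['participant_id']],
                        overall):
            counter['total'] += 1
            counter[outcome] += 1
    return by_run, by_participant, overall


# Hypothetical outcome summary content; the real file layout may differ.
example = io.StringIO("""\
run,participant_id,sample,error
run1,P1,s1,
run1,P1,s2,no_primer
run1,P2,s3,non_hiv
run2,P2,s4,
""")
by_run, by_participant, overall = summarize(csv.DictReader(example))
print(overall['total'], overall['passed'], overall['no_primer'])
```

Updating all three counters in one pass means each row is read only once, and any error code that turns up gets its own count without being listed in advance.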

donkirkby commented 3 years ago

Update on item g from Zabrina:

The “hiv_but_failed” error does not exist - rather, the final error category is “multiple contigs”.

donkirkby commented 3 years ago

After discussion, we decided to group the error counts into these columns:

no_sequence
- self.no_sequence = 'no contig/conseq constructed'
non_hiv
- self.non_hiv = 'sequence is non-hiv'
no_primer
- self.no_primer = 'primer was not found'
- self.failed_validation = 'primer failed validation'
- self.primer_error = 'primer error'
low_cov
- self.low_internal_cov = 'low internal read coverage'
- self.low_end_cov = 'low end read coverage'
- new error: low coverage
multiple_contigs
- self.multiple_passed = 'sample has multiple QC-passed sequences'
- self.multiple_contigs = 'multiple contigs'

Some errors require further investigation:

self.non_proviral = 'sequence is non-proviral' - Not used anymore? We just don't report on any samples that don't use the NFLHIVDNA project.
self.hiv_but_failed = 'hiv but failed' ??? More general error that is now broken down to a detailed reason?
self.non_tcga = 'sequence contained non-TCGA/gap' ???

cfe-lab / proviral

Statistics generation #2