Add ability to calculate library efficiency/complexity/cluster factor

jfy133 commented 3 years ago

Optional flag for aligned reads before / after deduplication
Also allow for calculation of post deduplicat endogenous

aidaanva commented 3 years ago

@jfy133 see changes in the branch cluster_factor. We need to decide how we name each calculation.

jfy133 commented 3 years ago

cluster_factor > clonality

Still thinking for endogenous....

jfy133 commented 3 years ago

Yeah,

clonality rather than cluster factor

and honestly,I think my prefered for endogenous is

percent on target rather than endogenous DNA

We can still refer to endogenous DNA in help (e.g. percent on-target a.k.a. endogenous) etc., I know it sucks because of the cool name but it's the most accurate and most desciptive/flexible term

jfy133 commented 3 years ago

@aidaanva what do you think?

aidaanva commented 3 years ago

@jfy133 you can go ahead and test the version in the cluster factor branch. If it works as you had in mind I will merge it with the main branch

jfy133 commented 3 years ago

This would be the classic 'eager' run, i.e. normal endogenous (raw mapped / total) & normal cluster factor

$ ./endorS.py tests/samtools_flagstats_prequality.txt -d tests/samtools_flagstats_postdedup.txt 
Only one samtools flagstat file provided
WARNING: No post quality filtering samtools flag provided, no Percent on Target modified post dedup (%) nor clonality calculated
Traceback (most recent call last):
  File "./endorS.py", line 154, in <module>
     "percent_on_target_post": endogenousPost, 
NameError: name 'endogenousPost' is not defined

This technically produces the 'Schroeder' endogenous value, but here is listed as on target modified (this could actually also be calculated in the command in comment 1

$ ./endorS.py tests/samtools_flagstats_prequality.txt tests/samtools_flagstats_postdedup.txt 
Percent on Target raw (%): 2.109394
Percent on Target modified (%): 1.279438
All done!

Multiple things:
- I get a different clonality (cluster factor - 284754/256325 = 1.11) value here
- You could also put for the stdout Clonality (a.k.a. cluster factor), as this isn't a standardised output.
- I screwed up the note on hackmd for the calculation of percent duplicates (my bad). It should be percentage of reads that had duplicates, i.e. the number of reads removed minus total before dedup, divided by total before dedup (and multiplied by 100: (284754 - 256325) / 284754 * 100). The reported 90% is of non-duplicates, not percentage that were duplicates. PICARD docs: https://broadinstitute.github.io/picard/picard-metric-definitions.html
```
$ ./endorS.py tests/samtools_flagstats_prequality.txt tests/samtools_flagstats_postquality.txt -d tests/samtools_flagstats_postdedup.txt 
Percent on Target raw (%): 2.109394
Percent on Target modified (%): 1.42134
Percent on Target modified post deduplication (%): 1.279438
Clonality: 1.648688
Percent Duplicates (%): 90.016295
All done!
```

jfy133 commented 3 years ago

Also, if it's Ok with you could you make a PR and I review before merging into master? Woudl like to make a couple of suggestions for the help

aidaanva commented 3 years ago

Ok, so I have now modified the calculation of duplicates. however I have comments for your points. Point 1: you can't calculate clonality this way, as this should be done post quality filtering, so you need to provide the 3 files here. That's what we already talked about. Point 2: This is not how endors.Py should be run. I can't organically change the name of the variable without knowing how you are going to use this, the code was thought to calculate endogenous DNA pre quality and post quality filtering. if you want to do that I need to add more flags. Point 3: I calculate cluster factor as: mappedPre / mappedPostD . What you are calculating I believe is mappedPost / mappedPostD. We should have a discussion about how to calculate this.

aidaanva commented 2 years ago

Comments from meeting 30/09/2022:

Clonality Calculation: mappedPost / mappedPostD
Fix the % of Duplicates (it should be: 100 - the value currently calculated)
Provide each of the flagstats with one specific flag (--raw ; --qf; --dedup) 1 -> % on target raw (mappedPreQF/#total)100 2 -> % on target modified (mappedPostQF/#total)100 3 -> % on target PostDedup (mappedPostQFDedup/#total)100 4 -> Clonality (mapped(postQF)/mapped(PostQF)Dedup) 5 -> % duplicates (1-(mapped(postQF)-mapped(PostQF)Dedup)/mappedPostQF)100
Combinations:

raw --> 1
QF --> ERROR
Dedup --> 4 and 5
raw + QF --> 1 and 2
raw + Dedup --> 1,3,4,5
QF + Dedup --> 4 and 5
raw + QF + Dedup --> 1,2,3,4,5

Make a function for % target, clonality and one for % duplicates.

aidaanva commented 2 years ago

@jfy133 please test the changes on the cluster-factor branch

jfy133 commented 2 years ago

@aidaanva it would be easier if you could open a PR and I can comment there (I noticed a few typos, for example) but in the mean time:

I would make a test directory and put all the test files in there
- We could then set up GitHub actions and just run a few tests to make sure the numbers don't change
I would make your flag parameters consistent, double dash long, single dash single character
- -qF should be -q, -dedup should be just -d
help for --raw: not adpater removal performed - this isn't correct, just no quality filtering not deduplciation. You should do AR before mapping!
You should add a --vebose flag to throw the messages like here:
```
No samtools flagstat --qualityfiltered provided. 
WARNING: Percent on Target Modified stat will not be calculated
No samtools flagstat --deduplicated provided. 
WARNING: Percent on Target PostDedup, Clonality and Percent of Duplicates stats will not be calculated!
Percent on Target raw (%): 2.109394
All done!
```
Because firstly a user would expect only a single flag (they can then pipe to parse for downstream steps, and not have to ignore a bunch of other things), also these aren't really WARNINS, if you supply just -r you normally only expect to get % on target anyway, so it's a possibly a bit confusing.
Combination 1: Fine ☑️
Combination 2: Fine ☑️ but should exit code something other than 0 (maybe look up for standard exit codes for wrong user input or something like here)
Combination 4 above (raw/qf) I would keep phrasing consistent - quality filtered not modifed given we've decided to be more explicit
Combination 5: typo Percent on Tardet PostDedup (%): 1.279438 (also shoud make Target lower case everywhere)
Combination 6: Fine ☑️
Combination 7: Fine ☑️

I wonder if we should document somehwere that ALL %on target values are ALWAYS based on the raw reads into mapping, so endorSpy will never report with the dinnomator somewhere else? I guess you need to updat ethe README anyway

aidaanva / endorS.py

Add ability to calculate library efficiency/complexity/cluster factor #6