Closed jfy133 closed 1 year ago
@jfy133 see changes in the branch cluster_factor. We need to decide how we name each calculation.
cluster_factor > clonality
Still thinking for endogenous....
Yeah,
and honestly,I think my prefered for endogenous is
We can still refer to endogenous DNA in help (e.g. percent on-target a.k.a. endogenous) etc., I know it sucks because of the cool name but it's the most accurate and most desciptive/flexible term
@aidaanva what do you think?
@jfy133 you can go ahead and test the version in the cluster factor branch. If it works as you had in mind I will merge it with the main branch
This would be the classic 'eager' run, i.e. normal endogenous (raw mapped / total) & normal cluster factor
$ ./endorS.py tests/samtools_flagstats_prequality.txt -d tests/samtools_flagstats_postdedup.txt
Only one samtools flagstat file provided
WARNING: No post quality filtering samtools flag provided, no Percent on Target modified post dedup (%) nor clonality calculated
Traceback (most recent call last):
File "./endorS.py", line 154, in <module>
"percent_on_target_post": endogenousPost,
NameError: name 'endogenousPost' is not defined
This technically produces the 'Schroeder' endogenous value, but here is listed as on target modified (this could actually also be calculated in the command in comment 1
$ ./endorS.py tests/samtools_flagstats_prequality.txt tests/samtools_flagstats_postdedup.txt
Percent on Target raw (%): 2.109394
Percent on Target modified (%): 1.279438
All done!
Multiple things:
284754/256325 = 1.11
) value here stdout
Clonality (a.k.a. cluster factor), as this isn't a standardised output.(284754 - 256325) / 284754 * 100
). The reported 90% is of non-duplicates, not percentage that were duplicates. PICARD docs: https://broadinstitute.github.io/picard/picard-metric-definitions.html$ ./endorS.py tests/samtools_flagstats_prequality.txt tests/samtools_flagstats_postquality.txt -d tests/samtools_flagstats_postdedup.txt
Percent on Target raw (%): 2.109394
Percent on Target modified (%): 1.42134
Percent on Target modified post deduplication (%): 1.279438
Clonality: 1.648688
Percent Duplicates (%): 90.016295
All done!
Also, if it's Ok with you could you make a PR and I review before merging into master? Woudl like to make a couple of suggestions for the help
Ok, so I have now modified the calculation of duplicates. however I have comments for your points. Point 1: you can't calculate clonality this way, as this should be done post quality filtering, so you need to provide the 3 files here. That's what we already talked about. Point 2: This is not how endors.Py should be run. I can't organically change the name of the variable without knowing how you are going to use this, the code was thought to calculate endogenous DNA pre quality and post quality filtering. if you want to do that I need to add more flags. Point 3: I calculate cluster factor as: mappedPre / mappedPostD . What you are calculating I believe is mappedPost / mappedPostD. We should have a discussion about how to calculate this.
Comments from meeting 30/09/2022:
Make a function for % target, clonality and one for % duplicates.
@jfy133 please test the changes on the cluster-factor branch
@aidaanva it would be easier if you could open a PR and I can comment there (I noticed a few typos, for example) but in the mean time:
I would make a test directory and put all the test files in there
I would make your flag parameters consistent, double dash long, single dash single character
-qF
should be -q
, -dedup
should be just -d
help for --raw
: not adpater removal performed - this isn't correct, just no quality filtering not deduplciation. You should do AR before mapping!
You should add a --vebose
flag to throw the messages like here:
No samtools flagstat --qualityfiltered provided.
WARNING: Percent on Target Modified stat will not be calculated
No samtools flagstat --deduplicated provided.
WARNING: Percent on Target PostDedup, Clonality and Percent of Duplicates stats will not be calculated!
Percent on Target raw (%): 2.109394
All done!
Because firstly a user would expect only a single flag (they can then pipe to parse for downstream steps, and not have to ignore a bunch of other things), also these aren't really WARNINS, if you supply just -r
you normally only expect to get % on target anyway, so it's a possibly a bit confusing.
Combination 1: Fine ☑️
Combination 2: Fine ☑️ but should exit code something other than 0
(maybe look up for standard exit codes for wrong user input or something like here)
Combination 4 above (raw/qf) I would keep phrasing consistent - quality filtered
not modifed
given we've decided to be more explicit
Combination 5: typo Percent on Tardet PostDedup (%): 1.279438
(also shoud make Target
lower case everywhere)
Combination 6: Fine ☑️
Combination 7: Fine ☑️
I wonder if we should document somehwere that ALL %on target
values are ALWAYS based on the raw reads into mapping, so endorSpy will never report with the dinnomator somewhere else? I guess you need to updat ethe README anyway