aidaanva / endorS.py

endorS.py calculates endogenous DNA from samtools flagstat files and prints the result to screen
MIT License

Add ability to calculate library efficiency/complexity/cluster factor #6

Closed jfy133 closed 1 year ago

jfy133 commented 3 years ago
aidaanva commented 3 years ago

@jfy133 see changes in the branch cluster_factor. We need to decide how we name each calculation.

jfy133 commented 3 years ago

cluster_factor > clonality

Still thinking for endogenous....

jfy133 commented 3 years ago

Yeah,

and honestly, I think my preferred term for endogenous is

We can still refer to endogenous DNA in the help (e.g. percent on-target a.k.a. endogenous) etc. I know it sucks because of the cool name, but it's the most accurate and most descriptive/flexible term.

jfy133 commented 3 years ago

@aidaanva what do you think?

aidaanva commented 3 years ago

@jfy133 you can go ahead and test the version in the cluster factor branch. If it works as you had in mind I will merge it with the main branch

jfy133 commented 3 years ago
  1. This would be the classic 'eager' run, i.e. normal endogenous (raw mapped / total) & normal cluster factor

    $ ./endorS.py tests/samtools_flagstats_prequality.txt -d tests/samtools_flagstats_postdedup.txt 
    Only one samtools flagstat file provided
    WARNING: No post quality filtering samtools flag provided, no Percent on Target modified post dedup (%) nor clonality calculated
    Traceback (most recent call last):
      File "./endorS.py", line 154, in <module>
         "percent_on_target_post": endogenousPost, 
    NameError: name 'endogenousPost' is not defined
  2. This technically produces the 'Schroeder' endogenous value, but here it is listed as on target modified (this could actually also be calculated in the command in comment 1)

    $ ./endorS.py tests/samtools_flagstats_prequality.txt tests/samtools_flagstats_postdedup.txt 
    Percent on Target raw (%): 2.109394
    Percent on Target modified (%): 1.279438
    All done!
  3. Multiple things:

    • I get a different clonality (cluster factor - 284754/256325 = 1.11) value here
    • You could also put for the stdout Clonality (a.k.a. cluster factor), as this isn't a standardised output.
    • I screwed up the note on hackmd for the calculation of percent duplicates (my bad). It should be the percentage of reads that were duplicates, i.e. the total before dedup minus the total after dedup (the number of reads removed), divided by the total before dedup and multiplied by 100: (284754 - 256325) / 284754 * 100. The reported 90% is the percentage of non-duplicates, not the percentage that were duplicates. PICARD docs: https://broadinstitute.github.io/picard/picard-metric-definitions.html
    $ ./endorS.py tests/samtools_flagstats_prequality.txt tests/samtools_flagstats_postquality.txt -d tests/samtools_flagstats_postdedup.txt 
    Percent on Target raw (%): 2.109394
    Percent on Target modified (%): 1.42134
    Percent on Target modified post deduplication (%): 1.279438
    Clonality: 1.648688
    Percent Duplicates (%): 90.016295
    All done!
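
The arithmetic in the third bullet can be checked with a short Python sketch. It uses only the two read counts quoted above; the function and variable names are illustrative and are not taken from endorS.py.

```python
# Read counts quoted in the comment above.
reads_before_dedup = 284754
reads_after_dedup = 256325


def percent_duplicates(before: int, after: int) -> float:
    """Share of reads removed by deduplication, as a percentage."""
    return (before - after) / before * 100


def percent_non_duplicates(before: int, after: int) -> float:
    """Share of reads kept after deduplication, as a percentage."""
    return after / before * 100


def cluster_factor(before: int, after: int) -> float:
    """Reads before deduplication per read remaining after it."""
    return before / after


dup = percent_duplicates(reads_before_dedup, reads_after_dedup)
non_dup = percent_non_duplicates(reads_before_dedup, reads_after_dedup)
cf = cluster_factor(reads_before_dedup, reads_after_dedup)

print(f"Percent duplicates (%): {dup:.6f}")
print(f"Percent non-duplicates (%): {non_dup:.6f}")
print(f"Cluster factor: {cf:.2f}")
```

The non-duplicate percentage is the complement of the duplicate percentage, which is why a value near 90% shows up when the two are swapped.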
    
jfy133 commented 3 years ago

Also, if it's OK with you, could you make a PR so I can review before merging into master? Would like to make a couple of suggestions for the help.

aidaanva commented 3 years ago

Ok, so I have now modified the calculation of duplicates. However, I have comments on your points.

Point 1: You can't calculate clonality this way, as this should be done post quality filtering, so you need to provide the 3 files here. That's what we already talked about.

Point 2: This is not how endorS.py should be run. I can't organically change the name of the variable without knowing how you are going to use this; the code was designed to calculate endogenous DNA pre and post quality filtering. If you want to do that, I need to add more flags.

Point 3: I calculate cluster factor as mappedPre / mappedPostD. What you are calculating, I believe, is mappedPost / mappedPostD. We should have a discussion about how to calculate this.
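
The two definitions under discussion can be put side by side. The sketch below is an illustration, not code from endorS.py: the variable names are made up, and it assumes that the three percent on-target values printed by the three-file run share the same denominator, so that their ratios equal the ratios of the underlying mapped read counts.

```python
# Percent on-target values printed by the three-file run earlier in the thread.
pct_raw = 2.109394         # Percent on Target raw (%)
pct_post_qf = 1.42134      # Percent on Target modified (%)
pct_post_dedup = 1.279438  # Percent on Target modified post deduplication (%)

# Assumption: with a shared denominator, a ratio of percentages equals
# the ratio of the mapped read counts behind them.

# Definition A: mappedPre / mappedPostD
cluster_factor_pre = pct_raw / pct_post_dedup

# Definition B: mappedPost / mappedPostD
cluster_factor_post = pct_post_qf / pct_post_dedup

print(f"mappedPre / mappedPostD:  {cluster_factor_pre:.4f}")
print(f"mappedPost / mappedPostD: {cluster_factor_post:.4f}")
```

Under that assumption, definition A lands on the reported Clonality of about 1.65, and definition B on the roughly 1.11 obtained from 284754 / 256325.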

aidaanva commented 2 years ago

Comments from meeting 30/09/2022:

  1. raw --> 1
  2. QF --> ERROR
  3. Dedup --> 4 and 5
  4. raw + QF --> 1 and 2
  5. raw + Dedup --> 1,3,4,5
  6. QF + Dedup --> 4 and 5
  7. raw + QF + Dedup --> 1,2,3,4,5

Make a function for % target, clonality and one for % duplicates.
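
The input/output table from the meeting could be encoded as a lookup, sketched below. This is a hypothetical illustration rather than the code on the branch: the input labels (`raw`, `qf`, `dedup`), the dictionary, and the function name are all invented here, and the output numbers are kept as plain integers because the notes do not spell out what each number stands for.

```python
# Output numbers (as written in the meeting notes) for each input combination.
OUTPUTS_BY_INPUTS = {
    frozenset({"raw"}): [1],
    frozenset({"dedup"}): [4, 5],
    frozenset({"raw", "qf"}): [1, 2],
    frozenset({"raw", "dedup"}): [1, 3, 4, 5],
    frozenset({"qf", "dedup"}): [4, 5],
    frozenset({"raw", "qf", "dedup"}): [1, 2, 3, 4, 5],
}


def outputs_for(*inputs):
    """Return the output numbers produced for the given flagstat inputs."""
    key = frozenset(inputs)
    if key == frozenset({"qf"}):
        # Case 2 in the notes: a quality-filtered file on its own is an error.
        raise ValueError("a quality-filtered flagstat file alone is not enough")
    if key not in OUTPUTS_BY_INPUTS:
        raise ValueError(f"unsupported input combination: {sorted(key)}")
    return OUTPUTS_BY_INPUTS[key]


print(outputs_for("raw", "dedup"))
print(outputs_for("raw", "qf", "dedup"))
```

Keeping the table in one place makes it easy to check that each of the seven cases behaves as agreed, independently of the three calculation functions.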

aidaanva commented 2 years ago

@jfy133 please test the changes on the cluster-factor branch

jfy133 commented 2 years ago

@aidaanva it would be easier if you could open a PR so I can comment there (I noticed a few typos, for example), but in the meantime:

I wonder if we should document somewhere that ALL % on target values are ALWAYS based on the raw reads going into mapping, so endorS.py will never report a value with a different denominator? I guess you need to update the README anyway.