FrickTobias / BLR

MIT License

Chunk specific runtime artefacts in clusterrmdup #229

Closed pontushojer closed 4 years ago

pontushojer commented 4 years ago

The clusterrmdup step (find_clusterdups once PR https://github.com/NBISweden/BLR/pull/30 merges) takes much longer for chrY than for other chromosomes. See the graphs below, generated from data in /proj/uppstore2018173/private/pontus/runs/200819._synchronise-merges_rerun. I have also seen this in other runs.

Screenshot 2020-09-01 at 14 39 12

Runtime vs mean coverage for each chromosome. For chrM this is the sum of all small contigs that make up this "chunk".

Screenshot 2020-09-01 at 14 39 30

Runtime vs total contig length for each chromosome. For chrM this is the sum of all small contigs that make up this "chunk".

From these figures it is clear that chrY, for some reason, takes longer than would be predicted from coverage and contig length. Chr16 also breaks this pattern somewhat.

What could be the reason for this?
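One way to make "longer than predicted" concrete is to compare runtime per unit of work across chunks. The sketch below is only an illustration, not code from BLR: it assumes the expected runtime scales with mean coverage times contig length (the product is an assumption; the plots above show each variable separately), and the function name, threshold, and example numbers are all made up.

```python
from statistics import median


def flag_slow_chunks(chunks, factor=2.0):
    """Return names of chunks whose runtime per unit of work is unusually high.

    chunks maps a chunk name to (runtime_s, mean_coverage, length_bp).
    "Work" is approximated as mean_coverage * length_bp, which is an
    assumption made for this sketch.
    """
    rates = {
        name: runtime / (coverage * length)
        for name, (runtime, coverage, length) in chunks.items()
    }
    typical = median(rates.values())
    # A chunk is flagged when its rate exceeds `factor` times the median rate.
    return sorted(name for name, rate in rates.items() if rate > factor * typical)


# Made-up numbers for illustration only; not taken from the run above.
example = {
    "chunkA": (100.0, 30.0, 1_000_000),
    "chunkB": (210.0, 30.0, 2_000_000),
    "chunkC": (55.0, 15.0, 1_000_000),
    "chunkD": (400.0, 10.0, 500_000),
}
print(flag_slow_chunks(example))
```

Using the median rather than the mean keeps a single extreme chunk from shifting the baseline it is compared against.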

marcelm commented 4 years ago

chr1 also doesn’t follow the pattern. Or is this an artifact?

pontushojer commented 4 years ago

Yeah, it's true that chr1 also takes longer than expected. It was not as striking as chrY, but this might still relate to a shared issue.

pontushojer commented 4 years ago

I checked a separate dataset and compared clusterrmdup against all other rules that are run for the chunks. Here chr1, chr16, chr21 (somewhat less so) and chrY stick out from the rest (see the red trace for clusterrmdup).

Screenshot 2020-09-03 at 17 50 42
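A comparison like this can also be done numerically. The following is a rough sketch under stated assumptions, not the method used for the plot: each rule's runtimes are scaled by that rule's median across chunks, and a chunk is flagged when the chosen rule's scaled runtime is well above the median of the other rules for the same chunk. The rule names other than clusterrmdup, the chunk names, the threshold, and the numbers are hypothetical.

```python
from statistics import median


def relative_runtime(runtimes):
    """Scale each rule's runtimes by that rule's median across chunks."""
    scaled = {}
    for rule, per_chunk in runtimes.items():
        typical = median(per_chunk.values())
        scaled[rule] = {chunk: t / typical for chunk, t in per_chunk.items()}
    return scaled


def chunks_where_rule_sticks_out(runtimes, rule, factor=1.5):
    """Return chunks where `rule` is slow relative to the other rules."""
    scaled = relative_runtime(runtimes)
    others = [r for r in scaled if r != rule]
    flagged = []
    for chunk, value in scaled[rule].items():
        # Baseline: how the other rules behave on this same chunk.
        baseline = median(scaled[r][chunk] for r in others)
        if value > factor * baseline:
            flagged.append(chunk)
    return sorted(flagged)


# Made-up runtimes in seconds; rule_a and rule_b are placeholder names.
example_runtimes = {
    "clusterrmdup": {"c1": 40.0, "c2": 20.0, "c3": 30.0},
    "rule_a": {"c1": 10.0, "c2": 20.0, "c3": 30.0},
    "rule_b": {"c1": 5.0, "c2": 10.0, "c3": 15.0},
}
print(chunks_where_rule_sticks_out(example_runtimes, "clusterrmdup"))
```

Scaling per rule first means rules with very different absolute runtimes can still be compared chunk by chunk.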