AfshinLab / BLR

Update cluster duplicate calling #39

Closed by pontushojer 4 years ago

pontushojer commented 4 years ago

Fixes https://github.com/FrickTobias/BLR/issues/218 and https://github.com/FrickTobias/BLR/issues/229.

Changes include:

Test run

FASTQ = /proj/uppstore2018173/private/rawdata/190510.HiSeq.emTn5.Next.reseq_4.XIV-XV/XV.reseq_4.R2.fastq.gz

Check that https://github.com/FrickTobias/BLR/issues/218 is fixed

To check that https://github.com/FrickTobias/BLR/issues/218 is fixed, I used the final.barcode-merges.csv file to look at the barcodes that the most other barcodes are merged into in the find_clusterdups step. The top ten for each version are listed below, each preceded by its count.

Old version

1434339 AAAAAGAAGGTCGTTCCTAG-1
    237 CAAGGTACTTTGGGACGGCA-1
    151 ATCACGAAGTCAGTCCTACC-1
    144 CAAATACATTTCTTCCGTAG-1
    130 AGCCGTAGTACCTAAACTTC-1
    128 CAAACGCCCGTCGGAGCTTC-1
    121 AATATGTATTCGGTTCTTAC-1
    116 ATTGGGAGGGCGTTAACTTG-1
    111 ATACTACAGATAGTTGTGTG-1
    110 CAAAGGCCTGTCGAAACATC-1

New version

   7238 AAAAGGTATGTAGAACGATC-1
   3628 AAACTTTCCTTGCTAATATA-1
   2337 AAAACGTGGGCATAACTAAG-1
   1853 AAAATTTGTGTACGAACAAC-1
   1812 AAAATTCCCGAAGTCCTAAG-1
   1782 AAACGATGCATGGTTATTAA-1
   1732 AAAGGGCCTTCGGTAGGGTC-1
   1587 AATCCAACGGTCCGTATGTA-1
   1520 AAACTGAGGATGGTTATTTG-1
   1434 AACGCACATGAGCTCCTTCA-1

From this it is clear that far fewer barcodes are assigned to the top cluster in the new version: 7238, compared to 1434339 in the old version.
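A tally like the ones above could be produced with a few lines of Python. This is only a minimal sketch: it assumes that final.barcode-merges.csv has two comma-separated columns with no header, the merged barcode first and the barcode it was merged into second. That layout is an assumption and should be checked against the actual file. The function name `top_merge_targets` and the example barcodes are made up for illustration.

```python
import csv
import io
from collections import Counter


def top_merge_targets(handle, n=10):
    """Count how many barcodes were merged into each target barcode.

    Assumes rows of the form: merged_barcode,target_barcode (no header).
    """
    counts = Counter()
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        counts[row[1]] += 1
    return counts.most_common(n)


# Tiny made-up example, not real data.
example = io.StringIO(
    "AAAT-1,AAAA-1\n"
    "AAAC-1,AAAA-1\n"
    "CCCT-1,CCCC-1\n"
)
top = top_merge_targets(example)
for barcode, count in top:
    print(f"{count:>7} {barcode}")
```

For a real run, the `io.StringIO` object would be replaced by an open file handle, for example `open("final.barcode-merges.csv", newline="")`.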

Check that https://github.com/FrickTobias/BLR/issues/229 is fixed

To check that https://github.com/FrickTobias/BLR/issues/229 is fixed, I collected runtime stats for the rule find_clusterdups from the snakemake output log for each chunk. The data was compiled into the graph below.

[Image: graph of find_clusterdups runtime per chunk]

From this it is clear that in the new version the runtime is shorter and more even relative to chunk size.
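One possible way to pull per-job runtimes for a rule out of a snakemake log is sketched below. It is not necessarily how the numbers above were collected. It assumes the log uses bracketed timestamps such as `[Tue Mar 10 12:00:00 2020]`, followed either by a `rule <name>:` block containing a `jobid:` line or by a `Finished job <id>.` line. The exact layout differs between snakemake versions, so the patterns may need adjusting. The helper name `rule_runtimes` and the sample log are made up for illustration.

```python
import re
from datetime import datetime

# Patterns for the assumed log layout (may differ between snakemake versions).
TIMESTAMP = re.compile(r"^\[(.+)\]$")
RULE = re.compile(r"^rule (\S+):$")
JOBID = re.compile(r"^\s+jobid: (\d+)$")
FINISHED = re.compile(r"^Finished job (\d+)\.$")


def rule_runtimes(lines, rule_name):
    """Return {jobid: runtime in seconds} for finished jobs of rule_name."""
    started = {}
    runtimes = {}
    last_time = None
    current_rule = None
    for line in lines:
        line = line.rstrip("\n")
        match = TIMESTAMP.match(line)
        if match:
            last_time = datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
            current_rule = None
            continue
        match = RULE.match(line)
        if match:
            current_rule = match.group(1)
            continue
        match = JOBID.match(line)
        if match and current_rule == rule_name:
            started[int(match.group(1))] = last_time
            continue
        match = FINISHED.match(line)
        if match and int(match.group(1)) in started:
            jobid = int(match.group(1))
            runtimes[jobid] = (last_time - started[jobid]).total_seconds()
    return runtimes


# Made-up log excerpt in the assumed format, not real output.
sample_log = """\
[Tue Mar 10 12:00:00 2020]
rule find_clusterdups:
    input: a.bam
    jobid: 5

[Tue Mar 10 12:00:10 2020]
rule other:
    jobid: 6

[Tue Mar 10 12:02:30 2020]
Finished job 5.
[Tue Mar 10 12:03:00 2020]
Finished job 6.
"""
runtimes = rule_runtimes(sample_log.splitlines(), "find_clusterdups")
print(runtimes)
```

Note that the time between the start and finish log entries includes any scheduling overhead, so it is only an approximation of the actual compute time per chunk.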