emybart415 / Deduper-ebart


Deduper

Part 1

Use this repo template to create your own Deduper repo - you should do all your work in your own repository. Please name it Deduper-<github-user-name>.

Write up a strategy for writing a Reference Based PCR Duplicate Removal tool. That is, given a sorted SAM file of uniquely mapped reads, remove all PCR duplicates (retain only a single copy of each read). Develop a strategy that avoids loading everything into memory. You should not write any code for this portion of the assignment. Be sure to:

For this portion of the assignment, you should design your algorithm for single-end data, with 96 UMIs. UMI information will be in the QNAME, like so: NS500451:154:HWKTMBGXX:1:11101:15364:1139:GAACAGGT. Discard any UMIs with errors (or think about how you might error correct, if you're feeling ambitious).
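When it comes time to implement, the UMI check described above might look like the following minimal sketch. It assumes the UMI is always the last colon-separated field of the QNAME, as in the example read name. The function names and the second UMI in `known_umis` are hypothetical, chosen only for illustration; the real set would hold all 96 known UMIs.

```python
from __future__ import annotations


def extract_umi(qname: str) -> str:
    """Return the last colon-separated field of a QNAME (assumed to be the UMI)."""
    return qname.rsplit(":", 1)[-1]


def has_known_umi(qname: str, known_umis: set[str]) -> bool:
    """True if the read's UMI exactly matches one of the known UMIs."""
    return extract_umi(qname) in known_umis


# Illustrative subset only; "AACGCCAT" is a made-up placeholder UMI.
known_umis = {"GAACAGGT", "AACGCCAT"}

qname = "NS500451:154:HWKTMBGXX:1:11101:15364:1139:GAACAGGT"
print(extract_umi(qname))
print(has_known_umi(qname, known_umis))
```

A read whose UMI is not in the set (for example, one containing an `N`) would be discarded under the "discard any UMIs with errors" rule. Error correction would replace the exact set lookup with a nearest-match search.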

Part 2

An important part of writing code is reviewing code - both your own and others'. In this portion of the assignment, you will be assigned 3 students' pseudocode algorithms to review. Be sure to evaluate the following points:

You can find your assigned reviewees on Canvas. You can find your fellow students' repositories at

github.com/<user>/Deduper-<github-user-name>

Be sure to leave comments on their repositories by creating issues or by commenting on the pull request.

Part 3

Write your deduper function!

Given a SAM file of uniquely mapped reads, and a text file containing the known UMIs, remove all PCR duplicates (retain only a single copy of each read). Remember:

You MUST: