Cameron-Grey-Kunstadt / Demultiplex

0 stars 0 forks source link

Pseudocode Review #1

Open graceHach opened 1 month ago

graceHach commented 1 month ago

Hey Cam,

Excellent job defining the problem, and all the helper functions for this program! The logic of the code is very easy to follow. I believe the code meets all of the requirements, but I'm honestly unclear if the: "You should strive to report values for each possible pair of indexes (both swapped and dual matched)" is a requirement or just a suggestion. All but one of the functions (write_out_record) are reasonable and well defined. Likewise, I like that you have a set of reverse compliments of the barcodes instead of immediately generating the reverse compliment of read three, because you only need to create the set once. The ordering of your central conditional ladder is very logical.

That said, there are a few things your pseudocode is missing, some trivial some not.

The trivial: your pseudocode never initializes the counter variables, doesn't specify reading in anything other than the two indices, doesn't actually calculate the Phred scores, and doesn't specify how results are reported. I can still follow this because I'm already familiar with the problem, but referencing variables that have never been defined can be confusing.

The less trivial: You never really specify how the output files are opened, closed, and written to. I don't think it's a bad idea idea to pack the code for writing out a four line record into a function, but it's a little unclear how this works. Are all output files opened before the main iteration? How do you keep track of file handles? You should also definitely close the files when you're done, sometimes the writing won't show up if the files haven't been closed.

A suggestion: Since most of this code appends to files, it might be a good idea to clear all the files when you first begin writing to them so you don't end up accidentally doubling the content of a file. I don't have this in my pseudocode yet, but when you're running and re-running your code during the development stage, this can save you some time and headache.

Feel free to slack me if you have any questions or would like clarification.

Best, Grace

Bendycar commented 1 month ago

Hey Cam,

This pseudocode looks great, easy to follow and seems pretty thorough. Just a few things stood out to me:

1.) You mention writing unknown reads to "the appropriate unknown files", whereas for hopped reads you reference "index_hopped file". Just to be clear, unknown and unmatched reads should be stored in the same way -- with a two files for all unmatched reads in r1 and r2, and similarly for unknown. This is different from how the matched reads should be stored, which is a r1 and r2 file for each index. Make sure your code is doing this, not creating a separate file for each unmatched or unknown pair.

2.) Similar to what Grace said, I'm not sure if this is actually required or just "cherry on top", but your code is currently set up to return just one number for each of the three main groups (unmatched, unknown, and matched). Rather than this, I think you should report a count for every possible pair of indices.

3.) Pretty minor, but make sure you append the reverse complement of index 2 when you append the indices to header.

All in all, super solid and looks like you'll have a pretty easy time writing this up!

-Ben