acrinklaw closed this 2 years ago
This commit should resolve the issue where n identical database match sequences led to n^2 rows in the results (e.g. 5 matches with ASSLAPGATNEKLF became 25 rows in the output). I implemented a find() function that checks whether the combination of input sequence, match sequence, score, and input sequence index is unique. If it is unique, the match is added to the results vector. If it is not (e.g. a match against numbers 2-5 of the 5 identical IEDB sequences), the match is skipped, since the first match will be expanded to 5 rows of results later on using iedb_map.
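To make the n^2 behaviour and the fix concrete, here is a minimal sketch of the idea, written in Python for brevity rather than in the project's own code. The names `build_iedb_map`, `dedupe_matches`, and `expand_matches`, the tuple layout, and the record IDs are all hypothetical; only `iedb_map` and the four-field uniqueness key come from the description above.

```python
from collections import defaultdict


def build_iedb_map(records):
    """Map each database sequence to every record ID that carries it.

    `records` is an iterable of (record_id, sequence) pairs (assumed layout).
    """
    iedb_map = defaultdict(list)
    for record_id, sequence in records:
        iedb_map[sequence].append(record_id)
    return iedb_map


def dedupe_matches(raw_matches):
    """Keep the first occurrence of each
    (input sequence, match sequence, score, input index) tuple."""
    seen = set()
    unique = []
    for match in raw_matches:
        if match not in seen:
            seen.add(match)
            unique.append(match)
    return unique


def expand_matches(matches, iedb_map):
    """Emit one output row per database record sharing the matched sequence."""
    rows = []
    for query, matched, score, query_index in matches:
        for record_id in iedb_map.get(matched, []):
            rows.append((query, matched, score, query_index, record_id))
    return rows


if __name__ == "__main__":
    # Five database records with the same sequence (IDs are made up).
    records = [(i, "ASSLAPGATNEKLF") for i in range(1, 6)]
    iedb_map = build_iedb_map(records)

    # The search reports the same hit once per identical database entry.
    raw = [("ASSLAPGATNEKLF", "ASSLAPGATNEKLF", 14, 0)] * 5

    print(len(expand_matches(raw, iedb_map)))                  # without the check
    print(len(expand_matches(dedupe_matches(raw), iedb_map)))  # with the check
```

A set of seen tuples stands in here for the find() check over the results vector; the effect is the same, in that only the first of the identical matches survives to be expanded.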
I also moved the iedb_map generation below the argument parsing so that the -d parameter is checked for a user-chosen path before iedb_map is generated. Previously, the script built iedb_map from the default value of iedb_file, "data/IEDB_data.tsv", while reading iedb_data from the user's database path, which led to empty output.
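The ordering fix can be sketched the same way, again in Python and under assumptions: the `--database` long option, the helper names, and the two-column TSV layout (record ID, then sequence) are hypothetical. Only the `-d` flag and the default path "data/IEDB_data.tsv" come from the description above.

```python
import argparse
import csv
from collections import defaultdict

DEFAULT_IEDB_FILE = "data/IEDB_data.tsv"


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # -d overrides the default database path; the long name is an assumption.
    parser.add_argument("-d", "--database", default=DEFAULT_IEDB_FILE)
    return parser.parse_args(argv)


def load_iedb_map(path):
    """Build sequence -> [record IDs] from a TSV file.

    Assumed layout: column 0 is the record ID, column 1 is the sequence.
    """
    iedb_map = defaultdict(list)
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) >= 2:
                iedb_map[row[1]].append(row[0])
    return iedb_map


def main(argv=None):
    # Parse first, so the map is built from the path the user actually chose.
    args = parse_args(argv)
    iedb_map = load_iedb_map(args.database)
    return iedb_map
```

Building the map before parsing would always read the default file, so the map and the data loaded from the user's path could disagree, which matches the empty output described above.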
@wchronister @acrinklaw Tested. This is good to go.
@danielmarrama this should produce the same output that the Python script did before and clear up all issues at once. Unity has been achieved. Please give it a test and make sure things look ok. I did not spend much time on minor details such as fixing the readme or making sure the table is in the right order. The last thing I need to do is add a default output file; right now it writes to stdout.