IEDB / TCRMatch

Add output file logic #25

Closed acrinklaw closed 2 years ago

acrinklaw commented 2 years ago

Fixes #21 by adding a new output file parameter. The output defaults to CSV format and is written to a file named "output.csv".

acrinklaw commented 2 years ago

This also adds "doctest.h", which I think looks promising as a very simple, lightweight testing framework:

https://github.com/onqtam/doctest

danielmarrama commented 2 years ago

The process_output.py script will also need to accept .csv files as input in order to get more metadata. Maybe also consider linking the script to the output so that it runs when a flag is given? I'm not sure what the right move is here. It may need to be simplified for the user.

schristley commented 2 years ago

In our use case on VDJServer, we run tcrmatch on large AIRR TSV files. I provided PR #27 to support that input format, but the output loses the context of which rearrangement each CDR3 came from. We essentially have to do a join/match between the original AIRR TSV and the tcrmatch output. That mostly works when the CDR3s are unique, but there is a rare edge case where two rearrangements in the AIRR TSV have the same CDR3. In that case it's harder to match the tcrmatch output to the correct rearrangement record. It would be nice if the sequence_id were included in the tcrmatch output, because then the match would be unambiguous.
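To make the edge case concrete, here is a minimal sketch of the join described above. The data is made up, and the tcrmatch column names (`input_sequence`, `match_sequence`, `score`) are hypothetical placeholders rather than the tool's actual output layout. Using the AIRR `cdr3_aa` field as the join key is also an assumption. The point is only that a CDR3 shared by two rearrangements maps back to more than one `sequence_id`.

```python
import csv
import io
from collections import defaultdict

# Made-up AIRR TSV excerpt: seq1 and seq2 share the same CDR3.
AIRR_TSV = (
    "sequence_id\tcdr3_aa\n"
    "seq1\tASSLGQGYEQY\n"
    "seq2\tASSLGQGYEQY\n"
    "seq3\tASSPTGGYEQY\n"
)

# Made-up tcrmatch output; these column names are assumed for illustration.
TCRMATCH_CSV = (
    "input_sequence,match_sequence,score\n"
    "ASSLGQGYEQY,ASSLGQGYEQY,1.0\n"
)


def rearrangements_by_cdr3(airr_text):
    """Map each CDR3 to every sequence_id that carries it."""
    index = defaultdict(list)
    for row in csv.DictReader(io.StringIO(airr_text), delimiter="\t"):
        index[row["cdr3_aa"]].append(row["sequence_id"])
    return index


def join_matches(airr_text, match_text):
    """Join match rows back to rearrangements using only the CDR3."""
    index = rearrangements_by_cdr3(airr_text)
    joined = []
    for row in csv.DictReader(io.StringIO(match_text)):
        ids = index.get(row["input_sequence"], [])
        joined.append((row["input_sequence"], ids))
    return joined


result = join_matches(AIRR_TSV, TCRMATCH_CSV)
for cdr3, ids in result:
    status = "ambiguous" if len(ids) > 1 else "unique"
    print(cdr3, ids, status)
```

With the CDR3 as the only key, the single match row comes back with two candidate rearrangements, and nothing in the output says which one it belongs to.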

schristley commented 2 years ago

One suggestion is for the tcrmatch output to have headers by default. There could then be different outputs with different headers if need be (that is, I'm thinking that with an AIRR TSV input it adds a sequence_id field to the output). Because Python handles CSV files trivially and doesn't care about the order of the columns when headers are present, the process_output.py script can easily handle the variable output formats.
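As a rough illustration of that idea, the sketch below reads two header variants with `csv.DictReader` from the standard library. The column names `input_sequence` and `score` are hypothetical, and the `read_matches` helper is not part of process_output.py. It only shows that header-based parsing tolerates a reordered layout and an optional `sequence_id` column.

```python
import csv
import io


def read_matches(text):
    """Parse match rows by header name, so column order does not matter."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append({
            # Only present when the input was an AIRR TSV (assumed behavior).
            "sequence_id": row.get("sequence_id"),
            "input_sequence": row["input_sequence"],
            "score": float(row["score"]),
        })
    return records


# Variant without sequence_id (hypothetical plain-input output).
plain = "input_sequence,score\nASSLGQGYEQY,0.98\n"

# Variant with sequence_id and a different column order.
with_ids = "score,sequence_id,input_sequence\n0.98,seq2,ASSLGQGYEQY\n"

plain_records = read_matches(plain)
airr_records = read_matches(with_ids)
print(plain_records)
print(airr_records)
```

Both variants go through the same code path. The only difference downstream is whether `sequence_id` is `None`.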

acrinklaw commented 2 years ago

That's a good point, and something that we've actually run into with the IEDB data as well (that is, having non-unique CDR3 sequences and deciding how to handle them). I think overall this highlights the need to unify the two scripts. At the time of publication I was strictly focused on creating a tool that was as efficient as possible speed-wise, and admittedly I didn't focus on user-friendliness. I think unifying the two and improving support for the AIRR format should be a priority.