yutanagano closed 1 year ago
Thanks Yuta for improving the standardization!
Good idea to allow standardization of datasets where the CDR3 definition includes only the junction but not the conserved flanking residues. Is the extension at the 3' end based on the J gene information? While the conserved residue at the beginning is always a cysteine, there are multiple choices at the 3' end.
@andim
I'm not sure if I understood your question correctly, but tidytcells' junction standardisation is super rudimentary and almost exactly the same as your original code. Basically:
And I noticed that in your original code the final residue could be cysteine again, but I checked https://www.imgt.org/IMGTScientificChart/Nomenclature/IMGT-FRCDRdefinition.html and it seems the 3' residue is always either F or W.
It turns out that in rare instances the final residue can also be a cysteine! I first encountered this in the Dash et al. data, went to the same IMGT definitions page, and assumed it must be an error. However, thanks to Jamie Heather (a lab alum from Benny's group) I learned that there are exceptions. Specifically, the human J gene TRAJ35: https://www.imgt.org/IMGTrepertoire/Proteins/alleles/index.php?species=Homo%20sapiens&group=TRAJ&gene=TRAJ35
My question relates to point 3: always adding an F at the end works in the majority of cases, but not always. As there are three different options for the final amino acid (F/W/C), one might want to extend with one of those three residues using the J gene information (if available).
Hope that makes sense
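For illustration, here is a minimal sketch of what a J-gene-aware extension could look like. This is not tidytcells' API or the code in this PR: `add_flanking_residues` and `J_GENE_FINAL_RESIDUE` are hypothetical names, the input is assumed to lack both conserved residues, and the lookup table contains only the TRAJ35 exception mentioned above. A real table would need to be filled in from the IMGT allele data.

```python
from typing import Optional

# Hypothetical lookup of the conserved 3' residue per J gene.
# Only the TRAJ35 -> "C" entry comes from this thread; any other
# exceptions (e.g. J genes ending in W) would have to be added from IMGT.
J_GENE_FINAL_RESIDUE = {"TRAJ35": "C"}

# Fallback when the J gene is unknown or not in the table.
DEFAULT_FINAL_RESIDUE = "F"


def add_flanking_residues(cdr3: str, j_gene: Optional[str] = None) -> str:
    """Extend a CDR3 that lacks its conserved flanking residues.

    Assumes `cdr3` excludes both the leading cysteine and the
    trailing F/W/C, so both are always added.
    """
    seq = cdr3.strip().upper()

    final_residue = DEFAULT_FINAL_RESIDUE
    if j_gene:
        # Drop allele information such as "*01" before the lookup.
        gene = j_gene.strip().split("*")[0]
        final_residue = J_GENE_FINAL_RESIDUE.get(gene, DEFAULT_FINAL_RESIDUE)

    return "C" + seq + final_residue


print(add_flanking_residues("ASSLGQAYEQY", "TRBJ2-7"))
print(add_flanking_residues("AGQNYGGSQGNLI", "TRAJ35*01"))
```

Falling back to F keeps the current behaviour whenever no J gene is given, so the lookup only changes the output for genes explicitly listed as exceptions.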
I have merged the PR. We can separately work on a more precise way of extending junctional sequences. :)
Update standardize_dataframe with new and improved tidytcells