andim / pyrepseq

Python library for immune repertoire analysis
MIT License

Io update #5

Closed yutanagano closed 1 year ago

yutanagano commented 1 year ago

Update standardize_dataframe to use the new and improved tidytcells.

andim commented 1 year ago

Thanks Yuta for improving the standardization!

Good idea to allow standardization of datasets where the CDR3 definition only includes the junction but not the conserved flanking residues. Is the extension at the 3' end based on the J gene? While the conserved residue at the beginning is always a cysteine, there are multiple choices at the 3' end.

yutanagano commented 1 year ago

@andim

I'm not sure if I understood your question correctly, but tidytcells' junction standardisation is super rudimentary and almost exactly the same as your original code. Basically:

  1. Is it a valid amino acid sequence?
  2. Does it start with a cysteine and end with a phenylalanine or tryptophan?
  3. If not, add a cysteine at the beginning and a phenylalanine at the end (this default behaviour can be changed by setting strict=True, in which case such sequences are simply rejected)
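
As a rough sketch of those three steps (this is not tidytcells' actual code; the function name `standardise_junction_sketch`, the `AMINO_ACIDS` set and the `default_end` parameter are made up for illustration, and the appended residue is assumed to default to F):

```python
# Hypothetical illustration of the three checks; not the tidytcells implementation.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def standardise_junction_sketch(seq, strict=False, default_end="F"):
    """Return a standardised junction string, or None if the input is rejected."""
    seq = seq.strip().upper()

    # 1. Is it a valid amino acid sequence?
    if not seq or not set(seq) <= AMINO_ACIDS:
        return None

    # 2. Does it start with a cysteine and end with F or W?
    if seq[0] == "C" and seq[-1] in ("F", "W"):
        return seq

    # 3. If not: reject in strict mode, otherwise add the flanking residues.
    if strict:
        return None
    return "C" + seq + default_end


print(standardise_junction_sketch("ASSLGQAYEQY"))
print(standardise_junction_sketch("ASSLGQAYEQY", strict=True))
```

With strict=False the second branch never rejects a valid amino acid string, so anything that is not already flanked by C and F/W gets both residues added.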

I also noticed that in your original code the final residue could be a cysteine as well, but I checked https://www.imgt.org/IMGTScientificChart/Nomenclature/IMGT-FRCDRdefinition.html and it seems the residue at the 3' end is always either F or W.

andim commented 1 year ago

It turns out that in rare instances the final residue can also be a cysteine! I first encountered this in the Dash et al. data, went to the same IMGT definitions page, and assumed it must be an error. However, thanks to Jamie Heather (a lab alum from Benny's group) I learned that there are exceptions. Specifically, the human J gene TRAJ35: https://www.imgt.org/IMGTrepertoire/Proteins/alleles/index.php?species=Homo%20sapiens&group=TRAJ&gene=TRAJ35

My question relates to point 3: Always adding an F at the end works in the majority of cases, but not always. As there are three different options for the final amino acid (F/W/C), one might want to extend with one of those three residues using the J gene information (if available).
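
Something along these lines is what I have in mind, as a minimal sketch only. The function name `extend_junction_sketch` and the lookup table `J_END_EXCEPTIONS` are hypothetical, and the table is deliberately incomplete: it lists only the TRAJ35 case mentioned above, so a real version would need the final residue for every J gene (including those ending in W).

```python
# Hypothetical lookup of J genes whose conserved final residue is not F.
# Only the TRAJ35 exception discussed in this thread is listed; this is a placeholder.
J_END_EXCEPTIONS = {"TRAJ35": "C"}


def extend_junction_sketch(cdr3, j_gene=None, default_end="F"):
    """Add the conserved flanking residues, using the J gene if it is known."""
    if j_gene is None:
        end = default_end
    else:
        gene = j_gene.split("*")[0]  # drop an allele suffix such as "*01"
        end = J_END_EXCEPTIONS.get(gene, default_end)
    return "C" + cdr3 + end


print(extend_junction_sketch("AGQ", j_gene="TRAJ35*01"))
print(extend_junction_sketch("ASSLGQAYEQY"))
```

Falling back to F when the J gene is missing or not in the table keeps the current behaviour for the majority of cases.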

Hope that makes sense

andim commented 1 year ago

I have merged the PR. We can separately work on a more precise way of extending junctional sequences. :)