ddiem-ri-4D / epiTCR

epiTCR: a highly sensitive predictor for TCR–peptide binding
https://github.com/ddiem-ri-4D/epiTCR
MIT License

About the datasets: TCR sequences almost never start with C and end with F #6

Open yanpinlu opened 10 months ago

yanpinlu commented 10 months ago

Most of the TCR sequences downloaded from the databases I reviewed begin with C and end with F, and some articles also mention that this is characteristic of CDR3 sequences. Why do almost all TCR sequences in your dataset begin with A and end with no fixed residue? @nttvy

nttvy commented 10 months ago

@yanpinlu We learned about this preprocessing step from the NetTCR paper. In general, if all the sequences share the same beginning and ending motif, removing those characters does not cause any problems in the learning process but reduces the dimensionality of the data.
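For illustration, a minimal sketch of this kind of trimming step (assuming CDR3 sequences stored as plain strings, with C and F as the conserved first and last residues; this is not the actual NetTCR preprocessing code, which is not published):

```python
# Trim the conserved leading C and trailing F from CDR3 sequences.
# If every sequence shares the same start/end motif, dropping those
# residues loses no information but shortens the input by two positions.
def trim_cdr3(seq: str) -> str:
    if seq.startswith("C") and seq.endswith("F"):
        return seq[1:-1]
    return seq  # leave non-conforming sequences untouched

cdr3s = ["CASSLGTDTQYF", "CASSIRSSYEQYF"]
trimmed = [trim_cdr3(s) for s in cdr3s]
# trimmed == ["ASSLGTDTQY", "ASSIRSSYEQY"]
```

This would also explain why trimmed sequences mostly begin with A: the second residue of a `CASS...F` CDR3 becomes the first after trimming.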

yanpinlu commented 10 months ago

@nttvy Thank you very much for your answer, but I can't find the CDR3-sequence processing in the NetTCR code; forgive my carelessness, could you point it out for me? I would also like to ask about the negative dataset downloaded from 10X: is it collected from all published epitopes that do not bind, as introduced in the article, or randomly matched from the epitopes of VDJdb, IEDB, etc.?
From the test sets without MHC, I found that the negative datasets of test01 to test15 match only 50 kinds of epitopes, while the positive datasets match more than 400 kinds of epitopes. Will this affect the prediction results?
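The imbalance I describe can be checked with a quick diversity count; the column names (`epitope`, `binder`) below are my assumptions, not the actual test-file schema:

```python
import pandas as pd

# Toy example: count unique epitopes per label to compare the epitope
# diversity of the negative (binder == 0) and positive (binder == 1) sets.
df = pd.DataFrame({
    "epitope": ["GILGFVFTL", "GILGFVFTL", "NLVPMVATV", "KLGGALQAK"],
    "binder":  [1, 1, 0, 0],
})
diversity = df.groupby("binder")["epitope"].nunique()
print(diversity)  # unique epitope counts for negatives and positives
```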

ddiem-ri-4D commented 10 months ago

Hi @yanpinlu

The data used in the NetTCR code has already been preprocessed, so you won't find how the data was processed in the code. The training and testing data shown in NetTCR have been cleaned as explained in the research paper. They do not provide the code for the data preprocessing steps; they only provide the final data to be used directly for training the model.

Thank you for your interest.

Best regards, My Diem

nttvy commented 10 months ago

@ddiem-ri-4D thank you for clarifying the information. @yanpinlu For the first part of the epiTCR paper, the negative data was collected from 10X, under the project "Application Note - A New Way of Exploring Immunity". For the last part of that paper, we generated additional negative data from human wildtype sequences, as described in the paper and the supplementary materials. In the paper, we also showed that part of the prediction on unseen peptides relied on the positive and negative labels of similar sequences learned from the training set. For that reason, we introduced a specific training-set and test-set organization for the last part of the paper.
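As a generic illustration of how negatives can be built by re-pairing TCRs with epitopes they are not recorded to bind (this is only a sketch of the general idea, not the exact epiTCR pipeline, which uses the 10X data and human wildtype sequences described in the paper):

```python
import random

# Known binding pairs (toy data); any (TCR, epitope) combination that is
# never observed as a binder can serve as a candidate negative.
positives = {("CASSLGTDTQYF", "GILGFVFTL"), ("CASSIRSSYEQYF", "NLVPMVATV")}
tcrs = [t for t, _ in positives]
epitopes = [e for _, e in positives]

random.seed(0)
negatives = set()
while len(negatives) < 2:
    pair = (random.choice(tcrs), random.choice(epitopes))
    if pair not in positives:  # keep only pairs never seen as binders
        negatives.add(pair)
```

Note that such random mismatching can concentrate negatives on few epitopes if the epitope pool is small, which relates to the diversity question above.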

yanpinlu commented 10 months ago

Thank you very much for your patience
