Request for Training and Testing Datasets

GfellerLab / MixTCRpred

Predictor of TCR-epitope interactions

Request for Training and Testing Datasets #4

Closed HickeyTao closed 4 months ago

HickeyTao commented 4 months ago

Hi,

First of all, this project is really amazing! It's been incredibly helpful for me.

I have a small request: could you please provide the full datasets used for training and testing each model? This will greatly assist me in reproducing all of your results.

Thanks again for your hard work and dedication!

GiancarloCroce commented 4 months ago

Hi,

Thank you for your interest in our work! In the paper we performed several validations (5-fold cross-validation, leave-one-study-out, leave-one-epitope-out, etc.), so I'm not sure what you mean by "full datasets used for training and testing each model".

On the GitHub page, we provide the file full_training_set_146pmhc.csv, which contains curated sequence data of experimentally validated TCR-epitope pairs (positive cases). Negative cases were generated in silico using two methods:

1. Swapped negatives, where negative TCRs for a given pMHC are sampled from TCRs specific to other pMHCs.
2. Sampling TCRs from negative control datasets.

You can find more details in the "Negative data" paragraph of the Methods section of the paper https://www.nature.com/articles/s41467-024-47461-8.

Hope this helps, Best, Giancarlo

HickeyTao commented 4 months ago

I apologize if I didn't express myself clearly. Here's what I mean: You train a model for each epitope, right? I saw the full_training_set_146pmhc.csv and the "Negative data" paragraph in the Methods section of the paper. However, different samplings of negative data can yield different results. Even with 5-fold cross-validation, this is done with sampled data. Therefore, what I mean to ask is, which negative data did you sample for each epitope when conducting your experiments? I would like to align with the results in your paper. If you have saved this data from your experiments, could you please share it with me?

By the way, I only saw instructions on how to use the tool in the "readme" section. If I have new data and want to retrain the model, how should I proceed? Is there any code available for that?

Thank you very much for your assistance.

GiancarloCroce commented 4 months ago

Hi,

Here is the procedure I follow to train a model for a specific epitope X:

1. Consider all TCRs specific to epitope X from the file "full_training_set_146pmhc.csv". Those are the positive cases.
2. Sample TCRs specific to other epitopes (swapped negatives) and sample TCRs from TCR repertoires of donors (TCRs with unknown specificity). Those are the negative cases.
3. Use such data to train a MixTCRpred model for the epitope X.
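
The steps above can be sketched roughly as follows. This is not the authors' code (none is provided for training); it is a minimal Python illustration in which the function name `build_training_set`, the dictionary layout, the negative counts, and the exclusion of target-positive TCRs from the negatives are all assumptions. TCRs are represented as plain strings here, and fixing `seed` is what makes a particular negative sample reproducible.

```python
import random


def build_training_set(positives_by_epitope, repertoire_tcrs, epitope,
                       n_swapped, n_repertoire, seed=0):
    """Return (tcr, label) pairs for one epitope: 1 = positive, 0 = negative.

    positives_by_epitope: dict mapping epitope name -> list of TCRs (assumed layout).
    repertoire_tcrs: TCRs of unknown specificity from donor repertoires.
    """
    rng = random.Random(seed)  # fixed seed -> same negatives on every run
    positives = list(positives_by_epitope[epitope])
    positive_set = set(positives)

    # Swapped negatives: TCRs annotated for any OTHER epitope.
    # Assumption: TCRs that are also positive for the target are excluded.
    other = sorted(
        {tcr
         for ep, tcrs in positives_by_epitope.items() if ep != epitope
         for tcr in tcrs} - positive_set
    )
    swapped = rng.sample(other, min(n_swapped, len(other)))

    # Repertoire negatives: TCRs of unknown specificity.
    background = sorted(set(repertoire_tcrs) - positive_set)
    sampled = rng.sample(background, min(n_repertoire, len(background)))

    return [(t, 1) for t in positives] + [(t, 0) for t in swapped + sampled]


# Toy data with made-up sequences, for illustration only.
toy_positives = {
    "EPITOPE_X": ["CASSA", "CASSB"],
    "EPITOPE_Y": ["CASSC", "CASSD", "CASSE"],
}
toy_repertoire = ["CASSF", "CASSG", "CASSH", "CASSA"]

data = build_training_set(toy_positives, toy_repertoire, "EPITOPE_X",
                          n_swapped=2, n_repertoire=2, seed=42)
```

Saving the returned pairs together with the seed would let someone else regenerate exactly the same training set later.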

The reported AUCs are robust with respect to the specific set of negatives obtained through sampling. I don't store the training set data; it's generated as required.
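
For readers who want to probe this robustness themselves, one partial check is to keep a set of prediction scores fixed, redraw the negatives under several seeds, and look at the spread of the AUC. The sketch below is an assumption-laden illustration, not the evaluation used in the paper: `roc_auc` and `auc_across_negative_samples` are hypothetical helper names, the scores can come from any predictor, and because nothing is retrained it only measures sensitivity to the choice of test negatives.

```python
import random
import statistics


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_across_negative_samples(pos_scores, neg_pool_scores, n_neg, seeds):
    """Resample n_neg negatives from the pool once per seed and return
    the mean and population standard deviation of the resulting AUCs."""
    aucs = []
    for seed in seeds:
        rng = random.Random(seed)
        neg = rng.sample(list(neg_pool_scores), n_neg)
        labels = [1] * len(pos_scores) + [0] * len(neg)
        aucs.append(roc_auc(labels, list(pos_scores) + neg))
    return statistics.mean(aucs), statistics.pstdev(aucs)
```

A small standard deviation across seeds would be consistent with the statement above; a large one would suggest that the particular negative sample matters for that epitope.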

Also, we do not provide code for retraining the model with new data.

Best, Giancarlo