Dataset availability - GitHub issues

sj584 commented 9 months ago

Hi, I would like to use your curated dataset for further neutralizability prediction.

processing/hiv_reg/dataset_hiv_reg.xlsx
processing/hiv_cls/dataset_hiv_cls.xlsx
processing/cov_cls/dataset_cov_cls.xlsx

As far as I understand, the data processing steps cover the data split and feature embedding (k-mer, PSSM, etc.).

However, when I opened the dataset and compared it with the results section of your paper, the numbers did not match exactly.

Based on the .csv file, I get:

hiv_reg -> total 14,996 / 2121 unseen
hiv_cls -> total 29,394 / 4551 unseen in paper, total 27,738 / 3301 unseen
covid-19 -> total 124,560 / 3939 unseen in paper, total around 4,000

Are there any further processing steps that I missed?

Above all, the Cov-AbDab dataset I could obtain as of now contains only 12,537 entries.

In summary, I would like your answers to the following:

- Can I use the above .csv files as raw data?
- Are there further data processing steps other than feature embedding and splitting, such as reducing data redundancy?
- Why does the SARS-CoV-2 dataset size differ so much between the .csv file and the results section?
- When I ran python processing/hiv_cls/processing.py and opened the .pkl file, I could see the feature embedding, but the dataset was also reduced in size. Why is that?

Thank you in advance

stau-7001 commented 2 months ago

I encountered the same issues and would like to express my concerns as well. Additionally, I found that there are some completely identical samples in both the training and testing data of the COVID-19 dataset. Could you clarify this issue as well?
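
For anyone who wants to reproduce this check, here is a minimal sketch of how exact train/test overlap could be detected. The column names (`antibody_seq`, `antigen_seq`, `label`) and the toy rows are hypothetical; they are not taken from the repository and would need to be replaced with the actual headers.

```python
# Hypothetical column names; adjust to the actual dataset headers.
KEY_FIELDS = ("antibody_seq", "antigen_seq", "label")


def find_shared_samples(train_rows, test_rows, key_fields=KEY_FIELDS):
    """Return test rows whose key fields exactly match some training row."""
    train_keys = {tuple(row[f] for f in key_fields) for row in train_rows}
    return [row for row in test_rows
            if tuple(row[f] for f in key_fields) in train_keys]


# Toy rows for illustration only (not real data).
train = [
    {"antibody_seq": "QVQLV", "antigen_seq": "MFVFL", "label": 1},
    {"antibody_seq": "EVQLL", "antigen_seq": "MFVFL", "label": 0},
]
test = [
    {"antibody_seq": "QVQLV", "antigen_seq": "MFVFL", "label": 1},  # also in train
    {"antibody_seq": "DIQMT", "antigen_seq": "MFVFL", "label": 1},
]

shared = find_shared_samples(train, test)
print(f"{len(shared)} of {len(test)} test rows also appear in train")
```

Comparing on a tuple of fields rather than on whole rows makes it possible to ignore bookkeeping columns such as IDs when deciding whether two samples are identical.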

stzhangjie commented 2 months ago

Hi @sj584 and @stau-7001,

First, I want to express my gratitude for your polite interest in our previous work and your great questions about the data. :-) I cordially apologize for the delayed response. I left the company about 2.5 years ago, and my colleagues and I conducted most of the experiments around 3 years ago. Despite the challenging circumstances, I hope I can do my best to help answer your questions.

Regarding the HIV data: As far as I remember, the differences might be due to the different times when we conducted the experiments. We split the seen Abs’ instances, removed similar instances in the seen test set, and trained the models 20 times using 20 different random seeds. When different seeds are used, the numbers of Ab–Ag pairwise instances and unique Abs in the seen test set vary. Figure 3a shows the data information of seed 18.
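
To illustrate why the counts can depend on the seed, here is a rough sketch of a seeded split followed by removal of test instances that are similar to training instances. It is not the original DeepAAI pipeline: the similarity measure (`difflib.SequenceMatcher` ratio), the threshold, the test fraction, and the toy sequences are all assumptions made for illustration.

```python
import random
from difflib import SequenceMatcher


def split_and_filter(instances, seed, test_fraction=0.2, threshold=0.9):
    """Shuffle with a seed, split, then drop test items similar to any train item.

    The similarity rule below is an assumed stand-in, not the paper's criterion.
    """
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    def too_similar(a, b):
        return SequenceMatcher(None, a, b).ratio() >= threshold

    kept = [t for t in test if not any(too_similar(t, tr) for tr in train)]
    return train, kept


# Toy sequences for illustration only (not real data).
toy = [
    "QVQLVQSGAEVKK", "QVQLVQSGAEVKR", "EVQLLESGGGLVQ", "EVQLLESGGGLVK",
    "DIQMTQSPSSLSA", "DIVMTQSPDSLAV", "QSALTQPASVSGS", "SYELTQPPSVSVA",
    "QVQLQESGPGLVK", "EIVLTQSPGTLSL",
]

for seed in range(3):
    train, kept = split_and_filter(toy, seed)
    print(f"seed={seed}: train={len(train)}, test after filtering={len(kept)}")
```

Because the filter depends on which instances land in the test set, the size of the filtered test set can change from one seed to the next even though the input data is identical.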

As for the SARS-CoV-2 data: I suppose that the gap in the data amount arises from the antigen variants rather than the antibodies from Cov-AbDab. The sequences of the SARS-CoV-2 variants were collected from the National Center for Biotechnology Information. The curated dataset includes the wild type and the Alpha, Beta, Gamma, Delta, and Omicron variants. For each variant, the subvariant sequences differ across sources. Therefore, we randomly took 5 sequences for each variant except Omicron, for which we took all 11 sequences.
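
The effect of this sampling on the dataset size can be worked through with a short calculation. The sketch below counts the antigen sequences described above and divides the total of 124,560 quoted earlier in this thread by that count. That every antibody record is paired with every antigen sequence is an assumption made here for illustration; it is not confirmed in this thread.

```python
# Sequences taken per variant, as described in the reply above.
SEQUENCES_PER_VARIANT = {
    "wild type": 5,
    "Alpha": 5,
    "Beta": 5,
    "Gamma": 5,
    "Delta": 5,
    "Omicron": 11,  # all available Omicron sequences were kept
}

n_antigen_sequences = sum(SEQUENCES_PER_VARIANT.values())

# Total number of instances reported for the paper earlier in this thread.
paper_total = 124_560

# Assumption: full cross-pairing of antibody records with antigen sequences.
pairs_per_sequence, remainder = divmod(paper_total, n_antigen_sequences)

print(f"antigen sequences: {n_antigen_sequences}")
print(f"instances per antigen sequence: {pairs_per_sequence} (remainder {remainder})")
```

Under that assumption the total divides evenly, which is consistent with the explanation that the size gap comes from the antigen side rather than from the number of Cov-AbDab antibodies.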

Once again, I sincerely appreciate your kind concern and apologize for my late replies. I hope that the above explanations can help you understand the data better. In the right part of page 9 of the paper, you might also kindly find similar descriptions. If you have other questions, you can easily reach me at stzhangjie@outlook.com. I will answer them as promptly and thoroughly as I can. Finally, thank you for your kind understanding of my difficult situation (as aforementioned).

Best regards,

enai4bio / DeepAAI

Dataset availability #4