Questions about inconsistencies between the paper and the released data

Sanofi-Public / CodonBERT

Repository for mRNA Paper and CodonBERT publication.

Thank you for integrating and open-sourcing the benchmark dataset. I noticed some inconsistencies between the statistics in the paper and the released data in benchmarks/CodonBERT/data. Here are the confusing parts:

For the MLOS flu vaccine data, Table 1 in the paper reports 543 mRNA samples, but I found only 167 samples in the released data.
For the SARS-CoV-2 vaccine degradation data, Table 1 in the paper reports 2400 mRNA samples, but I found only 233 samples in the released data.

Could you kindly clarify these discrepancies?

BTW, I noticed that some of the datasets are very small. With a 0.7/0.15/0.15 split on such a small dataset, the test set is tiny (15% of 167 samples is only about 25), so metrics like correlation computed on it are not reliable. It would be better to use k-fold cross-validation.
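To make the k-fold suggestion concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the helper names (`kfold_indices`, `cross_val_correlation`, `fit_line`) are hypothetical, `fit_line` is a toy one-feature regressor standing in for the real model, and Pearson correlation is used only as an example metric, not necessarily the one used in the paper. The idea is to collect out-of-fold predictions for every sample and compute the correlation once over all of them, rather than over a single small test split.

```python
import random
from typing import Callable, List, Sequence, Tuple


def kfold_indices(n: int, k: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle indices 0..n-1 and return k (train, test) index pairs."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [idx[i::k] for i in range(k)]
    return [
        ([j for m, fold in enumerate(folds) if m != i for j in fold], folds[i])
        for i in range(k)
    ]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Callable[[float], float]:
    """Toy placeholder model: least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return lambda x: slope * x + (my - slope * mx)


def cross_val_correlation(xs, ys, fit, k: int = 5, seed: int = 0) -> float:
    """Correlation between labels and pooled out-of-fold predictions."""
    preds = [0.0] * len(xs)
    for train, test in kfold_indices(len(xs), k, seed):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            # Each sample is predicted by a model that never saw it.
            preds[i] = model(xs[i])
    return pearson(preds, ys)
```

With this setup every sample contributes to the reported metric exactly once, which matters when a 15% test split would leave only a few dozen samples. Repeating the procedure with several values of `seed` would also give a rough sense of the variance.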

Questions about inconsistencies between the paper and the released data #3