Sanofi-Public / CodonBERT

Repository for mRNA Paper and CodonBERT publication.

Questions about inconsistencies between the paper and the released data #3

Open LittletreeZou opened 8 months ago

LittletreeZou commented 8 months ago

Thank you for integrating and open-sourcing the benchmark dataset. I noticed some inconsistencies between the statistics in the paper and the released data in benchmarks/CodonBERT/data. Here are the confusing parts:

Could you kindly clarify them?

BTW, I noticed that some of the datasets are very small. With a 0.7/0.15/0.15 split on such a small dataset, metrics like correlation are computed on only a handful of test samples, so the results are not reliable. It would be better to use k-fold cross-validation.
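To make the suggestion concrete, here is a minimal sketch of k-fold evaluation that reports the mean and spread of a correlation metric across folds. Everything in it is an assumption for illustration: the data is synthetic, the one-feature least-squares fit stands in for whatever model is being benchmarked, Pearson correlation is used for simplicity, and the helper names (`kfold_indices`, `cross_validated_correlation`) are hypothetical, not part of the CodonBERT code.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def kfold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (placeholder model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / vx if vx else 0.0
    return a, my - a * mx


def cross_validated_correlation(xs, ys, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, repeat for each fold."""
    folds = kfold_indices(len(xs), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        a, b = fit_line([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        preds = [a * xs[j] + b for j in test_idx]
        scores.append(pearson(preds, [ys[j] for j in test_idx]))
    mean = sum(scores) / k
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / (k - 1))
    return mean, std, scores


# Synthetic small dataset (size chosen arbitrarily for illustration).
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(167)]
ys = [2 * x + rng.gauss(0, 1) for x in xs]

mean, std, scores = cross_validated_correlation(xs, ys, k=5)
print(f"correlation over 5 folds: mean={mean:.3f}, std={std:.3f}")
```

Reporting the fold-to-fold standard deviation alongside the mean shows how much a single small test split can move the number.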

phil-fradkin commented 4 months ago

To follow up on this, maybe we can restrict the scope of the question to the consistency between the paper and the released datasets.

In downloading the data I found: (Downloaded - Reported)

MLOS: 167 - 543
TC Riboswitches: 355 - 355
CoV Vaccine: 2,400 - 2,400
mRFP Expression: 1,459 - 1,459
Fungal Expression: 7,089 - 7,056
E. Coli Proteins: 6,348 - 6,348
mRNA Stability: 65,356 - 41,123

It would be helpful if the authors clarified the discrepancies in dataset size for the mRNA Stability, Fungal Expression, and MLOS datasets.
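For reference, the comparison can be reproduced with a short sketch like the one below. The counts are the ones listed in this comment; the dictionary layout and the `count_discrepancies` helper are just illustrative names, and the script does not read the repository files itself.

```python
# Counts as listed in this comment: downloaded from the repo vs. reported in the paper.
downloaded = {
    "MLOS": 167,
    "TC Riboswitches": 355,
    "CoV Vaccine": 2400,
    "mRFP Expression": 1459,
    "Fungal Expression": 7089,
    "E. Coli Proteins": 6348,
    "mRNA Stability": 65356,
}
reported = {
    "MLOS": 543,
    "TC Riboswitches": 355,
    "CoV Vaccine": 2400,
    "mRFP Expression": 1459,
    "Fungal Expression": 7056,
    "E. Coli Proteins": 6348,
    "mRNA Stability": 41123,
}


def count_discrepancies(downloaded, reported):
    """Return {dataset: downloaded - reported} for datasets whose counts differ."""
    return {
        name: downloaded[name] - reported[name]
        for name in downloaded
        if downloaded[name] != reported[name]
    }


mismatches = count_discrepancies(downloaded, reported)
for name, diff in mismatches.items():
    print(f"{name}: downloaded {downloaded[name]}, reported {reported[name]} ({diff:+d})")
```

Only three datasets differ: MLOS has fewer sequences than reported, while Fungal Expression and mRNA Stability have more.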

Thanks a lot and congrats on the publication!