Open LittletreeZou opened 8 months ago
To follow up on this, maybe we can restrict the scope of the question to the consistency between the paper and the released datasets.
After downloading the data, I found the following counts:

| Dataset | Downloaded | Reported |
| --- | --- | --- |
| MLOS | 167 | 543 |
| TC Riboswitches | 355 | 355 |
| CoV Vaccine | 2,400 | 2,400 |
| mRFP Expression | 1,459 | 1,459 |
| Fungal Expression | 7,089 | 7,056 |
| E. Coli Proteins | 6,348 | 6,348 |
| mRNA Stability | 65,356 | 41,123 |
It would be helpful if the authors clarified the size discrepancies for the mRNA Stability, Fungal Expression, and MLOS datasets.
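To make the mismatches easy to check, here is a small sketch that compares the two columns. The numbers are copied from the counts above; the names `counts` and `find_mismatches` are just illustrative and are not part of the repository.

```python
# Counts as (downloaded, reported), copied from the comparison above.
counts = {
    "MLOS": (167, 543),
    "TC Riboswitches": (355, 355),
    "CoV Vaccine": (2400, 2400),
    "mRFP Expression": (1459, 1459),
    "Fungal Expression": (7089, 7056),
    "E. Coli Proteins": (6348, 6348),
    "mRNA Stability": (65356, 41123),
}


def find_mismatches(counts):
    """Return {dataset: downloaded - reported} for datasets whose counts differ."""
    return {
        name: downloaded - reported
        for name, (downloaded, reported) in counts.items()
        if downloaded != reported
    }


for name, diff in find_mismatches(counts).items():
    print(f"{name}: downloaded - reported = {diff:+d}")
```

A positive difference means the download contains more rows than the paper reports; a negative one means it contains fewer.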
Thanks a lot and congrats on the publication!
Thank you for integrating and open-sourcing the benchmark datasets. I noticed some inconsistencies between the statistics in the paper and the released data in `benchmarks/CodonBERT/data`.
Here are the confusing parts:
Could you kindly clarify them?
BTW, I noticed that some of the datasets are very small. With a 0.7/0.15/0.15 split on such small datasets, metrics like correlation computed on the held-out set are not reliable. It would be better to use k-fold cross-validation.
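To illustrate the suggestion, below is a minimal, standard-library-only sketch of k-fold evaluation that reports the mean and spread of a correlation metric across folds. Everything in it is an assumption made for illustration: the function names are hypothetical, the data is synthetic (167 samples, matching the downloaded MLOS count), a one-feature least-squares line stands in for the real model, and Pearson correlation stands in for whichever correlation metric the benchmark actually uses.

```python
import random
from statistics import mean, stdev


def kfold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k folds of nearly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept (placeholder model)."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def cross_validate(xs, ys, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, repeat for every fold."""
    folds = kfold_indices(len(xs), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        slope, intercept = fit_line(
            [xs[j] for j in train_idx], [ys[j] for j in train_idx]
        )
        preds = [slope * xs[j] + intercept for j in test_idx]
        scores.append(pearson(preds, [ys[j] for j in test_idx]))
    return mean(scores), stdev(scores)


# Synthetic stand-in data: a noisy linear relationship.
rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(167)]
ys = [2 * x + rng.gauss(0, 1) for x in xs]

avg, spread = cross_validate(xs, ys, k=5)
print(f"correlation over 5 folds: {avg:.3f} +/- {spread:.3f}")
```

The point is that every sample is used for testing exactly once, and the spread across folds shows how unstable a single 15% test split would be on a dataset of this size.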