fake-news-detector / api

API for saving news flagging by the users
https://fake-news-detector-api.herokuapp.com/

Add timestamp parameter for reproducibility in /links endpoint #24

Closed: ffmmjj closed this issue 6 years ago

ffmmjj commented 6 years ago

Right now, every time Robinho's models are trained, the training scripts query the links/all endpoint to retrieve all the news examples currently stored in this API and train on that data. As a result, re-running an earlier Robinho job to regenerate a previous version of the predictive model trains on a different dataset from the one that version originally used, which prevents reproducibility.

One way to ensure that the same dataset is used across re-executions of a single build is to have the /links endpoint accept an optional parameter such as until=2017-10-20T17:21:31Z that specifies which version of the data should be returned (in this case, all the entries added up to the given datetime). The Robinho job would then pass this parameter in its query so that it always trains on the same dataset within a specific job.
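A minimal sketch of the filtering this proposal implies, assuming each stored link carries a creation timestamp. The field name created_at, the helper names, the sample entries, and the choice to treat the cutoff as inclusive are all assumptions made for illustration; none of them come from the actual API.

```python
from datetime import datetime, timezone


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2017-10-20T17:21:31Z into an aware datetime."""
    # Older Python versions' fromisoformat() reject a trailing "Z", so normalize it first.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00")).astimezone(timezone.utc)


def links_until(links, until=None):
    """Return the links added at or before `until`; return every link if it is omitted."""
    if until is None:
        return list(links)
    cutoff = parse_utc(until)
    return [link for link in links if parse_utc(link["created_at"]) <= cutoff]


# Hypothetical stored entries; the "created_at" field name is an assumption.
LINKS = [
    {"url": "http://example.com/a", "created_at": "2017-10-19T08:00:00Z"},
    {"url": "http://example.com/b", "created_at": "2017-10-20T17:21:31Z"},
    {"url": "http://example.com/c", "created_at": "2017-10-21T09:30:00Z"},
]

snapshot = links_until(LINKS, until="2017-10-20T17:21:31Z")
print([link["url"] for link in snapshot])
```

With a fixed until value, entries added later (the third one here) never leak into an earlier build's dataset. Entries that are edited or deleted afterwards would still change the result, so this only helps if the stored data is append-only.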

rogeriochaves commented 6 years ago

Well, right now the data is publicly available anyway. What about taking a snapshot for running the tests and committing it? That would be easier than adding a timestamp to every database entry and passing it to all the queries too.

rogeriochaves commented 6 years ago

I've done that in https://github.com/fake-news-detector/robinho/pull/8. I also added random_state to train_test_split, so now we have full reproducibility! Every time I run the tests I get the same results 💯

Thank you very much for focusing on this, I didn't even think it was possible.
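To illustrate why fixing the random seed makes the split repeatable, here is a standard-library stand-in for a seeded train/test split. It is not scikit-learn's train_test_split and not the code from the pull request; the function name seeded_split, the default seed, and the sample rows are assumptions.

```python
import random


def seeded_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of `rows` with a fixed seed, then split it into (train, test)."""
    shuffled = list(rows)
    # A private Random instance keeps the result independent of global RNG state.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))  # stand-in for the committed data snapshot
first = seeded_split(rows)
second = seeded_split(rows)
print(first == second)  # the same seed and the same input give the same split
```

Both ingredients matter: the seed fixes the order of the shuffle, and the committed snapshot fixes the input being shuffled.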