Open MJedr opened 1 year ago
Also, we might consider dataset versioning systems to keep track of changes.
Today, we worked on filtering and ran a test with embeddings of the documents.
For each document in the dataset, we now have the 5 most similar docs in a spreadsheet so curators can have a look and validate the results.
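To make the "5 most similar docs" step concrete, here is a minimal sketch of how such a list could be computed from document embeddings. It assumes cosine similarity as the metric and a plain dict mapping document IDs to vectors; neither is stated in this thread, and the names `cosine` and `top_k_similar` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(embeddings, k=5):
    """Map each doc ID to its k most similar other docs as (doc_id, score) pairs.

    `embeddings` is assumed to be a dict: doc_id -> list of floats.
    """
    result = {}
    for doc_id, vec in embeddings.items():
        scored = [
            (other_id, cosine(vec, other_vec))
            for other_id, other_vec in embeddings.items()
            if other_id != doc_id  # a document is not its own neighbour
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[doc_id] = scored[:k]
    return result
```

This brute-force version compares every pair, so it is only practical for a small test set; for the full dataset a vectorised or approximate nearest-neighbour search would likely be needed.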
For our MVP, we need to prepare a dataset. The dataset should be composed of 50k documents selected based on `inspire_categories` and `date_added` (stratified sample). The documents should contain a title and abstract (concatenated as `<|title|> title. <|abstract|> abstract`). We can start with stripping MathML from both title and abstract for cleaning. The dataset should be saved in a binary format and uploaded to S3.
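The preparation steps above could be sketched roughly as follows. This is an illustration under assumptions, not a settled design: it assumes each record is a dict in which `inspire_categories` is a list of strings (the first entry is used as the stratum) and `date_added` is an ISO date string (bucketed by year), and it assumes "stripping MathML" means dropping whole `<math>...</math>` elements. The helper names are hypothetical.

```python
import random
import re
from collections import defaultdict

# Assumption: MathML appears as <math>...</math> elements inside the text.
MATHML_RE = re.compile(r"<math\b.*?</math>", re.DOTALL | re.IGNORECASE)


def strip_mathml(text):
    """Drop <math> elements and collapse the leftover whitespace."""
    without_math = MATHML_RE.sub(" ", text)
    return " ".join(without_math.split())


def build_text(title, abstract):
    """Concatenate cleaned title and abstract in the format from the issue."""
    return f"<|title|> {strip_mathml(title)}. <|abstract|> {strip_mathml(abstract)}"


def stratified_sample(records, n_total, seed=0):
    """Sample about n_total records, proportionally per (category, year) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        # Assumed stratum key: first category plus the year of date_added.
        key = (rec["inspire_categories"][0], rec["date_added"][:4])
        strata[key].append(rec)
    sample = []
    for key in sorted(strata):
        group = strata[key]
        # Proportional allocation; rounding means the total can differ slightly.
        share = max(1, round(n_total * len(group) / len(records)))
        sample.extend(rng.sample(group, min(share, len(group))))
    return sample
```

Saving and uploading are left out here. One option would be to write the sampled records to a columnar binary format such as Parquet and upload the file to S3, but the exact format and bucket are not decided in this issue.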