cern-sis / issues

Create a dataset #86

Open MJedr opened 1 year ago

MJedr commented 1 year ago

For our MVP, we need to prepare a dataset. The dataset should consist of 50k documents, drawn as a stratified sample over inspire_categories and date_added. Each document should contain a title and an abstract, concatenated as <|title|> title. <|abstract|> abstract. For cleaning, we can start by stripping MathML from both the title and the abstract. The dataset should be saved in a binary format and uploaded to S3.
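A minimal sketch of the cleaning, formatting, and sampling steps described above, using only the Python standard library. The function names, the regex-based MathML removal, and the record layout (a list of dicts reduced to one category and one year each) are assumptions made for illustration, not the project's actual code. Proportional allocation is one possible reading of "stratified sample"; rounding means the result size can differ slightly from the target.

```python
import random
import re
from collections import defaultdict

# Matches a whole <math>...</math> element, with or without a namespace
# prefix such as <mml:math>. A regex is a rough first pass, not a full XML parser.
MATHML_RE = re.compile(
    r"<(?:\w+:)?math\b.*?</(?:\w+:)?math>", re.DOTALL | re.IGNORECASE
)


def strip_mathml(text):
    """Remove MathML elements and collapse the leftover whitespace."""
    without_math = MATHML_RE.sub(" ", text)
    return " ".join(without_math.split())


def format_document(title, abstract):
    """Concatenate the cleaned title and abstract in the format from the issue."""
    return f"<|title|> {strip_mathml(title)}. <|abstract|> {strip_mathml(abstract)}"


def stratified_sample(records, key, total, seed=0):
    """Sample roughly `total` records, proportionally from each stratum.

    `key` maps a record to its stratum, e.g. a (category, year) tuple.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)

    sample = []
    for group in strata.values():
        # Proportional allocation, with at least one record per stratum.
        k = max(1, round(total * len(group) / len(records)))
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample
```

For example, `stratified_sample(records, key=lambda r: (r["category"], r["year"]), total=50_000)` would stratify on both fields at once, assuming each record has already been reduced to a single category and the year of date_added.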

MJedr commented 1 year ago

Also, we might consider dataset versioning systems to keep track of changes.

ParthS007 commented 1 year ago

Today, we worked on filtering and on testing with embeddings of the documents.

We have the 5 most similar docs for each document in the dataset in a spreadsheet, so curators can have a look and validate the results.
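As a rough illustration of how a "5 most similar docs" list can be produced from embeddings, here is a minimal sketch based on cosine similarity. The thread does not say which embedding model or similarity measure was used, so cosine similarity, the function names, and the `{doc_id: vector}` input layout are all assumptions. The brute-force loop compares every pair of documents, so for 50k documents a vectorized or approximate nearest-neighbour approach would likely be needed instead.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_similar(embeddings, k=5):
    """Map each document id to the ids of its k most similar documents.

    `embeddings` is a dict of {doc_id: vector}; a document is never
    listed as similar to itself.
    """
    result = {}
    for doc_id, vec in embeddings.items():
        scored = [
            (cosine(vec, other_vec), other_id)
            for other_id, other_vec in embeddings.items()
            if other_id != doc_id
        ]
        # Highest similarity first.
        scored.sort(reverse=True)
        result[doc_id] = [other_id for _, other_id in scored[:k]]
    return result
```

Each entry of the returned dict corresponds to one spreadsheet row: a document id followed by its nearest neighbours.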