asreview / synergy-dataset

SYNERGY - Open machine learning dataset on study selection in systematic reviews
Creative Commons Zero v1.0 Universal

Add Valk (2021) dataset #85

Closed: gimoAI closed this 2 years ago

gimoAI commented 2 years ago

Add the Valk (2021) dataset and the code for processing it.

J535D165 commented 2 years ago

I think this is cleaner and more extensible: no for loops, no asreview dependency, and less code.

import pandas as pd

df = pd.read_excel("https://osf.io/download/gmjcv/", usecols=['DOI', 'Included_fulltext'])

# adjust columns: keep only the DOI itself (escape the dot so "10." matches literally)
df["DOI"] = df["DOI"].str.extract(r"(10\.\S+)")
df['id_type'] = 'doi'

# rename columns
df.rename({
    'Included_fulltext': 'label_included',
    'DOI': 'id'
}, axis=1, inplace=True)

# drop missing ids
df.dropna(subset=["id"], inplace=True)

# export
df.to_csv("Valk_2021_ids.csv", columns=['id', 'id_type', 'label_included'], index=False)
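
To see what the extraction and drop steps do without downloading the spreadsheet, here is a minimal sketch on a small in-memory frame. The sample values are made up for illustration and are not taken from the Valk data; `expand=False` is used so that `str.extract` returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical sample rows, not taken from the Valk spreadsheet
sample = pd.DataFrame({
    "DOI": ["https://doi.org/10.1000/abc123", "doi: 10.1234/xyz", "n/a", None],
    "Included_fulltext": [1, 0, 0, 1],
})

# Keep the part that starts at a literal "10." and runs to the next whitespace;
# rows without a match become NaN
sample["DOI"] = sample["DOI"].str.extract(r"(10\.\S+)", expand=False)

# Drop the rows where no DOI could be extracted
clean = sample.dropna(subset=["DOI"])
```

Rows whose DOI cell holds a URL or a prefixed string are reduced to the bare DOI, and rows without a recognisable DOI are dropped.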

gimoAI commented 2 years ago

Nice. Shouldn't inplace be avoided, though?

J535D165 commented 2 years ago

I don't think you have to avoid inplace here. With ASReview data objects, though, it can have side effects.
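
As a rough illustration of the kind of side effect meant here, the sketch below uses plain pandas only. It is not ASReview code, and the helper names `rename_inplace` and `rename_copy` are hypothetical. The point is that `inplace=True` inside a function changes the caller's DataFrame, while the non-inplace form returns a new one.

```python
import pandas as pd


def rename_inplace(df):
    # Hypothetical helper: mutates the DataFrame that was passed in
    df.rename(columns={"DOI": "id"}, inplace=True)
    return df


def rename_copy(df):
    # Hypothetical helper: returns a renamed copy, the input is untouched
    return df.rename(columns={"DOI": "id"})


original = pd.DataFrame({"DOI": ["10.1000/abc123"]})

renamed = rename_copy(original)
cols_after_copy = list(original.columns)      # caller's frame still has "DOI"

rename_inplace(original)
cols_after_inplace = list(original.columns)   # caller's frame now has "id"
```

In a standalone script like the one above, where the DataFrame is created and consumed in the same place, this makes no practical difference.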

J535D165 commented 2 years ago

Nice!