explosion / healthsea

Healthsea is a spaCy pipeline for analyzing user reviews of supplementary products for their effects on health.
MIT License

Minor change in the code #7

Closed: shrinidhin closed this issue 2 years ago

shrinidhin commented 2 years ago

Hi! I noticed that in the following code in the preprocess_clausecat.py file (line 61), in the for loop that splits the dataset into train and dev sets,

for label in label_dict:
    split = int(len(label_dict[label]) * eval_split)
    train += label_dict[label][split:]
    dev += label_dict[label][:split]
    checksum += len(label_dict[label])
    table_data.append(
        (
            label,
            len(label_dict[label]),
            len(label_dict[label][split:]),
            len(label_dict[label][:split]),
        )
    )

the train and dev assignment statements need to be interchanged. With the existing assignment, the train set has fewer samples than the dev set. Shouldn't it be the other way round? Something like this?

train += label_dict[label][:split]
dev += label_dict[label][split:]

thomashacker commented 2 years ago

Ah yes, I see why it can be a little confusing but I think the code seems right. Let's have a look at a small example:

split = int(len(label_dict[label]) * eval_split)

Let's say len(label_dict[label]) = 100 and eval_split = 0.2 (20%)

Then we'd get split = 20

For dev += label_dict[label][:20] we would get the first 20 elements (0->19).

For train += label_dict[label][20:] we would get everything after the first 20 elements (20->len(label_dict[label])-1).

So this way we'd end up with a train (80%) and dev (20%) split.
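
The arithmetic in this example can be checked with a minimal sketch. It assumes a stand-in list of 100 items in place of label_dict[label]; the `items` list and its contents are hypothetical and not taken from the repo.

```python
# Stand-in for label_dict[label]; hypothetical data, not from the repo.
items = list(range(100))
eval_split = 0.2  # fraction of samples that goes to the dev set

split = int(len(items) * eval_split)  # 20

dev = items[:split]    # first 20 elements (indices 0..19)
train = items[split:]  # everything after the first 20 (indices 20..99)

print(split, len(train), len(dev))  # 20 80 20
```

The slice before `split` is the smaller part, which is why it is assigned to dev.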

shrinidhin commented 2 years ago

Okay. So eval_split should be the fraction of the data that goes to the dev set, right? Meaning that if I want a 70:30 train/dev split, I need to give a value of 0.3 for eval_split. Thank you!
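
A rough sketch of that reading, assuming a hypothetical helper `split_by_label` that mirrors the loop from the issue and a made-up `label_dict` with labels "A" and "B" (neither is from the repo). Because the split is computed per label and `int()` truncates, the overall ratio is only approximately 70:30 when a label's count is not evenly divisible.

```python
def split_by_label(label_dict, eval_split):
    """Hypothetical helper mirroring the loop in the issue: the first
    `eval_split` fraction of each label's samples goes to dev."""
    train, dev = [], []
    for label in label_dict:
        # int() truncates, so dev gets the rounded-down share per label.
        split = int(len(label_dict[label]) * eval_split)
        dev += label_dict[label][:split]
        train += label_dict[label][split:]
    return train, dev


# Made-up data: 10 samples for label "A", 7 samples for label "B".
label_dict = {
    "A": [f"a{i}" for i in range(10)],
    "B": [f"b{i}" for i in range(7)],
}

train, dev = split_by_label(label_dict, eval_split=0.3)

# "A": int(10 * 0.3) = 3 dev, 7 train; "B": int(7 * 0.3) = 2 dev, 5 train
print(len(train), len(dev))  # 12 5
```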