COST-ELTeC / WG2-Sample

Duplicate paragraphs in slv sample #1

Open TomazErjavec opened 5 years ago

TomazErjavec commented 5 years ago

In slv/SL-WIKI00024_sample.xml I noticed that several paragraphs are repeated, e.g. p[@n="SL-WIKI000241093"]. This is probably not intentional?

mikekestemont commented 5 years ago

Good catch. Pinging @lb42: is this because paragraph n-attributes are still unique when combined with the sample-level n-attribute?

lb42 commented 5 years ago

This is a consequence of the way the random sampling is done: please refer to the readme for details. The algorithm starts by choosing a paragraph at random from the whole text, and then adds more paragraphs till it has enough words. Every paragraph has an equal chance of getting selected as the first one each time and it's therefore entirely possible for paragraphs to get included more than once. Would you expect the second and subsequent random selection to be made only from paragraphs not previously selected? That would change the probability of selection each time (as there would be fewer and fewer candidate paragraphs) and introduce serious discontinuity headaches.
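The procedure described above (later clarified as taking a run of consecutive paragraphs from each random starting point) can be sketched roughly as follows. This is a hypothetical illustration in Python, not the actual script; the function name and parameters are invented, and paragraphs are plain strings:

```python
import random

def sample_runs(paragraphs, n_starts=4, words_per_run=1000):
    """Sketch of the sampling as described: pick n_starts random
    starting positions and take consecutive paragraphs from each
    until the word budget is met.  Nothing prevents two runs from
    overlapping, so the same paragraph can appear more than once."""
    sample = []
    for _ in range(n_starts):
        i = random.randrange(len(paragraphs))  # fresh random start each round
        words = 0
        while words < words_per_run and i < len(paragraphs):
            sample.append(paragraphs[i])
            words += len(paragraphs[i].split())
            i += 1
    return sample
```

Because every round draws its starting position from the full text, runs from different rounds can cover the same stretch, which is exactly how the duplicates arise.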

TomazErjavec commented 5 years ago

Would you expect the second and subsequent random selection to be made only from paragraphs not previously selected?

Absolutely!

That would change the probability of selection each time (as there would be fewer and fewer candidate paragraphs) and introduce serious discontinuity headaches.

I don't want to get into a deep statistical discussion (I'm not capable of it, anyway), but I think "simple random sampling" does exactly that: you randomly take the first instance, do not return it to the pool, then randomly take the second one from what remains, and so on. Like when conducting polls, you don't call the same person twice.
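For illustration, sampling without replacement as suggested here could be sketched like this (hypothetical Python, not any existing script in the repo):

```python
import random

def sample_without_replacement(paragraphs, word_target):
    """Simple random sampling without replacement: shuffle the pool
    once, then take paragraphs until the word target is reached.
    No paragraph can be picked twice."""
    pool = list(paragraphs)
    random.shuffle(pool)
    chosen, words = [], 0
    for p in pool:
        if words >= word_target:
            break
        chosen.append(p)
        words += len(p.split())
    return chosen
```

Note this trades away the "consecutive paragraphs from a starting point" property, which is the discontinuity concern raised above.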

On a practical level: it seems we are supposed to annotate NEs in the samples, so people will be annotating the same paragraphs twice, which is a waste of time. And an NER tool will get the same paragraphs twice for training, instead of getting different ones, so it will learn a less general model.

lb42 commented 5 years ago

Your interpretation is entirely valid, but it isn't the way I interpreted the instructions, such as they were. The situation is a little complicated because what we are actually getting is a sequence of paragraphs starting from a randomly selected point, which means overlap is slightly more likely, I think. But I am sure you wouldn't have wanted an entirely randomised selection of paragraphs! On a practical level, I see two strategies: one is to leave things as they are. Repetitions happen in language all the time! It's no big deal. The other is to just cut out the repeated paragraphs and live with a slightly smaller sample. (I think Diana chose this course for the Portuguese samples).

TomazErjavec commented 5 years ago

Repetitions happen in language all the time!

Sure, individual sentences. But here we have whole paragraphs, with substantial repetition. slv/SL-WIKI00024_sample.xml has 35 paragraphs, of which 3 are repeated - that's a lot!

just cut out the repeated paragraphs and live with a slightly smaller sample

As no third option was mentioned, I'd go with this one. So, do you mean cutting them out in this repository? Is a pull request OK? I guess I can't push.

lb42 commented 5 years ago

You can push if you're a member of the contributors team. Try it!

mikekestemont commented 5 years ago

I would agree that removing these duplicates would make the most sense at this time, although we could also just repeat the sampling procedure, but without replacement. Tomaž, would you be willing to help with this? I can also help with recoding the sampling procedure. Let me know what you think, as I will have to inform the entire WG if we would change the sample data.

lb42 commented 5 years ago

I am happy to tweak the sampling procedure and rerun it, either for just the affected repos (slv and por), or for the whole shooting match if that's what you want. The tweaking would involve manual intervention though: at present I just get 4 starting positions at a time; the tweak would involve manually checking that the numbers are separated by at least, say, 10, on the assumption that I'd be sure of getting 4000 words within 10 paragraphs. So it wouldn't, in my book, be kosher random. Note also that there's an initial randomization to choose 20 titles from those available, so if I rerun the Portuguese one, we'd get a completely different set of samples. On balance I think it's not worth the effort, but it's not my decision.
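The manual check described here (redrawing starting positions until they are far enough apart) could be automated with something like the following. This is a hedged sketch with invented names; the numbers 4 and 10 come from the comment above:

```python
import random

def well_separated_starts(n_paragraphs, n_starts=4, min_gap=10):
    """Redraw the random starting positions until every pair is at
    least min_gap apart, so consecutive runs cannot overlap
    (assuming no run spans more than min_gap paragraphs)."""
    while True:
        starts = sorted(random.sample(range(n_paragraphs), n_starts))
        if all(b - a >= min_gap for a, b in zip(starts, starts[1:])):
            return starts
```

As noted, rejecting draws this way means the result is no longer a uniform independent sample, which is the "not kosher random" point.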

I also see I forgot to upload the scripts which do the job to the repo: they are there now (in WG2-Sample/Scripts)

dianamsmpsantos commented 5 years ago

Dear all, I have already manually annotated all the samples (for NER), so please do not rerun the sampling procedure from scratch! Either send one more paragraph, or let us just remove the repeated stuff. Diana

mikekestemont commented 5 years ago

Aha, I was afraid that might be the case. Thanks for signalling this @dianamsmpsantos. Let us then just remove the duplicated items!

TomazErjavec commented 5 years ago

Tomaž, would you be willing to help with this?

"This" being now removing of duplicate items I guess: yes, I can write a script to do it.

TomazErjavec commented 5 years ago

Wrote the script (c401d21fb851c60df56297d2b36bc7e46023a16c) and applied it to slv (67b0f4f497b1257f3b67781d0beba588eb807f30). I didn't do it for the other languages; it's probably safer if the authors do that themselves.
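For anyone doing the same clean-up in the other language repos, a minimal deduplication pass might look like this. This is a hypothetical Python sketch, not the committed script; it assumes duplicated paragraphs share the same n attribute:

```python
import xml.etree.ElementTree as ET

def drop_duplicate_paragraphs(in_path, out_path):
    """Keep only the first occurrence of each <p> element, keyed on
    its n attribute; later repeats are removed from their parent."""
    tree = ET.parse(in_path)
    seen = set()
    for parent in tree.getroot().iter():
        for p in list(parent):
            if p.tag.split("}")[-1] != "p":  # ignore any namespace prefix
                continue
            key = p.get("n")
            if key in seen:
                parent.remove(p)             # duplicate: drop it
            elif key is not None:
                seen.add(key)
    tree.write(out_path, encoding="utf-8")
```

Keeping the first occurrence and dropping later ones preserves document order and leaves the sample only slightly smaller, as discussed above.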