apertium / apertium-tat

Apertium linguistic data for Tatar

GNU General Public License v3.0

corpus data in tests-tatcorpus #34

Open jonorthwash opened 5 years ago

jonorthwash commented 5 years ago

It's not clear to me that the corpus data in tests-tatcorpus/ should stay there. Things I worry about:

Licensing of the data: where is it from, and is its license compatible with the rest of the repository?
Size: the data is pretty big, whereas most of the rest of the repo is not.
Relevance: this repo is for the analyser (and tests to maintain it, offshoots, experiments, etc.), not large corpus data.

@mansayk, could you make an argument to justify keeping the tests in the repository?

mansayk commented 5 years ago

Hi!

The tests-tatcorpus directory contains just a list of word forms that I collected from the Corpus of Written Tatar. They are not taken from any dictionary.

This list can be used to: 1) see the effect of code changes; 2) collect words unknown to the analyser and add them to the .lexc file.
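A rough sketch of both uses, assuming the analyser prints Apertium stream format (`^surface/analysis$`, with unknown words marked by a leading `*`). The function names and the tags in the sample strings are made up for illustration, and escaped characters in the stream format are ignored for simplicity.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (simplification: escaped characters inside a unit are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def parse_analyses(analyser_output):
    """Map each surface form to the list of analyses the analyser gave it."""
    parsed = {}
    for surface, analyses in LEXICAL_UNIT.findall(analyser_output):
        parsed[surface] = analyses.split("/")
    return parsed


def unknown_forms(parsed):
    """Surface forms whose every analysis starts with '*' (unknown to the analyser)."""
    return sorted(
        surface
        for surface, analyses in parsed.items()
        if all(a.startswith("*") for a in analyses)
    )


def changed_forms(before, after):
    """Surface forms whose analyses differ between two runs (or that appear in only one)."""
    return sorted(
        surface
        for surface in before.keys() | after.keys()
        if before.get(surface) != after.get(surface)
    )


# Hypothetical output of two analyser runs over the same word list;
# the tags are illustrative, not taken from apertium-tat.
BEFORE = "^китап/китап<n><nom>$ ^барды/*барды$"
AFTER = "^китап/китап<n><nom>$ ^барды/бар<v><iv><past>$"

if __name__ == "__main__":
    print("unknown before:", unknown_forms(parse_analyses(BEFORE)))
    print("changed:", changed_forms(parse_analyses(BEFORE), parse_analyses(AFTER)))
```

`unknown_forms` gives candidates for new .lexc entries, and `changed_forms` flags forms to inspect after a code change.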

I'd like to keep it there so Ilnar also could use it. If you think it is better to remove it from repository, I will do it immediately.

Thank you!

jonorthwash commented 5 years ago

@IlnarSelimcan, @ftyers, what do you two think? I think using something like this for regression testing is good, but I still have the licensing concern (maybe less than originally) and the size concern.

mansayk commented 5 years ago

The size can be halved, because not all of those files are necessary: one of them is just a backup, and another can be generated.

TinoDidriksen commented 5 years ago

This repo is already 360 MiB in size. It's not enough that you delete a file - it's still part of the cloned data. Anything you add is part of the repo's history forever. Those big files should be removed and purged from history with a rewrite.

Of the 145 repos I track, it's in the top 15 size-wise.
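To see which files in the history account for the size, something like the sketch below could be used. It assumes `git` is on the PATH and relies on `git rev-list --objects --all` piped into `git cat-file --batch-check`. The function names and the sample data are illustrative only, and the history rewrite itself is not shown.

```python
import subprocess


def parse_batch_check(output):
    """Parse '<type> <sha> <size> <path>' lines; return (size, path) for blobs, largest first."""
    blobs = []
    for line in output.splitlines():
        parts = line.split(" ", 3)
        if len(parts) == 4 and parts[0] == "blob":
            blobs.append((int(parts[2]), parts[3]))
    return sorted(blobs, reverse=True)


def largest_blobs(repo_path=".", top=10):
    """List the largest blobs reachable from any ref, including deleted files."""
    # Every object in history, one per line: "<sha>" or "<sha> <path>"
    objects = subprocess.run(
        ["git", "-C", repo_path, "rev-list", "--objects", "--all"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Ask git for the type and size of each object; %(rest) echoes the path back
    sizes = subprocess.run(
        ["git", "-C", repo_path, "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        input=objects, capture_output=True, text=True, check=True,
    ).stdout
    return parse_batch_check(sizes)[:top]


# Made-up sample of batch-check output, used only to illustrate the parsing step
SAMPLE = (
    "commit aaa 200 \n"
    "blob bbb 1000 tests/big.txt\n"
    "blob ccc 50 README\n"
    "tree ddd 30 tests"
)

if __name__ == "__main__":
    for size, path in largest_blobs("."):
        print(f"{size:>12}  {path}")
```

The sizes reported are uncompressed object sizes, so they only indicate which paths are worth purging and will not add up to the on-disk size of the clone.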

mansayk commented 5 years ago

OK, I understand. I will remove those files right now; please help me purge them from history.

mansayk commented 5 years ago

I removed the files, but I don't know how to purge them from the repo's history. @TinoDidriksen, could you please help me with that?

jonorthwash commented 5 years ago

@mansayk, which files are you planning on keeping / didn't remove?

TinoDidriksen commented 5 years ago

Repository trimmed - now down to 54 MiB, which is manageable. Everyone will have to re-clone from scratch. I've taken a backup of the repo before doing the trim, just in case.

mansayk commented 5 years ago

@TinoDidriksen thank you so much for your help!

@jonorthwash I will keep those test files locally and use them periodically. If I find any regressions, I will create an issue (or issues) and add some new rules to the existing tests, OK? If you have a better idea, please let me know. Thank you.

IlnarSelimcan commented 5 years ago

I think I've found a better solution for this in 6dbcb196052d84cfbde21ed2d012c384608d243c. It seems to work, but improvements are welcome.

IlnarSelimcan commented 5 years ago

One particular thing that should be done is to split the frequency list into many smaller lists and pass them through tat-morph in parallel (using GNU Parallel or something similar).
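A minimal sketch of the "something similar" option, using Python's standard library instead of GNU Parallel. The helper names are hypothetical, and `ECHO_COMMAND` is only a stand-in that copies its input to its output. The real command would presumably be something like `["apertium", "-d", ".", "tat-morph"]`, which is an assumption about the local setup and not taken from the commit above.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the analyser: echoes stdin to stdout. Replace it with the real
# command, e.g. ["apertium", "-d", ".", "tat-morph"] (assumed invocation).
ECHO_COMMAND = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]


def split_into_chunks(lines, n_chunks):
    """Split lines into at most n_chunks contiguous chunks of roughly equal size."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def run_chunk(chunk, command):
    """Feed one chunk to the command on stdin and return its output lines."""
    result = subprocess.run(
        command,
        input="\n".join(chunk) + "\n",
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def analyse_in_parallel(lines, command, n_chunks=4):
    """Run the command over all chunks concurrently, keeping the input order."""
    chunks = split_into_chunks(lines, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # pool.map yields results in the order of the input chunks
        outputs = list(pool.map(lambda chunk: run_chunk(chunk, command), chunks))
    return [line for output in outputs for line in output]


if __name__ == "__main__":
    words = ["китап", "барды", "әйбәт", "мәктәп", "су"]
    print(analyse_in_parallel(words, ECHO_COMMAND, n_chunks=2))
```

Threads are enough here because the work happens in the child processes. Each chunk starts its own analyser process, so chunks should be large enough to outweigh the start-up cost.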