Open jonorthwash opened 5 years ago
Hi!
The tests-tatcorpus directory contains just a list of word forms that I collected from the Corpus of Written Tatar. They are not taken from any dictionary.
This list can be used to: 1) see the effect of code changes; 2) collect words unknown to the analyser and add them to the .lexc file.
I'd like to keep it there so that Ilnar can use it too. If you think it is better to remove it from the repository, I will do so immediately.
Thank you!
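For the second use mentioned above (collecting words unknown to the analyser), here is a minimal sketch. It assumes the analyser output is in the Apertium stream format, where an unknown form appears as `^form/*form$`. The function name and the sample tags are made up for illustration, and escaped characters in the stream are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (simplified: backslash-escaped characters are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/^$]+)/([^$]*)\$")


def unknown_forms(analysed_text):
    """Return surface forms whose analysis starts with '*' (marked unknown)."""
    unknown = []
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        if analyses.startswith("*"):
            unknown.append(surface)
    return unknown


# Illustrative input only; the tags are not taken from the real analyser.
sample = "^китап/китап<n><nom>$ ^фывапр/*фывапр$"
print(unknown_forms(sample))
```

The returned forms are candidates to review by hand before adding anything to the .lexc file.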
@IlnarSelimcan, @ftyers, what do you two think? I think using something like this for regression testing is good, but I still have the licensing concern (maybe less than originally) and the size concern.
The size can be halved, because not all of those files are necessary: one of them is just a backup, and another can be generated.
This repo is already 360 MiB in size. It's not enough that you delete a file - it's still part of the cloned data. Anything you add is part of the repo's history forever. Those big files should be removed and purged from history with a rewrite.
Of the 145 repos I track, it's in the top 15 size-wise.
OK, I understand. I will remove those files right now; please help me purge them from the history.
I removed the files, but I don't know how to purge them from the repo's history. @TinoDidriksen, could you please help me with that?
@mansayk, which files are you planning on keeping / didn't remove?
Repository trimmed - now down to 54 MiB, which is manageable. Everyone will have to re-clone from scratch. I've taken a backup of the repo before doing the trim, just in case.
@TinoDidriksen thank you so much for your help!
@jonorthwash I will keep those test files locally and use them periodically. If I find any regressions, I will create issues and add some new rules to the existing tests, OK? If you have a better idea, please let me know. Thank you.
I think I've found a better solution for this in 6dbcb196052d84cfbde21ed2d012c384608d243c . It seems to work, but improvements are welcome.
One particular thing that should be done is to split the frequency list into several parts and pass them through tat-morph in parallel (using GNU Parallel or a similar tool).
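As a rough sketch of that idea in Python rather than GNU Parallel: split the list into chunks, feed each chunk to the analyser on stdin, and join the outputs in input order. The function names are hypothetical, and the actual command used to run tat-morph is an assumption that has to be adapted to the local setup.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, n_chunks):
    """Split lines into at most n_chunks contiguous, nearly equal parts."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def run_chunk(command, chunk):
    """Feed one chunk to the command on stdin and return its stdout."""
    result = subprocess.run(
        command,
        input="\n".join(chunk) + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise if the analyser exits with an error
    )
    return result.stdout


def analyse_in_parallel(command, lines, n_chunks=4):
    """Run the command over all chunks concurrently, keeping input order."""
    chunks = split_into_chunks(lines, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        return "".join(pool.map(lambda c: run_chunk(command, c), chunks))


# Hypothetical usage (the command line is an assumption, not from the repo):
# output = analyse_in_parallel(["apertium", "-d", ".", "tat-morph"], words)
```

Threads are enough here because the work happens in the child processes; each chunk gets its own analyser process, so start-up cost is paid once per chunk rather than once per word.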
It's not clear to me that the corpus data in tests-tatcorpus/ should stay there. Things I worry about:
@mansayk, could you make an argument to justify keeping the tests in the repository?