ayushjainrksh closed this 5 years ago
Haven't integrated it into make yet, but I am going to do that soon.
I have added TSV sources for all files. Any idea why it fails for python:nightly?
Probably just a flaky failure.
Do I need to worry about that, or should I continue writing the makefile for the automatic build?
Just keep going.
Okay, I will rename it. langNames.db is not a dependency. Why should we pass it in command line arguments?
Then it doesn't have to be hardcoded here: https://github.com/apertium/apertium-apy/pull/131/files#diff-eb0c3cfbbb1dab4de6647380dec07fe0R22.
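One way to avoid hardcoding the path, as suggested, would be a command-line flag with a sensible default. A minimal sketch in Python; the flag name and default are assumptions, not the actual apertium-apy interface:

```python
# Hedged sketch: accept the database path as an optional CLI argument
# instead of hardcoding it. `--db-path` and its default are illustrative.
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description='Build langNames.db from TSV sources.')
    parser.add_argument('--db-path', default='langNames.db',
                        help='where to write the generated database')
    return parser.parse_args(argv)
```

Callers that do not care can omit the flag and get the default, so nothing downstream needs the path baked in.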
what do I need to change here?
Would that suffice?
Also, we should force sorting. This should be possible by running each file through sort and verifying the diff is empty. If not, fail the build.
I have made the necessary changes except the issue with indexes. The indexes are sorted but uneven. That means many of the indexes are missing. Check this out.
No one cares about the id in this case. Also, the word index means something completely different in database nomenclature.
This diff still doesn't handle the scripts that scrape SIL and CLDR. They need to be modified to output a TSV.
This diff still doesn't handle the scripts that scrape SIL and CLDR. They need to be modified to output a TSV.
I am working on this.
I have made all the changes you suggested and added scraped data from SIL and CLDR files to tsv format. I have also fixed a bug in scraper.py . Please check if there is anything else that needs to be changed.
Also, please update the README. It currently references the SQL files.
Sure
I have pushed some changes. Please give a look.
I have fixed the issues and number of lines changed also looks fine now.
Everything seems fine in terms of number of lines except for scraped-cldr.
v.s.
e.g. the Telugu language names seem missing in the newer version. Any thoughts?
I have scraped this data using scraper.tsv. These were the changes I was talking about (the reason I used append initially). There are some fails when I run scraper.py . Some of the pairs are not fetched and give error
Hmm, okay. I can try looking into it if you don't see the issue.
Regardless, appending is not the solution.
Failed to retrieve language kaa, exception: Error reading file 'http://www.unicode.org/repos/cldr/tags/latest/common/main/kaa.xml': failed to load HTTP resource
There are more such fails
That particular one is expected. te.xml
should have worked though.
See http://www.unicode.org/repos/cldr/tags/latest/common/main/ for what should work.
I think its not included
sushain: ayushjain[m]: what do you mean it isn’t included? sushain: ayushjain[m]: do you mean it’s erroring? if so, why? are we sending the requests too fast?
html_tools_languages = set(map(to_alpha2_code, {
'arg', 'heb', 'cat', 'sme', 'deu', 'eng', 'eus', 'fra', 'spa', 'ava', 'nno',
'nob', 'oci', 'por', 'kaz', 'kaa', 'kir', 'ron', 'rus', 'fin', 'tat', 'tur',
'uig', 'uzb', 'zho', 'srd', 'swe',
}))
These are the ones included. Does any of these stand for te.xml ?
Found 30 apertium languages: eu, an, kaa, av, ky, pt, fr, nn, se, hr, ru, fi, bs, he, tr, tt, es, en, zh, sc, sr, sv, ug, ro, de, kk, uz, nb, oc, ca.
te
doesn't show up here
Failed to retrieve language an, exception: Error reading file 'http://www.unicode.org/repos/cldr/tags/latest/common/main/an.xml': failed to load HTTP resource
Failed to retrieve language kaa, exception: Error reading file 'http://www.unicode.org/repos/cldr/tags/latest/common/main/kaa.xml': failed to load HTTP resource
Failed to retrieve language av, exception: Error reading file 'http://www.unicode.org/repos/cldr/tags/latest/common/main/av.xml': failed to load HTTP resource
Failed to retrieve language sc, exception: Error reading file 'http://www.unicode.org/repos/cldr/tags/latest/common/main/sc.xml': failed to load HTTP resource
Failed to retrieve language oc, exception: Error reading file 'http://www.unicode.org/repos/cldr/tags/latest/common/main/oc.xml': failed to load HTTP resource
These are the errors
All the files are now sorted
Hmm, there are some issues with the way the Apertium languages are fetched. I will take of correcting those after your changes are complete.
I have updated scraper files to use DictWriter and sort the files.
Added TSV source files to be converted at the build time by manual.py