Fix for issue #115 to add TSV source files

ayushjainrksh commented 5 years ago

Added TSV source files to be converted at build time by manual.py.
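As a rough illustration of what a build-time conversion could look like, the sketch below loads a headered TSV file into an SQLite table. The column names (`lg`, `inLg`, `name`), the table name, and the function `tsv_to_sqlite` are assumptions made for this example; they are not taken from manual.py.

```python
import csv
import sqlite3


def tsv_to_sqlite(tsv_path, db_path, table='languageNames'):
    """Load a headered TSV file into an SQLite table (hypothetical schema)."""
    # Read all rows first so a malformed TSV fails before the database is touched.
    with open(tsv_path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f, delimiter='\t')
        rows = [(r['lg'], r['inLg'], r['name']) for r in reader]

    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                'CREATE TABLE IF NOT EXISTS {} ('
                'id INTEGER PRIMARY KEY, lg TEXT, inLg TEXT, name TEXT, '
                'UNIQUE (lg, inLg) ON CONFLICT REPLACE)'.format(table)
            )
            conn.executemany(
                'INSERT INTO {} (lg, inLg, name) VALUES (?, ?, ?)'.format(table),
                rows,
            )
    finally:
        conn.close()
    return len(rows)
```

The function returns the number of rows read, which a build step could log or compare against the line count of the source file.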

coveralls commented 5 years ago

Coverage remained the same at 45.16% when pulling fdefcfbe4639b22a1653ba80d6ac6e4605028717 on ayushjainrksh:issue115 into 7fc77432ba1953083a36f81b1945d3e259c840f1 on apertium:master.

ayushjainrksh commented 5 years ago

I haven't integrated it into make yet, but I am going to do that soon.

ayushjainrksh commented 5 years ago

I have added TSV sources for all files. Any idea why it fails for python:nightly?

sushain97 commented 5 years ago

Probably just a flaky failure.

ayushjainrksh commented 5 years ago

Do I need to worry about that, or should I continue writing the Makefile for the automatic build?

sushain97 commented 5 years ago

Just keep going.

ayushjainrksh commented 5 years ago

Okay, I will rename it. langNames.db is not a dependency. Why should we pass it as a command-line argument?

sushain97 commented 5 years ago

> Okay, I will rename it. langNames.db is not a dependency. Why should we pass it as a command-line argument?

Then it doesn't have to be hardcoded here: https://github.com/apertium/apertium-apy/pull/131/files#diff-eb0c3cfbbb1dab4de6647380dec07fe0R22.
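A minimal sketch of taking the database path from the command line instead of hardcoding it might look like the following. The argument names `database` and `tsv_files` and the helper `parse_args` are hypothetical and only illustrate the idea; they are not the interface of the script in this PR.

```python
import argparse


def parse_args(argv=None):
    """Parse the output database path and the input TSV paths."""
    parser = argparse.ArgumentParser(
        description='Build a language-names database from TSV sources.',
    )
    # The output path comes from the caller (e.g. a Makefile rule),
    # so nothing in the script needs to know the name langNames.db.
    parser.add_argument('database', help='path of the SQLite database to write')
    parser.add_argument('tsv_files', nargs='+', help='one or more TSV source files')
    return parser.parse_args(argv)
```

With this shape, a Makefile rule can pass its target and prerequisites straight through, so the file name lives in one place only.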

ayushjainrksh commented 5 years ago

what do I need to change here?

ayushjainrksh commented 5 years ago

Would that suffice?

sushain97 commented 5 years ago

Also, we should enforce sorting. This should be possible by running each file through sort and verifying that the diff is empty. If it isn't, fail the build.
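One way such a check could be sketched in Python, rather than with sort and diff directly, is shown below. The function names and the assumption that each TSV file starts with a header line are illustrative only.

```python
def is_sorted_tsv(path, has_header=True):
    """Return True if the data lines of the file are already in sorted order."""
    with open(path, encoding='utf-8') as f:
        lines = [line.rstrip('\n') for line in f]
    # Assumption: the first line is a header and is excluded from the check.
    body = lines[1:] if has_header else lines
    return body == sorted(body)


def unsorted_files(paths, has_header=True):
    """Return the subset of paths whose data lines are not sorted."""
    return [p for p in paths if not is_sorted_tsv(p, has_header)]
```

A build step could call `unsorted_files` on every TSV source and exit with a non-zero status if the result is non-empty. Note that Python's `sorted` compares strings by code point, which corresponds to `LC_ALL=C sort` and can differ from a locale-aware sort.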

ayushjainrksh commented 5 years ago

I have made the necessary changes except for the issue with the indexes. The indexes are sorted but uneven, which means many of them are missing. Check this out. (Screenshot from 2019-03-26 03-04-45)

sushain97 commented 5 years ago

No one cares about the id in this case. Also, the word index means something completely different in database nomenclature.

sushain97 commented 5 years ago

This diff still doesn't handle the scripts that scrape SIL and CLDR. They need to be modified to output a TSV.

ayushjainrksh commented 5 years ago

> This diff still doesn't handle the scripts that scrape SIL and CLDR. They need to be modified to output a TSV.

I am working on this.

ayushjainrksh commented 5 years ago

I have made all the changes you suggested and converted the scraped data from the SIL and CLDR files to TSV format. I have also fixed a bug in scraper.py. Please check if there is anything else that needs to be changed.

sushain97 commented 5 years ago

Also, please update the README. It currently references the SQL files.

ayushjainrksh commented 5 years ago

Sure

ayushjainrksh commented 5 years ago

I have pushed some changes. Please take a look.

ayushjainrksh commented 5 years ago

I have fixed the issues, and the number of lines changed also looks fine now.

sushain97 commented 5 years ago

Everything seems fine in terms of number of lines except for scraped-cldr.

https://github.com/apertium/apertium-apy/blob/765c6bed0883b16ed92f26aa1af8dfeaa973405c/language_names/scraped-cldr.tsv

vs.

https://github.com/apertium/apertium-apy/blob/7fc77432ba1953083a36f81b1945d3e259c840f1/language_names/scraped.sql

E.g., the Telugu language names seem to be missing in the newer version. Any thoughts?

ayushjainrksh commented 5 years ago

I have scraped this data using scraper.py. These were the changes I was talking about (the reason I used append initially). There are some failures when I run scraper.py. Some of the pairs are not fetched and give an error.

sushain97 commented 5 years ago

Hmm, okay. I can try looking into it if you don't see the issue.

Regardless, appending is not the solution.

ayushjainrksh commented 5 years ago

Failed to retrieve language kaa, exception: Error reading file 'http://www.unicode.org/repos/cldr/tags/latest/common/main/kaa.xml': failed to load HTTP resource

There are more such failures.

sushain97 commented 5 years ago

That particular one is expected. te.xml should have worked though.

See http://www.unicode.org/repos/cldr/tags/latest/common/main/ for what should work.

ayushjainrksh commented 5 years ago

I think it's not included.

sushain97 commented 5 years ago

sushain: ayushjain[m]: what do you mean it isn’t included?

sushain: ayushjain[m]: do you mean it’s erroring? if so, why? are we sending the requests too fast?

ayushjainrksh commented 5 years ago

html_tools_languages = set(map(to_alpha2_code, {
    'arg', 'heb', 'cat', 'sme', 'deu', 'eng', 'eus', 'fra', 'spa', 'ava', 'nno',
    'nob', 'oci', 'por', 'kaz', 'kaa', 'kir', 'ron', 'rus', 'fin', 'tat', 'tur',
    'uig', 'uzb', 'zho', 'srd', 'swe',
}))

These are the ones included. Do any of these stand for te.xml?

ayushjainrksh commented 5 years ago

Found 30 apertium languages: eu, an, kaa, av, ky, pt, fr, nn, se, hr, ru, fi, bs, he, tr, tt, es, en, zh, sc, sr, sv, ug, ro, de, kk, uz, nb, oc, ca.

te doesn't show up here

ayushjainrksh commented 5 years ago

Failed to retrieve language an, exception: Error reading file 'http://www.unicode.org/repos/cldr/tags/latest/common/main/an.xml': failed to load HTTP resource

Failed to retrieve language kaa, exception: Error reading file 'http://www.unicode.org/repos/cldr/tags/latest/common/main/kaa.xml': failed to load HTTP resource

Failed to retrieve language av, exception: Error reading file 'http://www.unicode.org/repos/cldr/tags/latest/common/main/av.xml': failed to load HTTP resource

Failed to retrieve language sc, exception: Error reading file 'http://www.unicode.org/repos/cldr/tags/latest/common/main/sc.xml': failed to load HTTP resource

Failed to retrieve language oc, exception: Error reading file 'http://www.unicode.org/repos/cldr/tags/latest/common/main/oc.xml': failed to load HTTP resource

These are the errors

ayushjainrksh commented 5 years ago

All the files are now sorted

sushain97 commented 5 years ago

Hmm, there are some issues with the way the Apertium languages are fetched. I will take care of correcting those after your changes are complete.

ayushjainrksh commented 5 years ago

I have updated the scraper files to use DictWriter and to sort the output files.
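For readers unfamiliar with the approach, a minimal sketch of writing sorted TSV output with `csv.DictWriter` follows. The column names and the function `write_sorted_tsv` are assumptions for illustration and may not match the actual scraper code.

```python
import csv


def write_sorted_tsv(path, rows, fieldnames=('lg', 'inLg', 'name')):
    """Write dict rows to a TSV file, sorted by the listed columns."""
    fieldnames = list(fieldnames)
    ordered = sorted(rows, key=lambda r: tuple(r[k] for k in fieldnames))
    # Mode 'w' rewrites the whole file each run instead of appending to it.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(
            f, fieldnames=fieldnames, delimiter='\t', lineterminator='\n',
        )
        writer.writeheader()
        writer.writerows(ordered)
```

Regenerating the file in full on every run keeps the output deterministic, so a later sortedness check or diff against the committed file stays meaningful.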
