CUNY-CL / wikipron

Massively multilingual pronunciation mining
Apache License 2.0

wikipron/data/src reorganization #360

Status: Closed (lfashby closed this issue 3 years ago)

lfashby commented 3 years ago

wikipron/data/src is host to a lot of different files that do a lot of different things in different places. The unorganized state of wikipron/data/src may be off-putting to new/prospective contributors who don’t want to wade into this large collection of files to try and get their bearings.

I have a few suggestions for things we can do to reorganize wikipron/data/src, though I’d also like suggestions for how (or even if) wikipron/data/src should be reorganized from those who contributed some of these files.

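A few things come to mind immediately:

  • We wrote postprocess but for whatever reason only call split.py from within it. We should call generate_tsv_summary.py from postprocess as well and maybe put both split.py and generate_tsv_summary.py in an src/postprocess_helpers subdirectory.
  • All of our .json files should go in a subdirectory (or at least all of them that are not languages.json).
  • The “Big Scrape” Scripts README is out of date, we no longer retry languages 10 times.
  • The scripts which have to do with the phones stuff should be in their own subdirectory? There are quite a few of them now.
  • covering_grammar.py seems a bit homeless. I don’t think we’ve used it.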

I’m not sure if putting things in subdirectories of src is a great solution, so any alternatives would be welcome.

kylebgorman commented 3 years ago

wikipron/data/src is host to a lot of different files that do a lot of different things in different places. The unorganized state of wikipron/data/src may be off-putting to new/prospective contributors who don’t want to wade into this large collection of files to try and get their bearings.

I have a few suggestions for things we can do to reorganize wikipron/data/src, though I’d also like suggestions for how (or even if) wikipron/data/src should be reorganized from those who contributed some of these files.

I agree this needs to be reorganized. It's a mess.

A few things come to mind immediately:

  • We wrote postprocess but for whatever reason only call split.py from within it. We should call generate_tsv_summary.py from postprocess as well and maybe put both split.py and generate_tsv_summary.py in an src/postprocess_helpers subdirectory.

+1.

The things people need to call (scrape.py and postprocess) should live in some top-level directory; everything else can go in a subdirectory, maybe.

  • All of our .json files should go in a subdirectory (or at least all of them that are not languages.json).

I agree these should be in a subdirectory but I don't think it should be organized around their JSONness. I have a proposal below.

  • The “Big Scrape” Scripts README is out of date, we no longer retry languages 10 times.

+1, easy fix.

  • The scripts which have to do with the phones stuff should be in their own subdirectory? There are quite a few of them now.
  • covering_grammar.py seems a bit homeless. I don’t think we’ve used it.

I think that in general the phones directory is poorly organized too, since GitHub still shows the generated README after the long list of .phones files.

I’m not sure if putting things in subdirectories of src is a great solution, so any alternatives would be welcome.

Here's a worked proposal for data/:

README.md  # manually generated, with links to the immediate subdirectories
frequencies/*  # as is
morphology/*  # as Reuben is currently developing
phones/README.md  # autogenerated
phones/postprocess  # new, but just calls whatever you do after you update phones files
phones/lib/*.py  # everything else
phones/phones/*.phones
scrape/scrape.py
scrape/postprocess
scrape/README.md  # autogenerated
scrape/lib/*.py  # everything but scrape.py
scrape/lib/*.json
scrape/tsv/*.tsv
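
As a rough illustration of the layout above, the rules could be sketched as a small function that maps a file name from the old tree to its proposed location. This is only a sketch under assumptions: the function name `proposed_destination` and the `phones_scripts` parameter are hypothetical, and the thread does not enumerate which scripts count as phones tooling, so the caller has to supply that set.

```python
from pathlib import PurePosixPath

# Entry points that stay directly under scrape/ in the proposal.
ENTRY_POINTS = {"scrape.py", "postprocess"}


def proposed_destination(filename, phones_scripts=frozenset()):
    """Maps a file name from the old tree to its path in the proposed layout.

    `phones_scripts` is a hypothetical, caller-supplied set of script names
    that belong with the phones tooling; the thread does not list them.
    """
    suffix = PurePosixPath(filename).suffix
    if filename in ENTRY_POINTS:
        return PurePosixPath("scrape") / filename
    if filename in phones_scripts:
        return PurePosixPath("phones/lib") / filename
    if suffix in {".py", ".json"}:
        # Everything but the entry points goes under scrape/lib/.
        return PurePosixPath("scrape/lib") / filename
    if suffix == ".tsv":
        return PurePosixPath("scrape/tsv") / filename
    if suffix == ".phones":
        return PurePosixPath("phones/phones") / filename
    raise ValueError(f"No rule for {filename!r}")
```

Under these rules `split.py` would land in `scrape/lib/`, and `normalize.py` would land in `phones/lib/` only if it is passed in `phones_scripts`. Files the proposal does not cover raise an error rather than being placed by guesswork.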

kylebgorman commented 3 years ago

Sidebar: I've also asked the error analysis team to begin merging their code and data into data/, so that will be a test case for whatever proposals...

lfashby commented 3 years ago

The proposal makes sense to me. It’ll be a pain to move these files around and make sure everything works, but I think it’s worth it.

I suppose in phones/postprocess we can call generate_phones_summary.py and maybe normalize.py? I think those are the only relevant steps.

Should we break this up into smaller tickets or just assign ourselves to sections here? I can tackle the scrape stuff tonight, although maybe it'd be better to hold off until after the covering grammar stuff gets merged.

kylebgorman commented 3 years ago

normalize.py I think goes in phones/lib since it's not needed except to fix problems. I would just make sure there's a postprocess next to each autogenerated README...
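
The idea of a postprocess entry point sitting next to each autogenerated README and calling helpers in lib/ could look roughly like the sketch below. It is not the project's actual postprocess: the thread does not say whether that is written in Python or shell, the function names `build_commands` and `run_postprocess` are hypothetical, and the helpers are assumed to take no command-line arguments.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(lib_dir, helpers):
    """Builds one interpreter command per helper script, in the given order."""
    return [[sys.executable, str(Path(lib_dir) / name)] for name in helpers]


def run_postprocess(lib_dir, helpers, runner=subprocess.run):
    """Runs each helper in turn; check=True stops at the first failure."""
    for command in build_commands(lib_dir, helpers):
        runner(command, check=True)
```

For the phones directory this might be called as `run_postprocess("phones/lib", ["generate_phones_summary.py"])`, leaving normalize.py out because it is only needed to fix problems. The `runner` parameter exists so the sequence can be checked without actually launching the scripts.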

If you want to try to break it up into smaller tickets go ahead. I'm terrible at that. Feel free to assign some of it to me.

I think you can do this independently of the CG work, since I asked Arundhati to take her time.

kylebgorman commented 3 years ago

Now that #364 is in, this can go through, methinks.

lfashby commented 3 years ago

Aidan and I are both going to move some stuff around. I won't have time to work on this until Thursday.

kylebgorman commented 3 years ago

Should I close the bug @ajmalanoski @lfashby?

lfashby commented 3 years ago

There are still a few things left to do. I've got to fix some paths in the big scrape README and we need to take care of the covering grammar/error analysis files. Ultimately I think we want to get rid of data/src entirely. Also we've got to write up data/README.md.

kylebgorman commented 3 years ago

SGTM.

On Fri, Mar 26, 2021 at 12:22 PM Lucas Ashby wrote:

There are still a few things left to do. I've got to fix some paths in the big scrape README and we need to take care of the covering grammar/error analysis files. Ultimately I think we want to get rid of data/src entirely.
