Closed lfashby closed 3 years ago
wikipron/data/src is host to a lot of different files that do a lot of different things in different places. The unorganized state of wikipron/data/src may be off-putting to new/prospective contributors who don’t want to wade into this large collection of files to try and get their bearings.
I have a few suggestions for things we can do to reorganize wikipron/data/src, though I’d also like suggestions for how (or even if) wikipron/data/src should be reorganized from those who contributed some of these files.
I agree this needs to be reorganized. It's a mess.
A few things come to mind immediately:
- We wrote postprocess but for whatever reason only call split.py from within it. We should call generate_tsv_summary.py from postprocess as well and maybe put both split.py and generate_tsv_summary.py in an src/postprocess_helpers subdirectory.
+1.
The things people need to call, scrape.py and postprocess, should live in some top-level directory; put everything else in a subdirectory maybe.
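To make the first bullet concrete, here is a minimal sketch of a postprocess entry point that runs both helpers in order. It is only an illustration: the thread does not say what language postprocess is written in, and build_commands, the postprocess function, and the helper directory argument are hypothetical names, not existing wikipron code.

```python
import subprocess
import sys
from pathlib import Path

# The two helper scripts named in the thread, in the order they would run.
HELPERS = ("split.py", "generate_tsv_summary.py")


def build_commands(helper_dir, helpers=HELPERS):
    """Returns one interpreter command per helper script, in run order."""
    return [[sys.executable, str(Path(helper_dir) / name)] for name in helpers]


def postprocess(helper_dir, helpers=HELPERS):
    """Runs each helper in turn; check=True stops at the first failure."""
    for command in build_commands(helper_dir, helpers):
        subprocess.run(command, check=True)
```

Something like postprocess("src/postprocess_helpers") would then be the single call users need after a scrape, assuming the helpers end up in that subdirectory.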
- All of our .json files should go in a subdirectory (or at least all of them that are not languages.json).
I agree these should be in a subdirectory but I don't think it should be organized around their JSONness. I have a proposal below.
- The “Big Scrape” Scripts README is out of date, we no longer retry languages 10 times.
+1, easy fix.
- The scripts which have to do with the phones stuff should be in their own subdirectory? There are quite a few of them now.
- covering_grammar.py seems a bit homeless. I don’t think we’ve used it.
I think that in general the phones directory is poorly organized too, since GitHub still shows the generated README after the long list of .phones files.
I’m not sure if putting things in subdirectories of src is a great solution, so any alternatives would be welcome.
Here's a worked proposal for data/:
README.md # manually generated, with links to the immediate subdirectories
frequencies/* # as is
morphology/* # as Reuben is currently developing
phones/README.md # autogenerated
phones/postprocess # new, but just calls whatever you do after you update phones files
phones/lib/*.py # everything else
phones/phones/*.phones
scrape/scrape.py
scrape/postprocess
scrape/README.md # autogenerated
scrape/lib/*.py # everything but scrape.py
scrape/lib/*.json
scrape/tsv/*.tsv
Sidebar: I've also asked the error analysis team to begin merging their code and data into data/, so that will be a test case for whatever proposals...
The proposal makes sense to me, it’ll be a pain to move these files around and make sure everything works, but I think it’s worth it.
I suppose in phones/postprocess we can call generate_phones_summary.py and maybe normalize.py? I think those are the only relevant steps.
Should we break this up into smaller tickets or just assign ourselves to sections here? I can tackle the scrape stuff tonight, although maybe it'd be better to hold off until after the covering grammar stuff gets merged.
normalize.py I think goes in phones/lib since it's not needed except to fix problems. I would just make sure there's a postprocess next to each autogenerated README...
If you want to try to break it up into smaller tickets go ahead. I'm terrible at that. Feel free to assign some of it to me.
I think you can do this independently of the CG work, since I asked Arundhati to take her time.
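The "postprocess next to each autogenerated README" rule could be checked mechanically once files start moving. A rough sketch follows, assuming every README.md below the top level of data/ is autogenerated (the proposal only says so for phones/ and scrape/); missing_postprocess is a hypothetical helper, not something in the repository.

```python
from pathlib import Path


def missing_postprocess(data_dir, readme="README.md", script="postprocess"):
    """Lists subdirectories that hold a README but no postprocess script."""
    root = Path(data_dir)
    missing = []
    for path in sorted(root.rglob(readme)):
        directory = path.parent
        if directory == root:
            # The top-level README is written by hand in the proposal.
            continue
        if not (directory / script).exists():
            missing.append(directory.relative_to(root).as_posix())
    return missing
```

Calling missing_postprocess("data") returns the relative paths that still need a postprocess, so an empty list would mean the convention holds under the assumption above.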
Now that #364 is in this can go through, methinks.
Aidan and I are both going to move some stuff around. I won't have time to work on this until Thursday.
Should I close the bug @ajmalanoski @lfashby?
There are still a few things left to do. I've got to fix some paths in the big scrape README and we need to take care of the covering grammar/error analysis files. Ultimately I think we want to get rid of data/src entirely.
Also we've got to write up data/README.md.
SGTM.