allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Module organization is a bit confusing #4051

Closed · nelson-liu closed 4 years ago

nelson-liu commented 4 years ago

Thanks for working on this, it seems like a nice step in the right direction.

When poking around, I found the structure of the module quite confusing, to the point where I even thought that things had been removed.

Let's take allennlp_models.rc as an example:

All the BiDAF stuff is in allennlp_models.rc.bidaf; this makes sense.

allennlp_models.rc.common has some utils for readers and models... and also eval scripts (but not all of them)? And the SQuAD dataset reader?

allennlp_models.rc.dialog_qa has the DialogQA model, and also the QuAC dataset reader? It's weird to me that dataset readers are sometimes in specific model directories and other times aren't.

Not sure what makes allennlp_models.rc.other "other"...

allennlp_models.rc.qanet has the DROP reader? And the NAQANet model as well? I'd have expected these two to be in different places... also the Quoref eval script?

My point in filing this issue is that, for a given component I'd like to use, it's hard to predict where it will be. Why not just mirror the structure of the allennlp repo, and have:

allennlp_models.rc.data
allennlp_models.rc.models
allennlp_models.rc.common / util
allennlp_models.rc.metrics

etc.

instead of trying to group by "task" / "dataset", which can be a blurry and arbitrary line to draw?
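
To make the contrast concrete, here is a minimal sketch of what imports could look like under the mirrored layout; the module paths are hypothetical and the class names are illustrative:

```python
# Sketch of imports under the proposed mirrored layout (hypothetical paths):
from allennlp_models.rc.data import SquadReader, DropReader
from allennlp_models.rc.models import BidirectionalAttentionFlow, NumericallyAugmentedQaNet

# versus the layout being discussed, where a reader lives wherever the model
# that uses it lives (here, the DROP reader under the qanet package):
from allennlp_models.rc.qanet import DropReader
```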

nelson-liu commented 4 years ago

(tagging @dirkgr; I noticed that you aren't watching the repo, sorry for the noise)

nelson-liu commented 4 years ago

Beyond this: I'm not sure that grouping things by rc, ner, etc. is the right grouping either. Those categories are fairly well defined, but for a module like syntax, it's unclear what would be in it just from the name. I certainly wouldn't have expected SRL / OpenIE to be in there... I'd split that up into constituency_parsing, dependency_parsing, srl, and openie, or just get rid of the whole thing and have one allennlp_models.models.

dirkgr commented 4 years ago

@matt-gardner, what do you think?

I did it this way partly because the original plan was to have separate repositories for each of the categories, but also partly because when you work on a model, you usually work on the reader and the model at the same time, and it's confusing to have them far apart in the file hierarchy. Separating them also makes it harder to copy and paste out exactly what you need.

We could bring back data / models / predictors / metrics for each subdirectory, but we would just create a bunch of directories, each of which contains only one file. I don't see how that helps with discoverability.

qanet has the DROP reader because naqanet is there, and naqanet is the only model using that reader. If you make your own DROP model, chances are you can't re-use the same reader anyway.

I added specific model directories only in the cases where there was too much unrelated stuff and I wanted to keep things a bit more organized. That's true of rc. In another case, coref for example, there is only one model; what's the use of sticking it into its own directory?

There are some components that are re-used across models, and I don't have a solid plan for those. That's where the "other" or "common" directories come in. There is probably a better solution.
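
One possible way to keep discoverability independent of the file layout, whichever grouping wins: re-export the public components from each package's `__init__.py`, so users get one predictable import path even if files move around. A minimal sketch, with hypothetical submodule paths:

```python
# allennlp_models/rc/__init__.py -- a hypothetical sketch, not the actual file.
# Re-exporting here lets users write `from allennlp_models.rc import DropReader`
# without knowing which model-specific subpackage the reader lives in.
from allennlp_models.rc.qanet.drop_reader import DropReader  # hypothetical path
from allennlp_models.rc.bidaf.bidaf_model import BidirectionalAttentionFlow  # hypothetical path
```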

MaksymDel commented 4 years ago

Alternatively, we could name models, readers, and predictors consistently and put them into the root folder. Huggingface, for example, uses this project structure, and I personally like it.

It would not work for the core allennlp repo, but for allennlp-models it makes a bit of sense to me:

reader_squad_transformer.py
reader_lm.py
...
model_bidaf.py
model_transformer_qa.py
model_transformer_ner.py
model_lm_calipso.py
model_lm_masked.py
model_lm_next_token.py
...
predictor_nli.py
predictor_transformer_qa.py
predictor_transformer_ner.py
predictor_lm.py
predictor_lm_masked.py
predictor_lm_next_token.py
...
common / util folder

I am not sure what is best, though. The root folder might become too long in this case, and grouping by task makes sense, since readers, models, and predictors are coupled and having separate folders highlights that...

So what is needed is ease of discovery.

Maybe some kind of mix of both? E.g. general folders roughly representing task types: sequence_labeling / sentence_classification / sequence_to_sequence / question_answering / language_modeling / etc. Inside them, consistently named files like model_* + predictor_* + reader_*.

These folder names mix levels of abstraction (e.g. masked LM is arguably a kind of sequence labeling), but from the user perspective, I would immediately see where BERT goes and where NER or POS tagging goes. And inside the concrete subfolder, I can easily navigate between the model_* / reader_* files, as sketched below.
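
For concreteness, a sketch of that mixed layout (folder and file names are illustrative, not a concrete proposal):

```
allennlp_models/
    question_answering/
        reader_squad_transformer.py
        model_transformer_qa.py
        predictor_transformer_qa.py
    sequence_labeling/
        reader_ner_transformer.py
        model_transformer_ner.py
        predictor_transformer_ner.py
    common/
```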

matt-gardner commented 4 years ago

My vote would be for a top-level directory structure based on task formats.

Then, within a top-level directory, if there are just a handful of files, it can stay flat (as for coref). For something like reading comprehension, which has a ton of datasets, models, and auxiliary code, I'd probably vote for something like the allennlp module organization (readers / models / eval / util). See the sketch below.
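
A hypothetical sketch of that layout (directory and file names are illustrative):

```
allennlp_models/
    coref/            # only a handful of files, so it stays flat
        coref_reader.py
        coref_model.py
        coref_predictor.py
    rc/               # lots of datasets, models, and auxiliary code
        readers/
        models/
        eval/
        util/
```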

nelson-liu commented 4 years ago

Delayed on this, but isn't this an issue for allenai/allennlp-models? I think the current structure of allennlp is fine; it's the models repo that is confusing.

nelson-liu commented 4 years ago

Just wanted to ping on this again, since it looks like there are preparations for a 1.0. Is this something that is planned to be folded into the 1.0 of allennlp-models?

There's a lot more inertia around moving stuff after an initial release, for backward-compatibility reasons... so I'd really like to see this land in 1.0.

I'd even be willing to volunteer to make this change; it's really that important to me from a usability standpoint.

epwalsh commented 4 years ago

I agree we should address this before the big one-point-o, and I like the idea of organizing by task format.

matt-gardner commented 4 years ago

Seems like we've got enough votes to move forward? @nelson-liu, I'd say please go for it, and @dirkgr, let us know if you have any objections.

nelson-liu commented 4 years ago

Just so I know, for prioritization purposes: do you all have a particular timeline for 1.0?

epwalsh commented 4 years ago

@nelson-liu I don't think we have much of a timeline, it's just going to be whenever we get through the remaining issues on this milestone: https://github.com/allenai/allennlp/milestone/10

dirkgr commented 4 years ago

Alright, I'll change this according to @matt-gardner's suggestion above.

dirkgr commented 4 years ago

The PR is up at https://github.com/allenai/allennlp-models/pull/58. I put @matt-gardner and @nelson-liu on as reviewers.