PathwayCommons / pathway-abstract-classifier

A tool to classify journal abstracts with pathway content.
MIT License

Update archive #17

Closed Steven-Palayew closed 2 years ago

Steven-Palayew commented 2 years ago

Updates the archive folder with the most important code from the model development and testing process, along with a description file. Some questions that should probably be addressed before we merge:

  1. I'm using the original code I used to generate the model, etc. Would it be a good idea to add more comments to this code, or just leave it as is? If yes to comments, should I create two copies (i.e. original and commented)?
  2. Am I including too much code/files? Too little?

Steven-Palayew commented 2 years ago

Note: I started updating this in the main branch rather than this branch, so I recommend you review the whole folder.

JohnGiorgi commented 2 years ago

Hey @Steven-Palayew what is the purpose of the archive folder? Also tagging @jvwong because I think he is more likely to have to use this than me.

> I'm using the original code I used to generate the model, etc. Would it be a good idea to add more comments to this code, or just leave it as is?

The rule of thumb is to comment anything that is non-obvious and can't be understood from the code itself. E.g. if you have a line like model = load_model(...), don't comment it with "Loads model", because that's easily understood. If there's something someone can't guess from your code alone, you should likely comment it. See the advice here.
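
For instance (a hypothetical snippet, not code from this repo):

```python
def load_scores(path: str) -> list[float]:
    # Bad: a comment like "open the file" would just restate the next line.
    with open(path) as f:
        # Good: this records a decision the code alone doesn't explain.
        # Scores are stored one per line as percentages, so divide by 100
        # to get probabilities; blank lines mark abstracts we skipped.
        return [float(line) / 100 for line in f if line.strip()]
```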

> If yes to comments, should I create two copies (i.e. original and commented)?

No, never maintain two versions of the code if you can avoid it. That violates the DRY principle.

> Am I including too much code/files? Too little?

You should include anything necessary to repeat what you did, including training and evaluating the model, tuning hyperparameters, etc.
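
To make that concrete, here's a minimal sketch of the shape such code should take (toy data and made-up names, not this project's actual pipeline): fix the seeds, state the hyperparameters, and record the results so the run can be repeated.

```python
# Hypothetical sketch of a repeatable training/evaluation run (toy data,
# made-up names): seeded splits, explicit hyperparameters, saved metrics.
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

SEED = 42  # any fixed seed; recorded below so the split can be reproduced
C = 1.0    # example hyperparameter; real values would come from tuning

# Toy stand-in for abstracts and their pathway/non-pathway labels.
texts = ["mapk pathway activates erk signaling", "quarterly stock market report"] * 50
labels = [1, 0] * 50

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=SEED, stratify=labels
)

vectorizer = TfidfVectorizer()
clf = LogisticRegression(C=C, random_state=SEED)
clf.fit(vectorizer.fit_transform(X_train), y_train)
preds = clf.predict(vectorizer.transform(X_test))

# Persist everything needed to repeat or compare this run.
with open("run_metrics.json", "w") as f:
    json.dump({"seed": SEED, "C": C, "f1": float(f1_score(y_test, preds))}, f)
```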


Small notes:

Steven-Palayew commented 2 years ago

@JohnGiorgi this is just meant as a reference, and contains the most important notebooks involved in the model training/development process. If I included literally everything involved in this process, you're talking about a number of notebooks that I don't think would be practical to go through (i.e. >10).

> You should include anything necessary to repeat what you did, including training and evaluating the model, tuning hyperparameters, etc.

So basically, while I think this seems like overkill, I figure I need some sort of middle ground between that and just not having an archive folder at all.

JohnGiorgi commented 2 years ago

> @JohnGiorgi this is just meant as a reference, and contains the most important notebooks involved in the model training/development process. If I included literally everything involved in this process, you're talking about a number of notebooks that I don't think would be practical to go through (i.e. >10).
>
> > You should include anything necessary to repeat what you did, including training and evaluating the model, tuning hyperparameters, etc.
>
> So basically, while I think this seems like overkill, I figure I need some sort of middle ground between that and just not having an archive folder at all.

Just use your best judgment. As a very general rule of thumb, anything needed to reproduce your results should be on GitHub; there's no argument for omitting it. No problem if you don't want to include one-off notebooks that you used for less critical things.

Steven-Palayew commented 2 years ago

@JohnGiorgi a little bit confused as to what you mean by this.

demo_model_kaggle.ipynb should be enough in and of itself to train a model, given data that has been collected and, where necessary, combined correctly, along with potentially mislabeled articles identified by cleanlab. So I suppose this is more or less enough to reproduce my results, given that I also included the data in the Data folder? From there I also included:

* files I used to assess model performance
* a file that uses cleanlab to identify potentially mislabeled articles, since this isn't routine or basic (sketched below)

I omitted:

* files that basically just did data wrangling on the somewhat messy data I was initially given
* files that were used for experimentation
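
For reference, a minimal sketch of that cleanlab step (toy numbers, and assuming the cleanlab 2.x API; the actual notebook in the archive folder may differ):

```python
# Flag potentially mislabeled abstracts with cleanlab (toy numbers only;
# assumes the cleanlab 2.x API, not the archived notebook itself).
import numpy as np
from cleanlab.filter import find_label_issues

# Given labels and out-of-sample predicted probabilities for each abstract
# (e.g. obtained via cross-validation).
labels = np.array([0, 1, 1, 0, 1])
pred_probs = np.array([
    [0.90, 0.10],  # confidently class 0, labeled 0 -> consistent
    [0.20, 0.80],  # confidently class 1, labeled 1 -> consistent
    [0.95, 0.05],  # confidently class 0 but labeled 1 -> suspect
    [0.80, 0.20],
    [0.40, 0.60],
])

# Indices of abstracts whose labels look inconsistent with the model,
# worst (least self-confident) first.
issue_idx = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(issue_idx)  # expected to flag index 2
```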

JohnGiorgi commented 2 years ago

> @JohnGiorgi a little bit confused as to what you mean by this.
>
> demo_model_kaggle.ipynb should be enough in and of itself to train a model, given data that has been collected and, where necessary, combined correctly, along with potentially mislabeled articles identified by cleanlab. So I suppose this is more or less enough to reproduce my results, given that I also included the data in the Data folder? From there I also included:
>
> * files I used to assess model performance
> * a file that uses cleanlab to identify potentially mislabeled articles, since this isn't routine or basic
>
> I omitted:
>
> * files that basically just did data wrangling on the somewhat messy data I was initially given
> * files that were used for experimentation

You originally asked: "Am I including too much code/files? Too little?" I was just trying to answer that question. If you already include everything needed to reproduce your trained model, then you are not including too much or too little.

Steven-Palayew commented 2 years ago

Right, sorry if I wasn't super clear here. So just to confirm: it's okay that I included just enough here to reproduce the training and assessment of the model featured in this repository, and I don't need to include enough code to replicate everything important I did during the term?

JohnGiorgi commented 2 years ago

> Right, sorry if I wasn't super clear here. So just to confirm: it's okay that I included just enough here to reproduce the training and assessment of the model featured in this repository, and I don't need to include enough code to replicate everything important I did during the term?

I don't see the argument for excluding anything, unless it was just some throwaway script you used once. You don't have to spend a bunch of time cleaning it up or anything; just upload it to the archive folder and provide some minimal documentation or description.

Steven-Palayew commented 2 years ago

@JohnGiorgi lmk if we're good to go here

JohnGiorgi commented 2 years ago

> @JohnGiorgi lmk if we're good to go here

Yup, will you just rename the description file to README.md so GitHub renders it?

Steven-Palayew commented 2 years ago

Yep done

JohnGiorgi commented 2 years ago

> Yep done

Awesome, LGTM!