Closed by jvwong 2 years ago
Other notes:
- What is `predict` doing?
- What is the `[SEP]` thing doing?

My experience trying to follow the README:
What version of python do I need?
- I am using 3.8 but my conda environment defaulted to 2.7
What type of performance should I expect on some typical hardware?
- If I run the sample code, how long is this going to take? What should I expect?
Typo (extra closing bracket) in the README:
`predictions = model.predict(texts))`
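i.e. presumably it should just be `predictions = model.predict(texts)`.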
Once you have requirements (there are only 3 - see requirements.txt) installed, you can simply run:
- what does 'run' mean?
- how do I know if it worked?
Running

I followed the Installation and copied the Quickstart code into a `main.py` in the top-level directory. Then I ran `python main.py` and got a bunch of warnings:

```
site-packages/ktrain/text/preprocessor.py:216: UserWarning: List or array of two texts supplied, so task being treated as text classification. If this is a sentence pair classification task, please cast to tuple.
  warnings.warn('List or array of two texts supplied, so task being treated as text classification. ' +\
```
I think it worked, but I'm not sure. What should I expect?
I'm using Python 3.9. When we were trying to make this a Poetry package, I believe we were thinking anything over 3.7 should be fine. @JohnGiorgi thoughts here?
Good question. I will test this.
Nice catch, not sure how that happened. Will change accordingly, thanks!
Honestly that description was originally from when I had a three line quickstart, and feels a bit outdated and awkward now. I think I'm just going to delete it.
I get that warning too. Pretty sure it's fine and expected. Maybe @JohnGiorgi can confirm?
We discussed the [SEP] thing in the Slack a little while back. Basically it's just expected by BERT-based models to separate the two parts of the input.
Output will be either 1 (article belongs in Biofactoid) or 0 (article does not belong in Biofactoid). In terms of input, I showed a variety of options in the tutorial that I am going to push later today, but for more detail I recommend you take a look at the Ktrain docs.
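For a rough sketch of what that looks like in practice (assuming the `model` object from the Quickstart and integer 0/1 labels, which may differ slightly depending on how the predictor was loaded):

```python
# Rough sketch only: `texts` is a list of input strings as in the Quickstart,
# and predict() is assumed to return one 0/1 label per input.
predictions = model.predict(texts)

for text, label in zip(texts, predictions):
    # 1 = article belongs in Biofactoid, 0 = article does not
    status = "include in Biofactoid" if label == 1 else "exclude"
    print(f"{status}: {text[:60]}...")
```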
Hope this helps! 😄
Maybe think about updating the README for some of these. Also, here's an example of documenting important functions:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/join
I used the test_model.py code (which the Quickstart is based off of) and runtime on my laptop is just over 20 seconds. My laptop has an Intel i7-1165G7 and 16 GB of RAM.
RE Python version: @Steven-Palayew Can we add this line right under Installation: "This repository requires Python 3.7 or later"
RE What does "run" mean: @jvwong This just means run the code in any python interpreter.
RE How do you know if it worked: @jvwong This is what the assert statement is for. If it didn't work, the assert statement will throw an error.
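Roughly, the idea is something like this (a hypothetical illustration, not the exact Quickstart code):

```python
# Hypothetical check of the kind meant here; the actual assert in the
# Quickstart may differ.
predictions = model.predict(texts)
assert len(predictions) == len(texts), "expected one prediction per input text"
# If the assert passes silently, things worked; otherwise an AssertionError is raised.
```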
RE The warnings: These are unfortunately out of our control as they are being logged by our dependencies. We could add some boilerplate to remove them but IMO that would be even more confusing. Another option would be to add a note in the readme under the code snippet in quickstart, something like:
> `ktrain` may throw a `UserWarning` which you can safely ignore.
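For reference, the kind of boilerplate I have in mind would look something like this (not something we currently ship in the repo):

```python
import warnings

# Silence the specific UserWarning raised by ktrain's preprocessor.
# Hiding warnings like this is arguably more confusing than the warning
# itself, which is why it isn't in the Quickstart.
warnings.filterwarnings(
    "ignore",
    message="List or array of two texts supplied",
    category=UserWarning,
)
```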
RE The SEP token: @jvwong This is a soft requirement of the pre-trained model we are using. It's used during training to separate the title from the abstract for the model, so we have to include it when making predictions. @Steven-Palayew I think this will be less confusing once #8 is addressed.
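In other words, something like this (the title and abstract values are made up for illustration):

```python
# Illustrative only: the pre-trained model saw "title [SEP] abstract" strings
# during training, so inputs are formatted the same way at prediction time.
title = "A made-up title about a signalling pathway"
abstract = "A made-up abstract describing the molecular interactions ..."
texts = [f"{title} [SEP] {abstract}"]
predictions = model.predict(texts)  # same model.predict call as in the Quickstart
```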
@jvwong Thanks for the feedback, this will improve the README. @Steven-Palayew I wonder if we should also provide a quickstart notebook and link it to Colab? Then a user can actually see the whole process end-to-end.
@JohnGiorgi I made a PR addressing #12 and, by extension, the first few suggestions you brought up here. In terms of the last point, I added a tutorial, linked in the README, which I believe addresses this. I still want to modify it based on your suggestion about the [SEP] tokens, and hopefully, if @jvwong can package the code to go from UIDs -> titles + abstracts by end of week, I can also include an example of how that pipeline (UIDs -> classifications) would work.
Cool, I would add a colab link to the readme: https://colab.research.google.com/github/PathwayCommons/pathway-abstract-classifier/blob/main/Tutorial.ipynb. That way someone can try it out quickly in the browser. This is also a good place to show the installation process.