Kaggle / kaggle-api

Official Kaggle API
Apache License 2.0

Kaggle API does not accept arrays as a value for "source" in .ipynb files #574

Open mgallifrey opened 2 months ago

mgallifrey commented 2 months ago

When attempting to use the Kaggle client to pull, edit and push notebooks associated with the Introduction To Machine Learning course, kaggle kernels push results in an internal server error. Steps to reproduce:

  1. Follow the installation & authentication steps from the documentation
  2. Pull down a notebook associated with an exercise in the tutorial:
    kaggle kernels pull michaelgallifrey/exercise-model-validation -p exercise-model-validation -m
  3. Edit the notebook
  4. Attempt to push the notebook, get error:
    kaggle kernels push -p exercise-model-validation
    500 - An internal server error occurred. Please ensure that your API client is up to date. If it is, please report a bug at github.com/Kaggle/kaggle-api - InternalServerError
  5. Verify your client is up to date before reporting the bug, as instructed:
    kaggle -v
    Kaggle API 1.6.12

In case it helps, here is the kernel-metadata.json (auto-created by the -m flag on the pull):

{
  "id": "michaelgallifrey/exercise-model-validation",
  "id_no": 55316402,
  "title": "Exercise: Model Validation",
  "code_file": "exercise-model-validation.ipynb",
  "language": "python",
  "kernel_type": "notebook",
  "is_private": true,
  "enable_gpu": false,
  "enable_tpu": false,
  "enable_internet": false,
  "keywords": [],
  "dataset_sources": [
    "dansbecker/melbourne-housing-snapshot",
    "iabhishekofficial/mobile-price-classification"
  ],
  "kernel_sources": [],
  "competition_sources": [
    "home-data-for-ml-course"
  ],
  "model_sources": []
}

stevemessick commented 1 month ago

Could you update to the latest version of the Kaggle API and let us know if this is still a problem? If it is, could you provide more detail on the edits you made before pushing? I'm unable to repro the problem.

mgallifrey commented 4 weeks ago

Thanks for looking into it and attempting a repro.

Unfortunately, I am still experiencing the issue:

abc@2a12a90d60dd:/mnt/hdd/src/kaggle$ kaggle -v
Kaggle API 1.6.14
abc@2a12a90d60dd:/mnt/hdd/src/kaggle$ kaggle kernels pull michaelgallifrey/exercise-model-validation -p try-again -m
Source code and metadata downloaded to try-again
abc@2a12a90d60dd:/mnt/hdd/src/kaggle$ kaggle kernels push -p try-again
500 - An internal server error occurred. Please ensure that your API client is up to date. If it is, please report a bug at github.com/Kaggle/kaggle-api - InternalServerError
abc@2a12a90d60dd:/mnt/hdd/src/kaggle$ 

As for the edit, I simply changed "You've built a model. In this exercise you will test how good your model is." to "You've built a model. In this exercise you will test how good your model is. Will it upload?" in the "Recap" section of https://www.kaggle.com/kernels/fork/1259097. It appears to happen with any edit I make though.

stevemessick commented 2 weeks ago

I wasn't able to reproduce your precise problem, but I think you were hitting a bug in the server that has been fixed.

I was able to get an error, which may be what you should have gotten:

Kernel push error: Notebook not found

The problem is I had not versioned the notebook. After I created a version (and ran it, not quick version), then I could push it with no problem.

Let me know if that helps.

mgallifrey commented 2 weeks ago

Thanks for all your hard work on this. Still a no go :(

I saved a version (using "Save & Run All (Commit)"), then did the following:

abc@2a12a90d60dd:/mnt/hdd/src/kaggle$ kaggle kernels pull michaelgallifrey/notebook86b2ec8431 -p test3 -m
Source code and metadata downloaded to test3

Changed "In this exercise you will test how good your model is." to "In this exercise you will test how good your model is and if you can push" and then:

abc@2a12a90d60dd:/mnt/hdd/src/kaggle$ kaggle kernels push -p test3
500 - An internal server error occurred. Please ensure that your API client is up to date. If it is, please report a bug at github.com/Kaggle/kaggle-api - InternalServerError
abc@2a12a90d60dd:/mnt/hdd/src/kaggle$ kaggle -v
Kaggle API 1.6.14

Anything else you want me to try?

stevemessick commented 2 weeks ago

Thanks for the quick response. I'll have to dig into this further.

stevemessick commented 3 days ago

@mgallifrey Sorry for the delay. It occurs to me that there could be an issue with line endings. I wasn't able to reproduce the problem, but I'm pretty sure my editor didn't change any line endings, either. Can you check the before/after content of your *.ipynb file to see if the line endings changed? If that isn't the problem, could you make your notebook public so I can test using it?

mgallifrey commented 3 days ago

Good idea! No difference in line endings (that I can tell), but it looks like, at minimum, the order of some of the JSON got switched around by the editor (diffing one-line files is hard). Is there any reason to believe there's something private in the ipynb file, or would it be safe for me to attach the before and after here for you to have a look?

For what it's worth, I don't think my editor is particularly strange: I'm editing via the VSCode Jupyter plugin (albeit via an open source VSCode fork).

stevemessick commented 3 days ago

I can't rule out the possibility that the reordering is causing the problem. If you pull/push without editing, does it work?

If not, you can attach the before and after versions here.

mgallifrey commented 3 days ago

Ok, I can confirm that I can push and pull when no changes are made.

Your hunch about line endings was a good one: I went ahead and put the before and after files into https://www.jsondiff.com/, and it looks like VSCode is changing the value of "source" in each cell from a string to an array of strings (with each line as an element, still each terminated with '\n'). Ordering aside, that appears to be the only semantic difference.

I'm assuming the VSCode output is still valid ipynb (although I haven't checked the spec); if so, can it be supported by the API?
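
For anyone who wants to check their own notebook without a diff site, the difference described above can be spotted with a few lines of Python. This is only a sketch: the helper name `cells_with_list_source` and the two in-memory notebooks are made up for illustration, and a real check would load the pulled and edited .ipynb files with `json.load` instead.

```python
def cells_with_list_source(notebook):
    """Return the indexes of cells whose "source" is a list rather than a string."""
    return [
        index
        for index, cell in enumerate(notebook.get("cells", []))
        if isinstance(cell.get("source"), list)
    ]


# Hypothetical stand-ins for the notebook before and after editing.
before = {"cells": [{"cell_type": "markdown", "source": "# Recap\nYou've built a model."}]}
after = {"cells": [{"cell_type": "markdown", "source": ["# Recap\n", "You've built a model."]}]}

print(cells_with_list_source(before))  # []
print(cells_with_list_source(after))   # [0]
```

A non-empty result means at least one cell stores its source as an array of lines.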

stevemessick commented 3 days ago

Thanks for checking that.

I think our version of JL is kinda old. I suspect VSCode is targeting a newer JL spec than what we're using. I don't know if there are any plans to update JL (not saying no, just I don't know). VSCode has a huge number of settings. Is there a way to tell it you're working with v2, not v4, JL files? If not, your best bet is to use a dumber editor.

mgallifrey commented 3 days ago

Pardon my ignorance: what does JL stand for?

In any case, no setting that I could find, unfortunately :(

FWIW, the ipynb file that gets pulled down by kernels pull says it's nbformat version 4.4:

    "nbformat": 4,
    "nbformat_minor": 4

and the reference schema for 4.4 says source is supposed to be "represented as an array of lines". The docs say either an array or a string is fine. Totally get that this likely isn't a priority; figured I'd share either way though!
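
To illustrate why the two representations are interchangeable, here is a small sketch showing that a string and the matching array of lines carry the same text. The helper `source_as_text` is a hypothetical name, not part of nbformat or the Kaggle client, and the sample cell content is invented.

```python
def source_as_text(source):
    """Return a cell's source as one string, whether stored as a str or a list of str."""
    if isinstance(source, list):
        return "".join(source)
    return source


as_string = "import pandas as pd\nprint('hello')"
# One element per line, each keeping its trailing '\n', as seen in the edited file.
as_lines = as_string.splitlines(keepends=True)

assert as_lines == ["import pandas as pd\n", "print('hello')"]
assert source_as_text(as_lines) == source_as_text(as_string)
```

Because each element keeps its own line terminator, a plain join with an empty separator is enough to recover the original string.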

(edited to fix a typo)

stevemessick commented 2 days ago

Sorry, JL is JupyterLab, the notebook editor. It's an open source project. The latest version is 4.2.3.

Thanks for the links. Drilling down to the schema definition for source, I see that either a string or array of strings is accepted, so I wonder if there is some other problem that is preventing push from working.

Can you attach your edited notebook? I'll try to repro the problem, then look at the server logs to see what's breaking (if I'm lucky enough to find anything :)

mgallifrey commented 2 days ago

Nice catch! That's what I get for just reading the description and not diving into the definition.

And sure! I've attached both the original (or at least something pulled down via kernels pull; it has some non-VSCode edits that were successfully pushed) and the same file after being edited in VSCode. They're bundled as a zip file because GitHub won't let me upload IPYNB files.

If I get a chance, I might write a little script that turns the source arrays into strings and try pushing the resulting notebook to confirm that's the issue.
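
A rough sketch of what such a script might look like follows. The function names `join_source_arrays` and `convert_file` are hypothetical, it only touches the "source" field of each cell (other list-valued fields are left alone), and it has not been confirmed in this thread that pushing the converted file succeeds.

```python
import json
from pathlib import Path


def join_source_arrays(notebook):
    """Join list-valued "source" fields in place; return how many cells changed."""
    changed = 0
    for cell in notebook.get("cells", []):
        source = cell.get("source")
        if isinstance(source, list):
            cell["source"] = "".join(source)
            changed += 1
    return changed


def convert_file(src, dst):
    """Read the notebook at src, join its source arrays, and write the result to dst."""
    notebook = json.loads(Path(src).read_text(encoding="utf-8"))
    changed = join_source_arrays(notebook)
    Path(dst).write_text(json.dumps(notebook, indent=1), encoding="utf-8")
    return changed
```

Calling `convert_file` with the edited notebook's path as both arguments would rewrite it in place, after which `kaggle kernels push` could be retried.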

stevemessick commented 2 days ago

Thanks. I haven't had a chance to look at your files yet, but I did read some source code and I think the source being an array of strings is the problem. We need to use the Google Cloud protocol buffer (protobuf) file format to upload everything. Our protobuf definition for push only allows strings as the source code. In theory, we should be able to make the Python client detect an array of strings and convert it to a single string (essentially embedding the script you described into the kaggle client), but I have to admit this isn't a very high priority right now.

If you had some free time and wanted to make a contribution, the code to modify is (I think) at this point in the source. But I understand that setting up a dev environment for kaggle-api is a bit time-consuming (having done it a couple times).

mgallifrey commented 2 days ago

Sounds good! I'll take a stab at it when I get a chance.

mgallifrey commented 10 hours ago

You're not wrong about the dev environment being somewhat time-intensive :)

I came across some dependency issues; I filed a bug and submitted a PR for that too.