dockstore / dockstore

Our VM/Docker sharing infrastructure and management component
https://dockstore.org/
Apache License 2.0
116 stars 27 forks source link

SEAB-6423: Consistently impose SourceFile limits #5893

Closed svonworl closed 2 weeks ago

svonworl commented 1 month ago

Update

Alrighteee... the original PR has been overhauled. Important changes and observations:

Description Whilst diagnosing https://ucsc-cgl.atlassian.net/browse/SEAB-6423, we discovered that when registering a 2.5MB notebook file, the corresponding SourceFile's content is null. This causes the UI "Preview" notebook tab to report that the notebook does not have any content.

Investigation revealed that the webservice only enforces the maximum size and non-binary SourceFile requirements on the old entry registration path. On the .dockstore.yml registration path, SourceFiles are still effectively limited, but only because:

To address the above, this PR:

If we consistently use these SourceFile helper functions, we make it easy to tweak the SourceFile creation process (to add a new requirement, for example) and ensure that, should we modify the SourceFile object itself, all fields get copied correctly.

As before, when a file doesn't meet the size or non-binary requirements, we set the corresponding SourceFile's content to an error message. For example, "Dockstore does not store binary files". However, code that subsequently uses the SourceFile has no way to differentiate the error message from valid file content. So, after this PR, when a "too large" notebook is registered, the notebook's "Preview" tab will try to parse the error message as a Jupyter notebook, and fail in a different way.

Thus, in a followup PR, I propose to modify SourceFile so that we can differentiate a normal SourceFile from one that didn't meet our requirements. Perhaps we store the error message in a different field, or add a flag or enum that differentiates "COMPLETE" and "ERROR" content. An enum would allow us to support other states, such as a "SUMMARY" state, which would describe an abbreviated form of the full content (perhaps useful if we diversify into indexing AI models, which we might not want [or be able] to store in complete form).

It can be argued that [some of?] the helpers should be methods of SourceFile, etc. If anyone feels super-strongly, we can tweak. There's a rationale behind the current design, but I've already typed too much in this description, so I won't dump it here.

This PR is a little bit scary from a reliability/correctness point of view. The new code passes our existing tests, and appears to be covered. However, given the variety of states and contexts involved, it's very difficult to confirm that we haven't introduced a bug, without spending tons of time auditing the existing tests and filling any holes. Given the way surrounding code is currently architected, it's hard to test, which makes verifying correctness even more challenging. It'll help to visually inspect very carefully and be on the lookout for sourcefile-related problems after we deploy to qa, but it's still a little risky.

Review Instructions Register the notebook which is linked in the ticket, and confirm that Dockstore successfully read it, by visiting the notebook page and confirming that the Preview is correct.

Issue https://ucsc-cgl.atlassian.net/browse/SEAB-6423

Security and Privacy

No concerns.

e.g. Does this change...

Please make sure that you've checked the following before submitting your pull request. Thanks!

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 66.97248% with 36 lines in your changes missing coverage. Please review.

Project coverage is 74.36%. Comparing base (74674f8) to head (ebb3e5a).

Files Patch % Lines
...tore/webservice/core/LimitedSourceFileBuilder.java 10.52% 34 Missing :warning:
...re/webservice/languages/LanguagePluginHandler.java 81.81% 1 Missing and 1 partial :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## develop #5893 +/- ## ============================================= - Coverage 74.54% 74.36% -0.19% - Complexity 5364 5368 +4 ============================================= Files 374 375 +1 Lines 19430 19418 -12 Branches 2028 2030 +2 ============================================= - Hits 14485 14441 -44 - Misses 3973 4007 +34 + Partials 972 970 -2 ``` | [Flag](https://app.codecov.io/gh/dockstore/dockstore/pull/5893/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dockstore) | Coverage Δ | | |---|---|---| | [bitbuckettests](https://app.codecov.io/gh/dockstore/dockstore/pull/5893/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dockstore) | `26.93% <28.44%> (-0.09%)` | :arrow_down: | | [hoverflytests](https://app.codecov.io/gh/dockstore/dockstore/pull/5893/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dockstore) | `27.39% <26.60%> (-0.05%)` | :arrow_down: | | [integrationtests](https://app.codecov.io/gh/dockstore/dockstore/pull/5893/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dockstore) | `56.89% <55.96%> (-0.14%)` | :arrow_down: | | [languageparsingtests](https://app.codecov.io/gh/dockstore/dockstore/pull/5893/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dockstore) | `11.09% <0.00%> (+<0.01%)` | :arrow_up: | | [localstacktests](https://app.codecov.io/gh/dockstore/dockstore/pull/5893/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dockstore) | `21.59% <24.77%> (-0.03%)` | :arrow_down: | | [toolintegrationtests](https://app.codecov.io/gh/dockstore/dockstore/pull/5893/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dockstore) | `30.33% <32.11%> (-0.03%)` | :arrow_down: | | [unit-tests_and_non-confidential-tests](https://app.codecov.io/gh/dockstore/dockstore/pull/5893/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dockstore) | `25.99% <20.18%> (-0.01%)` | :arrow_down: | | [workflowintegrationtests](https://app.codecov.io/gh/dockstore/dockstore/pull/5893/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dockstore) | `38.20% <44.03%> (-0.18%)` | :arrow_down: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dockstore#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

denis-yuen commented 1 month ago

increases the maximum size limit of a Jupyter notebook file to 3MB.

From a different conversation, the limit per file is possibly less important than the limit per version (or even per workflow?). Link follow-up ticket here if spun off

Thus, in a followup PR, I propose to modify SourceFile so that we can differentiate a normal SourceFile from one that didn't meet our requirements.

:+1:

svonworl commented 3 weeks ago

Ok, I overhauled the PR to address most of the feedback. Please refer to the "Update" section at the top of the PR description for more information.

sonarcloud[bot] commented 2 weeks ago

Quality Gate Failed Quality Gate failed

Failed conditions
63.2% Coverage on New Code (required ≥ 80%)

See analysis details on SonarCloud