dgkf / rvalhub-repo-filters-mvp

A quick demo of the R Validation Hub regulatory repositories filterable package index concept

Comments on prototype #1

Open Xyarz opened 1 year ago

Xyarz commented 1 year ago

@dgkf thanks for the initial prototype! It looks really nice, I have to say, and I especially like the setup with the RProfile :) I personally think this should already be enough for the presentation tomorrow, for gathering initial feedback, and for getting your idea across. The idea seems feasible and easy to adjust, which is always handy with such approaches.

Some thoughts I have right away: as you already mention in the README, it would be good to include dependencies and relations across packages and to individualize the risk levels accordingly, though I don't feel this is necessary for the initial setup. I could also see the option to provide a function / template which populates the respective risk code based on specified parameters. In addition, a further option could be to automatically create and send an email containing the vulnerabilities found by oysteR to a specified person / email address. Furthermore, I think a timestamp, the user, and some system specs on who made those changes would be very interesting to include in such a message, as this would increase reproducibility and should be tracked.

A potential option to include at a later stage could be compatibility checks with renv snapshots, as many already use those for pinning dependencies: installation would abort if it would change a version compared to the renv snapshot in use. With regard to riskassessment, I could see the option to allow accept_vulnerabilities = TRUE only for certain roles, as well as working on compatibility with that working group on certain features.
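To make the renv compatibility check concrete, here is a minimal sketch of the idea in Python. It is not code from the prototype (which is written in R). It assumes only the documented shape of an `renv.lock` file (JSON with a top-level `Packages` object whose entries carry a `Version` field); the function names `version_conflicts` and `check_install`, and the versions in the example lockfile, are hypothetical and purely illustrative.

```python
import json


def version_conflicts(lockfile_text, requested):
    """Map each package whose requested version differs from the lockfile
    to a (locked_version, requested_version) pair."""
    packages = json.loads(lockfile_text).get("Packages", {})
    locked = {name: record.get("Version") for name, record in packages.items()}
    return {
        pkg: (locked[pkg], version)
        for pkg, version in requested.items()
        if pkg in locked and locked[pkg] != version
    }


def check_install(lockfile_text, requested):
    """Return True if the install leaves locked versions untouched,
    otherwise raise instead of proceeding (hypothetical abort behaviour)."""
    conflicts = version_conflicts(lockfile_text, requested)
    if conflicts:
        details = ", ".join(
            f"{pkg}: {old} -> {new}"
            for pkg, (old, new) in sorted(conflicts.items())
        )
        raise RuntimeError(
            f"installation aborted, would change locked versions: {details}"
        )
    return True


# Illustrative lockfile content; versions are examples only.
EXAMPLE_LOCKFILE = """
{
  "R": {"Version": "4.3.1"},
  "Packages": {
    "dplyr": {"Package": "dplyr", "Version": "1.1.2", "Source": "Repository"},
    "rlang": {"Package": "rlang", "Version": "1.1.1", "Source": "Repository"}
  }
}
"""
```

Packages that are not in the lockfile at all are let through in this sketch; whether those should also block installation is a policy question the thread leaves open.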

Xyarz commented 1 year ago

Initial feedback from the group:

dgkf commented 1 year ago

Thanks for capturing notes @Xyarz - definitely a better recap than I would have come up with trying to remember all the comments afterwards.

I think this one is quite a necessary one to discuss in next steps:

> Where do those numbers/scores come from (unknown origin, missing specificity control)

I think it would be amazing if we could steer some consensus on a "pharma-ready base image" that could be used for this type of package inspection, which might improve reproducibility, but I think that is quite out-of-scope for this little POC. Perhaps a follow-up or collaboration with R-Hub?

Xyarz commented 1 year ago

Sure! I fully agree. Definitely a huge point to tackle, but also a very vital one. My thoughts were going in the direction of the riskscore pkg, as we would have a CRAN snapshot with an renv snapshot as well as a timestamp to base the calculation on. Other than that, I could see this being based on renv snapshots in general; then all the system specs are well defined, transparent, and reproducible. A pharma-ready base image would be very nice indeed! I could definitely see a collaboration with R-Hub there.

Stefan-Doering-BI commented 1 year ago

Hey @dgkf, I must admit I still have too vague an idea of what the MVP would be, mostly around the question of when this really becomes viable for the user.

Is this MVP/filter ability meant to be used by the company/analyst or the "Regulatory Repo" itself, or maybe both? If the vision is a "pharma-ready base image", then I would imagine the "Regulatory Repo" using the filter as a very crude decision gate on whether a pkg is even eligible to be included in the image.

If the customer is (also) the company/analyst then I'm not sure whether the current prototype is already viable. Because as a company I would mostly like to filter maybe based on the intended-use, whether a package is a dependency or directly used and based on the validation documentation available.

I think some of these aspects are mentioned in "Future work", is this planned for this MVP's increment or future in the sense of the next product iteration?

dgkf commented 1 year ago

@Stefan-Doering-BI All really great questions. This is definitely the most abstract of the pilots, and it covers two rather distinct features: client-side filtering of repositories and pre-install vulnerability scanning.

> Is this MVP/filter ability meant to be used by the company/analyst or the "Regulatory Repo" itself

In my mind, having "client-side" filtering of the repository allows us to be less opinionated in the repo while still allowing those opinions on risk tolerance to be expressed by the company/analyst.

For example, the repo can still contain a newly released package with almost no testing - likely considered a "high risk" package. Maybe for exploratory purposes, that's fine, so the company/analyst can choose to set a filter that permits even these high-risk packages.
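As a rough illustration of that split (the repo stays unopinionated, the client applies its own tolerance), here is a small Python sketch. It is not the prototype's code, which is in R. The index entries, the `risk` field on a 0 to 1 scale, and the `make_risk_filter` helper are all assumptions made up for the example.

```python
def make_risk_filter(max_risk):
    """Build a client-side filter that keeps packages at or below max_risk.

    The repository index is left untouched; each client chooses its own
    threshold (hypothetical 0-1 risk scale, higher means riskier).
    """
    def keep(index):
        return [pkg for pkg in index if pkg["risk"] <= max_risk]
    return keep


# Hypothetical repository index; names and scores are invented.
INDEX = [
    {"name": "stablepkg", "risk": 0.1},
    {"name": "midpkg", "risk": 0.5},
    {"name": "brandnewpkg", "risk": 0.9},
]

# Two clients looking at the same repo with different risk tolerances.
validated_work = make_risk_filter(0.3)
exploratory_work = make_risk_filter(1.0)

validated_names = [pkg["name"] for pkg in validated_work(INDEX)]
exploratory_names = [pkg["name"] for pkg in exploratory_work(INDEX)]
```

Here `validated_names` contains only the low-risk package, while `exploratory_names` still includes the newly released, high-risk one.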

> Because as a company I would mostly like to filter maybe based on the intended-use

Definitely agree! There are plenty of other criteria we might consider. As a POC, I just chose a few that are unambiguous. We could imagine using CRAN task views (and dependencies) to filter by intended use, or introduce more pharma-specific use categories.

> whether a package is a dependency or directly used and based on the validation documentation available.

The relationship among dependencies is definitely something I want to explore. The way that risk tolerance cascades through a dependency tree varies across companies (judging from our case study discussions). Some companies hold all packages (including dependencies) to a risk assessment; some only care about the surface-level, user-facing packages. Personally, I'm interested in mocking up criteria that apply filters across the dependency tree as well.
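The two company policies described above can be sketched as follows. This is a Python illustration of the concept only, not the prototype's R code; the policy names `"all"` and `"surface"`, the package names, the risk scores, and the `permitted` function are hypothetical.

```python
def transitive_deps(pkg, deps):
    """Collect every direct and indirect dependency of pkg (cycle-safe)."""
    seen = set()
    stack = list(deps.get(pkg, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(deps.get(dep, []))
    return seen


def permitted(pkg, risk, deps, max_risk, policy="all"):
    """Decide whether pkg passes the risk filter.

    policy="all":     pkg and all of its transitive dependencies must pass.
    policy="surface": only the user-facing package itself is checked.
    """
    if policy not in ("all", "surface"):
        raise ValueError(f"unknown policy: {policy!r}")
    to_check = {pkg}
    if policy == "all":
        to_check |= transitive_deps(pkg, deps)
    return all(risk[name] <= max_risk for name in to_check)


# Invented example: "report" is low risk itself but pulls in a risky parser.
RISK = {"report": 0.2, "plots": 0.2, "parser": 0.8}
DEPS = {"report": ["plots"], "plots": ["parser"]}

surface_ok = permitted("report", RISK, DEPS, max_risk=0.5, policy="surface")
all_ok = permitted("report", RISK, DEPS, max_risk=0.5, policy="all")
```

With these made-up numbers, `surface_ok` is `True` and `all_ok` is `False`: the same package is acceptable to a company that only assesses user-facing packages and rejected by one that holds the whole tree to the threshold.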

> I think some of these aspects are mentioned in "Future work", is this planned for this MVP's increment or future in the sense of the next product iteration?

I think it would be worthwhile to show that dependency-based filters can be implemented (I'm very confident they can be), but I think the key next steps will be taking what we've learned and starting to think about what our end product looks like. That is to say, with these features in mind, do we want to commit to building and hosting a package repo? Do we want to just build a database that can be merged with existing repos? How do the other POC's trust data and issue reporting factor in?