cran-task-views / ctv

CRAN Task View Initiative
82 stars 13 forks source link

CTVsuggest_ignore line in yaml of source .md file #51

Closed DylanDijk closed 1 year ago

DylanDijk commented 1 year ago

In https://github.com/DylanDijk/CTVsuggest/issues/2, @billdenney highlighted that it would be beneficial to take note of packages that have been suggested by CTVsuggest but are found to not to be of interest once evaluated by a Task View maintainer.

Bill proposed to add the packages in the yaml header of the CTV source .md file like:

CTVsuggest_ignore: ['package1', 'package2', 'package3']

This would work for me, but wanted to open this issue here as it is regarding the change of the Task View source file.

wibeasley commented 1 year ago

I like how that would reduce the number of internal documents/notes we need to maintain a view. Especially if we could list one package per line, with a reminder/explanation why

CTVsuggest_ignore: 
- 'package1' # not of interest because it's the wrong kind of cloud (2021-07-03 -Pachá)
- 'package2' # removed because package4 replaced it (2021-07-03 -Pachá)
- 'package3' # not recommended because it uses old security protocols. Reevaluate package if TLS 1.3+ is used. (2022-05-28 -Will)
zeileis commented 1 year ago

I like the idea of an ignore list for CTVsuggest() but I wouldn't put it into the .md. First, the list could become rather long, especially when comments are included as well. Second, the format of the package ignore listing is then different from the include listing. Third, it might be perceived as a kind of "blacklist", especially by the authors of the packages themselves.

Instead, I would recommend that there is a separate .R script, say CTVsuggest.R with content like:

remotes::install_github("DylanDijk/CTVsuggest")
CTVsuggest::CTVsuggest(taskview = ..., n = 10, ignore = c(
  "package1", # not of interest because it's the wrong kind of cloud (2021-07-03 -Pachá)
  "package2", # removed because package4 replaced it (2021-07-03 -Pachá)
  "package3"  # not recommended because it uses old security protocols. Reevaluate package if TLS 1.3+ is used. (2022-05-28 -Will)
))

This can then simply be sourced to obtain the suggestions including the ignore list.

billdenney commented 1 year ago

@zeileis, if the list gets long, then that suggests that the CTVsuggest tool is making too many irrelevant suggestions-- or that the CTV itself is very broad. Regardless, if the list is long, without a filter, I would stop using the CTVsuggest function because then I would have to wade through the irrelevant list too often. I'd rather the function do that for me.

For the concern about blacklisting, I would hate to give someone that feeling. So, perhaps we could work on that through naming it better like, "CTV_suggest_not_relevant". Then, combined with a clear rationale as @wibeasley suggested, it could at least feel like the author had the reason they were viewed as not relevant.

Has there been an issue in the past where CTV maintainers abused their position suggesting that the blacklist concern is precedented?

zeileis commented 1 year ago

If the list of packages to ignore gets long, it reflects that the inclusion/exclusion criteria are sharper than what is captured by the multinomial model in CTVsuggest. I don't think there is any problem with that.

So I agree that storing a filter somewhere is useful, I just wouldn't clutter up the .md with it but put it in a dedicated separate file.

Regarding potential abuse by task view maintainers: I wouldn't be aware of such cases. But we certainly had cases where the package authors had preferred if their package were included rather excluded. The cases I can think of, off the top of my head, were typically cases where the decision was made to only list a package in one task view rather than list it in multiple ones.

billdenney commented 1 year ago

I think that having all the information in one place is easier than having it in two. That would simplify maintenance and sharing the load of maintaining the view. And, having it in the task view allows potential improvement of the model.

But, I maintain neither the package nor the CTV infrastructure. So, I won't push back any more.

DylanDijk commented 1 year ago

Having the separate R script would work, as long as the R script is in the corresponding Task Views repository, then as Achim mentioned this can then be sourced.

However, I have been thinking that an alternative to filtering out the "ignored" packages from suggestions, would be to use the packages that are listed as ignored and add these to the "None" category in the model training. These packages will then no longer be suggested, and additionally this could improve the model performance.

If this alternative method is used, then an ignore argument for CTVsuggest() would not be needed. The ignored packages just need to be listed in some format in the Task View repository, in a separate file or in the source file.

For the other feature idea https://github.com/DylanDijk/CTVsuggest/issues/3, would it be OK to do as proposed and add the threshold to the source file? Or would we want to separate this as well. How about having a CTVsuggest .md file where different arguments can be listed in a standardised format?

zeileis commented 1 year ago

Hmm, the fact that one of the task views chooses not to include a package that is similar to packages it has already included does not necessarily imply that this should not go into any task view. For example, CTVsuggest currently proposes that GVARX should be included in the Econometrics task view. After looking at it, I think it would be better to put it into TimeSeries (in "Multivariate Time Series Models > Vector autoregressive (VAR) models"). So moving this to the None category wouldn't be a good solution.

Overall, I think it is important that CTVsuggest is a tool that can help discover potentially relevant packages but the assessments and decisions always have to come from the maintainers. So I would rather show too many proposals rather than excluding too much.

And this is also part of the reason why I think it's better to have the CTVsuggest settings in a separate place and not wired into the main .md file.

wibeasley commented 1 year ago

@DylanDijk, I agree with @zeileis's statement, "the fact that one of the task views...". The decision to ignore it from the final task view would probably have to be a different boolean flag than to keep/retain it for training. I'm guessing that's more trouble than it's worth for a performance improvement (and it's error prone for task view maintainers).

But you have a better feel for than than I do. I'm really impressed by your function's foundational idea of suggesting package candidates.

DylanDijk commented 1 year ago

@zeileis Ok, that makes sense. In addition, I would just like to highlight that a downfall of the model used is that it is not a multi-label classifier. A probability vector is outputted for each package, hence it is not possible for a package to have a large predicted probability for multiple Task Views. So, if GVARX is suggested for Econometrics it will not appear as a top suggestion for TimeSeries.

@wibeasley I agree, having two flags, one to list packages to ignore from suggestions and one to list packages to use in training, would be too much. For the package to then use these flags, however, would be quite straightforward. Thanks.

So, can we conclude that we will use https://github.com/cran-task-views/ctv/issues/51#issuecomment-1481857100

Then, for https://github.com/DylanDijk/CTVsuggest/issues/3, this would be a single line and will not affect suggestions or model training, so I think having this in the source .yaml file would work?

DylanDijk commented 1 year ago

Ok, the package is now updated, so with a new install you can write:

CTVsuggest::CTVsuggest("Econometrics", ignore = c(
  "GVARX" # better suited for TimeSeries TV
 ))

You should get the following output:

            Econometrics  Packages
JFE          0.9982844       JFE
iClick       0.9945763    iClick
lboxcox      0.9939188   lboxcox
wildrwolf    0.9934578 wildrwolf
gbmt         0.9932194      gbmt
zeileis commented 1 year ago

Thanks, Dylan @DylanDijk, very much appreciated! I think this should be good enough for now and we can close the issue. Maybe we can return to it in a few months when the different task view maintainers have more experience working with it.