Support for Sample Account

pmorrill commented 5 years ago

I would probably want the sample account to be restricted, so people don’t use it for accessing level 5 datasets.

Perhaps we can make it so unauthenticated calls only can see the sample collection? That way, we don’t need to set up a sample login and/or password? If we set the sample collection to level 5, then people with authenticated calls can also access that resource.

Paul would need to make an exception in the API to allow the sample account to by-pass authentication, and make it so the list of collections available when not sending a token is limited to the sample.

Does that make sense?

steffilazerte commented 5 years ago

My only request would be that the sample account has actually 2 collections, so I can use it in my automated tests to ensure we test more advanced functionality (like cycling over two collections).

denislepage commented 5 years ago

We can certainly have more sample codes, at least for the purpose of testing.

steffilazerte commented 5 years ago

What exactly was the final verdict here? Do all calls need to be authenticated? Do we have a sample account (or should I set that up?)

If all calls need to be authenticated, I will have to make some changes to my examples and unit testing. Not a problem, but something we should make sure happens sooner rather than later.

denislepage commented 5 years ago

Yes, I would like all data download calls to require a NatureCounts authentication, but I would be happy to find a simple solution for the sample datasets.

Authentication should not be required for the initial query that returns a list of collection. As mentioned before, I think that first call should return all possible matching datasets that are level 3 or higher, regardless of the user, as well as any datasets that the user has been granted access, including those below level 3.

For the initial call:

unauthenticated requests and authenticated requests for users without specific permissions return the same thing: a list of collections available, with the access level and the number of records matching the query.
authenticated calls for users with READ role or with collection-specific permissions should return the same thing, PLUS any additional collections matching his/her access permissions.

On the database side, I think the current proc can be used without modifications:

For unauthenticated calls: EXEC [prSelectCollectionsPermissions] @collection = null, @user_id = NULL, @min_akn_level = 3

For authenticated calls: EXEC [prSelectCollectionsPermissions] @collection = null, @user_id = ?, @min_akn_level = 3

In both cases, the full list of collections with akn_level >= 3 (currently 165), plus any matching the user permissions, is returned with the following fields:

akn_level : value between 0 and 5 matching the collection level.
coll_user_status: collections-specific value for the user (ranging between 0 and 9). For reading, anyone with a value >= 7 should be granted access. If a user has READ role, I am assigning the value 7 if the level of access associated with the role matches the collection.
nc_user_status: NatureCounts wide (single value per user, ranging between 0 and 3), indicating whether the user is a registered NC user, a coordinator or an administrator. Given the new role approach, we probably only care whether the value is 0 or greater than 0.

In the response, I would also include a field called has_access, which signals to the user whether their current permissions grant them access, and the number of records counted from bmde_data:

has_access = (nc_user_status > 0 AND (akn_level == 5 OR coll_user_status >= 7))

Steffi’s code would then only try to loop through collections with has_access = true, but the user would also see the list of possible matches available for which they can potentially request permission.

For requests with requestID’s, I have also experimented a different proc that allows returning the list of collections approved for a request, though I am not sure if that will be useful or not, given that you already have to load the request details from the database anyway. A possible benefit is that it combines permissions that the user may already have on the specific collections, even when the request status is not 9, and limits the list of collections to those included in the request on the database side.

EXEC [prSelectCollectionsPermissionsRequest] @user_id = 1081067, @request_id = 151443

Again, the same rule about access would apply:

has_access = (nc_user_status > 0 AND (akn_level == 5 OR coll_user_status >= 7))

An interesting option is that the user_id doesn’t have to match the one assigned to the request ID. As a coordinator, I could therefore use my own login to download data for a specific requestID made by someone else. I would not expose the data request details to the public (e.g. filters and narrative), but showing them a list of returned collections and number of records should be OK for all users, I would think.

EXEC [prSelectCollectionsPermissionsRequest] @user_id = 48001, @request_id = 151443

All that doesn’t yet address the sample datasets, which I realize was the original point being asked… :-/

denislepage commented 5 years ago

Thinking about this a bit more, based on my previous post, perhaps the preferred/simpler option for a sample account would be to have a user_id with a simple name and password, to which we can give explicit permissions on the sample collections, a bit like we do in Motus.

I think that if I set the AKN level of the sample collections to zero, then they would generally not be included in any of the queries unless someone used the sample login.

The only thing we will probably want to do on the API side is ensure that the sample account isn’t used to access level 5 collections, so that will require an exception.

We’d also have to decide whether a query using the sample login would initially also return all level 3 datasets, but with “has_access=false”, or remove those from the results. In my previous note, I already suggested that the list of collections and record count should be available for all calls, including the unauthenticated ones. I think I can set the user profile so the nc_user_status returned by the proc is set to 0 and the user cannot log in through the web interface.

I have already created the following collections:

SAMPLE1 SAMPLE2

I have added 1000 records each (taken from EBIRD Ontario). We can play with the content later if we want to highlight something specific with examples, etc.

I also have a user with login name “sample” and password “sample”, which should be restricted from doing anything on the web site (user_id = 1030008). We can give it another name or password if needed.

Paul, if you use this, like for any queries:

exec prSelectCollectionsPermissions @collection = null, @user_id = 1030008, @min_akn_level = 3

The list of collections should be the same as for an authenticated query (nc_user_status = 0), except for the 2 sample datasets (coll_user_status = 7).

The access condition (for all queries) would therefore needs to change as follow:

has_access = (nc_user_status > 0 AND akn_level == 5) OR coll_user_status >= 7

The only difference here is that we would allow users with nc_user_status = 0 to be explicitly listed on individual collections, but that would be rare (likely limited only to the sample account).

As long as you always validate the user credentials before calling this proc, that should work as intended.

steffilazerte commented 5 years ago

Okay, some of this looks relevant for me, but some not. Let me see if I understand my role in this!

In the R package naturecounts...

Users must supply a username Even if that's just the 'sample' username and 'sample' password (or what ever we end up with)
Currently, when users download data, only data they have access to is displayed. However, I should add an edit so that data they do not have access to should also be displayed so they know which collections are available. This might look something like this:

> d <- nc_data_dl(username = "steffilazerte", collections = c("BCATLAS1BE_DO", "BBS"))
Using filters: collections (BCATLAS1BE_DO); fields_set (BMDE2.00-min)
Collecting available records...
     collection   nrecords   has_access
1           BBS    5735895        FALSE
2 BCATLAS1BE_DO      41412         TRUE

Downloading records for each collection:
  BBS 
    Skipping BBS, you do not have access to this collection...
  BCATLAS1BE_DO
    Records 1 to 5000 / 41412
    Records 5001 to 10000 / 41412
    Records 10001 to 15000 / 41412
    Records 15001 to 20000 / 41412
    Records 20001 to 25000 / 41412
    Records 25001 to 30000 / 41412
    Records 30001 to 35000 / 41412
    Records 35001 to 40000 / 41412
    Records 40001 to 41412 / 41412
>

The only problem with this is that it will mean downloading the counts for all possible data which is much slower than downloading the counts for a single user (at least for the first time, after which I think it's cached)

nc_count(username = "steffilazerte", show = "all") Counts all data and takes ~30s
nc_count(username = "steffilazerte") Counts data accessible to 'steffilazerte' and takes ~5s

This might be frustrating for small downloads or for users with minimum access.

Have I understood?

denislepage commented 5 years ago

Sorry, I think some of my comments are borne of an incomplete understanding of what you have already done in R, and which function we are talking about.

I didn’t expect the count to be much slower for a full list vs. the user-list, but that may depend on which collections are being counted and their size.

This query returns in 0.3 seconds vs. 0.2 seconds when limited to BCATLAS. Did you try on multiple instances?

select collection, COUNT(*) from bmde_data where collection in ('BBS', 'BCATLAS1BE_DO') group by collection

The query that counts everything without any filters is indeed slower (anywhere between 4 and 6 s), but that is not surprising. I got 14 seconds once, or a busy server may occasional stall the response.

select collection, COUNT(*) from bmde_data group by collection

When I use the matching nc_count function, I get similar results (4-6 s):

nc_count(show=”all”)

A couple points:

I see no reason for the nc_data_dl function to use the show=all option.
All calls to the nc_data_dl should have a username
I would like the nc_count function to retain the show=”all” option. It won’t make a big difference whether this is implemented in R and/or the API, assuming the SQL count tallies a count based on provided filters, and then filter the response based on permissions.
Calls to nc_count can continue to be made without authentication, but this won’t provide information on user-specific permissions (has_access will be false for all collections when no user is identified, and the R users should probably be warned.
Show=all only makes sense for authenticated calls. I think we should set the show=all option for all calls without a username, otherwise they will always empty results.

pmorrill commented 5 years ago

Just so everyone is aware: I am not ignoring this issue, but trying to finalize other details in the api. However, using a 'sample' account like we do in Motus suites me.... it seems simplest. I need to wade through the content above before commenting further.

steffilazerte commented 5 years ago

The most recent version of naturecounts

nc_data_dl requires a user name for downloading data (and uses 'sample' in all examples)
for testing purposes, a password (sample) is supplied if the username is ever 'sample' (this happens internally and out of sight of the user). All other usernames will be prompted for a password
nc_count can take a user name for counting available records. With no username, it reverts to show = "all", shows all record counts (and tells the user).

So unless there are bugs, I think we're done with this issue?

pmorrill commented 5 years ago

I have adjusted the api so that it no longer registers data download by the 'sample' account. This should prevent DataRequests from being saved when using sample.

steffilazerte commented 5 years ago

Awesome, I think we're good.

steffilazerte commented 5 years ago

Hmm, I spoke too soon. When I use the sample account to download data, the final call to the "release_request_id" results in the API sending me an error message: An error has occurred; java.lang.NullPointerException

I can fix this on the R client side (skip the release_request_id if the user is "sample"), but it might be better to fix this on the API side, what do you think?

pmorrill commented 5 years ago

I know what this is, and will fix it in the code on Monday. thanks.

pmorrill commented 5 years ago

I put a fix in for this problem, but I am not sure if the code has been deployed yet, since Denis is in Japan.

You could try a test though! Any request, where no actual get_data call is made, used to generate this null pointer exception. So, if you create a requestId using list_collections, then immediately try to 'release' that id, this error results.

The 'sample' account show similar problem, even when data was downloaded. This is because I made the change so that 'sample' downloads are not recorded into the database: they look like no data was downloaded, in the server code.

denislepage commented 5 years ago

I have posted the latest JAR to the sandbox site. I can post to live whenever. I am assuming the R package still currently relies on the sandbox API?

denislepage commented 5 years ago

I can post the fix today.

-------- Original message -------- From: pmorrill notifications@github.com Date: 5/31/19 02:52 (GMT+09:00) To: BirdStudiesCanada/NatureCountsAPI NatureCountsAPI@noreply.github.com Cc: Denis Lepage dlepage@bsc-eoc.org, Comment comment@noreply.github.com Subject: Re: [BirdStudiesCanada/NatureCountsAPI] Support for Sample Account (#17)

I put a fix in for this problem, but I am not sure if the code has been deployed yet, since Denis is in Japan.

You could try a test though! Any request, where no actual get_data call is made, used to generate this null pointer exception. So, if you create a requestId using list_collections, then immediately try to 'release' that id, this error results.

The 'sample' account show similar problem, even when data was downloaded. This is because I made the change so that 'sample' downloads are not recorded into the database: they look like no data was downloaded, in the server code.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/BirdStudiesCanada/NatureCountsAPI/issues/17?email_source=notifications&email_token=AFBB6DU3GY3H5U4BEYX6ZI3PYAH6XA5CNFSM4HDTUBNKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWTAZNQ#issuecomment-497421494, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AFBB6DWS7URX3L64C6PLLYLPYAH6XANCNFSM4HDTUBNA.

steffilazerte commented 5 years ago

@denislepage yup, still using sandbox in the R client @pmorrill I think that's done it. At least I dont' get the error when clearing the request using the sample account. Thanks!

BirdsCanada / NatureCountsAPI

Support for Sample Account #17