OpenMined / opus

Apache License 2.0

Determine architecture for a web scraping service for identity verification #23

Open carrollgt91 opened 4 years ago

carrollgt91 commented 4 years ago

If we rely only on SSO-based API integrations, we won't be able to provide very strong guarantees of identity. Unfortunately, most of these APIs do not expose information that the provider has verified in any meaningful way.

However, thanks to GDPR and other legislation, almost all user data has to be accessible via at least a web UI in order for applications to comply with the restrictions placed on them.

In the long run, it would be great to have a more direct way to access this information, but in the short run, we can build tools for web-scraping that will help us get much more meaningful identity verification.

There are a number of architectural decisions to make here. Let's discuss potential ways of accomplishing this here. I've reached out to a friend who has recently built a fairly sophisticated client-side web scraping system that will hopefully be able to provide further insight.

Architecture proposal

For web-based users: We could build a Chrome/Firefox extension that connects to our backend servers via WebSocket. This would allow us to "drive" the user through the necessary steps to grab information from third-party services. We'd definitely want to think through fraud prevention there, as blindly trusting client-side data is rarely a good idea. Our best bet would be to use the extension to effectively "sniff" the API requests that provide the information to the page (I'm not sure exactly how this is done, but I have it on good authority that it is possible, at least with Google Chrome extensions).
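To make the "drive" idea concrete, here is a rough server-side sketch of how steps might be handed to the extension one at a time. Everything in it is an assumption for illustration: the step names, the JSON message shape, and the `ScanSession` class are hypothetical, and the actual WebSocket transport is left out. The per-step nonce only ties each reply to the instruction that was just sent; it does not make the client-reported data itself trustworthy.

```python
import hmac
import json
import secrets
from dataclasses import dataclass, field

# Hypothetical step names; the real list would depend on the third-party service.
SCAN_STEPS = [
    "open_login_page",
    "wait_for_user_login",
    "open_account_page",
    "capture_profile_response",
]


@dataclass
class ScanSession:
    """Server-side state for one user-initiated scan (illustrative only)."""

    steps: list = field(default_factory=lambda: list(SCAN_STEPS))
    index: int = 0
    nonce: str = ""

    def next_instruction(self) -> str:
        """Build the JSON message the server would push to the extension."""
        if self.index >= len(self.steps):
            return json.dumps({"type": "done"})
        # A fresh nonce per step lets the server reject stale or replayed replies.
        self.nonce = secrets.token_hex(16)
        return json.dumps(
            {"type": "instruction", "step": self.steps[self.index], "nonce": self.nonce}
        )

    def handle_reply(self, raw: str) -> bool:
        """Advance only if the reply matches the current step and nonce."""
        if self.index >= len(self.steps) or not self.nonce:
            return False
        msg = json.loads(raw)
        if not isinstance(msg, dict):
            return False
        step_ok = msg.get("step") == self.steps[self.index]
        nonce_ok = hmac.compare_digest(
            str(msg.get("nonce", "")).encode(), self.nonce.encode()
        )
        if not (step_ok and nonce_ok):
            return False
        self.nonce = ""  # each nonce is single-use
        self.index += 1
        return True
```

In a real deployment, `next_instruction()` would be sent over the socket and `handle_reply()` called on each incoming message; a rejected reply would likely end the scan rather than be silently ignored.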

For mobile users: This will be a poorer UX for sure, but for accessibility purposes, it's badly needed. We will need a react-native application that renders a webview and then accomplishes the same goals as the extension above via more "usual" web-scraping techniques (reading the HTML). The good news is that within a signed mobile application, it's much harder to fake the data we're retrieving, so there's less need for extensive fraud prevention measures. In addition, it might similarly be possible to sniff web requests within an iOS/Android webview, which would make our scrapers much more resilient. Either way, it would similarly need to be driven by a real-time connection to our servers.
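As a small illustration of the "reading the HTML" approach, the sketch below pulls the text of a single element out of a page using Python's standard `html.parser`. The element id `account-email` and the sample markup are made up; a real third-party page would need its own selectors, and this simplified parser stops capturing at the first closing tag, so it does not handle nested markup inside the target element.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collect the text of the first element whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self._capturing = False
        self.value = None

    def handle_starttag(self, tag, attrs):
        # Start capturing only on the first matching element.
        if self.value is None and dict(attrs).get("id") == self.target_id:
            self._capturing = True
            self.value = ""

    def handle_data(self, data):
        if self._capturing:
            self.value += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the capture.
        self._capturing = False


def extract_field(html, target_id):
    """Return the stripped text of the element, or None if it is absent."""
    parser = FieldExtractor(target_id)
    parser.feed(html)
    return parser.value.strip() if parser.value is not None else None


# Hypothetical markup, for illustration only.
page = '<div><span id="account-email"> user@example.com </span></div>'
print(extract_field(page, "account-email"))
```

Scrapers like this break whenever the page layout changes, which is why intercepting the underlying API responses would be the more resilient option if it turns out to be feasible.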

chaitanyajun12 commented 4 years ago

Is this for client-side web scraping of identity information? Will this be possible, given that those web pages can only be accessed after authentication?

Moreover, this web scraping shouldn't happen randomly but should rather be triggered by specific events, I guess. Please correct me if I'm wrong. Even if it is possible, can we consider the user's identity 100% verified?

carrollgt91 commented 4 years ago

@chaitanyajun12 it will be possible with client-side scraping leveraging some user input - especially with browser-extension based scrapers, it's amazing what you can accomplish.

It should definitely be triggered by user-interaction and with full user consent. It will likely need to be a "scan" where the user will initiate it, get some validations with timestamps from the various accounts they have opted into scanning, and then it will finish.
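One way to picture the result of such a scan is a list of timestamped validations, one per check, covering only the accounts the user opted into. The sketch below assumes a hypothetical `Validation` record and a `checks` mapping from service name to named check functions; none of these names come from the Opus codebase.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Validation:
    """One timestamped check result for one service (hypothetical shape)."""

    service: str
    check: str
    passed: bool
    checked_at: datetime


def run_scan(opted_in, checks, now=None):
    """Run every registered check for the services the user opted into."""
    now = now or datetime.now(timezone.utc)
    results = []
    for service in opted_in:
        for name, check in checks.get(service, {}).items():
            results.append(Validation(service, name, bool(check()), now))
    return results


# Placeholder checks; real ones would inspect data gathered by the scraper.
checks = {
    "google": {"has_recovery_phone": lambda: True},
    "github": {"has_verified_email": lambda: False},
}

# The user only opted into scanning their Google account.
for validation in run_scan(["google"], checks):
    print(validation)
```

Services the user did not opt into are never touched, and the scan ends as soon as the listed checks have run.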

Re: 100% considering it identified - unfortunately, that's just not possible. It's all about it being "verified enough" for the use cases in question. For cases where you need 100% identity verification, you basically need to do a full background check - we're really targeting more everyday uses of identity verification, at least for now.

chaitanyajun12 commented 4 years ago

Got it @carrollgt91. Based on the user's consent, we will perform client-side scraping. I think we will suddenly have a whole lot of information at our disposal. But do we know what specific information we are looking for in that data? For instance, say I gave permission to this tool after logging into my Google account. During authorization, Google's SSO server will authorize us to use basic information like name, photo, contact list, etc. Apart from this, what other data points do we want to access to boost the identity score? Is it something like whether OTP verification has been completed?

carrollgt91 commented 4 years ago

That's a really interesting question, as unlike with OAuth-based SSO solutions, the server providing the sign-on (to use your example, Google's server) will not have any control over the data we're accessing. So even if they use SSO to grant API access, if we're scraping, we're able to access any information the user has access to during their interaction with the website. If we didn't need that information, we'd just be using the SSO piece in the first place. However, I think it's important that we provide a similar experience when it comes to letting users "opt in" to the data we download for a given service, so as to respect their wishes.

There are sort of two parts to this. One, how do we actually collect the data, and what is the experience for the user during the collection of said data?
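A minimal sketch of that kind of per-service opt-in might look like the following. The field names and the `consent` structure are hypothetical, and ideally the filtering would happen on the client before anything is sent to our servers.

```python
def filter_opted_in(scraped, consent):
    """Keep only the fields the user explicitly opted into, per service."""
    return {
        service: {
            name: value
            for name, value in fields.items()
            if name in consent.get(service, set())
        }
        for service, fields in scraped.items()
        if service in consent
    }


# Hypothetical scraped data and consent choices.
scraped = {
    "google": {"recovery_phone_verified": True, "location_history": ["..."]},
    "github": {"email_verified": True},
}
consent = {"google": {"recovery_phone_verified"}}

print(filter_opted_in(scraped, consent))
```

Here the location history and the entire GitHub entry are dropped because the user never opted into them.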

Two, how do we utilize that data to verify user identity? How do we expose that data to apps which integrate with Opus in a way that maintains the privacy of the user, and doesn't risk exposing sensitive information about them?
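One possible answer to the second part is to hand integrating apps only a coarse summary rather than any of the underlying data. The sketch below assumes each validation is a dict with a boolean `passed` key and that "verified enough" means meeting a per-app threshold of passed checks; both the shape and the threshold rule are assumptions, not a settled design.

```python
def summarize_for_app(validations, required_passes):
    """Reduce raw validations to a summary that carries no scraped values."""
    passed = sum(1 for v in validations if v["passed"])
    return {
        "verified_enough": passed >= required_passes,
        "checks_passed": passed,
        "checks_total": len(validations),
    }


# Hypothetical validations; the app never sees the raw values behind them.
validations = [
    {"service": "google", "check": "has_recovery_phone", "passed": True},
    {"service": "google", "check": "has_alternate_email", "passed": False},
]

print(summarize_for_app(validations, required_passes=1))
```

Even the counts could leak a little about the user, so a stricter variant might return only the boolean.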

chaitanyajun12 commented 4 years ago

@carrollgt91, sorry if I am repeating myself. As I understand it, the requirement here is:

We are trying to make sure that the user who is using an account is the actual owner and not someone who is impersonating or proxying on behalf of the actual user. If we are able to achieve this we can regard the user as 100% identified.

Please correct me if this is not the actual definition of identification for us. If this is the requirement, I don't think even Google does this. Let's say I am accessing my Gmail account: Google will not know whether it is actually me who is accessing the account. It just cares that the username and password match. To a certain extent, it can flag that the account is being used from a different location, or something of that sort.

Taking Google as the only example, we can go to Manage Google Account to find pages where the user is verified via mobile number, alternate email IDs, and location history data. We could probably use the alternate email ID data to check whether there is a match between the current account and the alternate one (although they could well be completely different accounts). But all this data only helps us see how verified the user is; it definitely does not identify the user.

If the requirement is for us to identify the user 100%, then, as mentioned earlier, biometric verification is the only thing I can think of.

One more thing is the UX side of it: why would a user allow their data to be scraped, especially when it is secure and private data?