ampproject / amphtml

The AMP web component framework.
https://amp.dev
Apache License 2.0

Add support for client-side sessionization #1612

Open msukmanowsky opened 8 years ago

msukmanowsky commented 8 years ago

Analytics vendors like Parse.ly benefit from client-side sessionization in their standard JavaScript tracker and would like to see this supported, if possible, in AMP.

Sessions are a grouping of time-ordered pixel events such that the time between any two events does not exceed some threshold (usually 30m).
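As a rough illustration of that definition, the sketch below splits a sorted list of event timestamps into sessions. It assumes "any two events" means consecutive events, and the function name `groupIntoSessions` is hypothetical, not something from Parse.ly's tracker or AMP.

```javascript
// Sketch only: split time-ordered event timestamps (ms since the Unix epoch)
// into sessions. A new session starts when the gap to the previous event
// exceeds timeoutMs. Assumes the input is already sorted ascending.
const THIRTY_MINUTES_MS = 30 * 60 * 1000;

function groupIntoSessions(timestamps, timeoutMs = THIRTY_MINUTES_MS) {
  const sessions = [];
  let current = [];
  for (const ts of timestamps) {
    const previous = current[current.length - 1];
    if (current.length > 0 && ts - previous > timeoutMs) {
      sessions.push(current);
      current = [];
    }
    current.push(ts);
  }
  if (current.length > 0) {
    sessions.push(current);
  }
  return sessions;
}
```

A gap of exactly 30 minutes stays in the same session here, since the definition says the gap must not exceed the threshold.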

As far as the Parse.ly implementation goes, the following session parameters are tracked and sent along with pixels:

| Name | Description | Example |
| --- | --- | --- |
| Session ID | Unique identifier for the session. An auto incrementing integer (0-indexed) that also tracks the total count of sessions. | 2 |
| Initial Session URL | Initial URL of the session (the URL of the first pixel request) | http://example.com/ |
| Session Referrer | document.referrer of the initial URL in the session | https://www.google.com/ |
| Session Timestamp | Time in milliseconds since the Unix epoch that the session started | 1453905846108 |
| Prior Session Timestamp | Time in milliseconds of the session prior to this one (or 0 if this is the first session for a user) | 1453842891129 |

Session information can be stored either using cookies or some other means like local storage.

The following pseudocode gives an example implementation of fetching client-side session parameters:

// Assume StorageEngine pulls relevant info from cookies/local storage

/*
 Assume StorageEngine.get('visitor') returns an object from long-lived storage (i.e. expires after 1 or more years of inactivity)
 {
     visitorId: ...,
     sessionCount: 2,  // == session ID
     lastSessionTimestamp: 1453842891129
 };

 Likewise, assume StorageEngine.get('session') returns an object from short-lived storage (i.e. expires after 30m of inactivity)
 {
    id: 2,  // == sessionCount
    url: ...,
    referrer: ...,
    timestamp: ...,
 }
 */ 
function getSession() {
    var visitor = StorageEngine.get('visitor');
    var session = StorageEngine.get('session');
    if (typeof session == 'undefined') {
        // New session, increment the session count
        visitor.sessionCount++;

        session = {
            id: visitor.sessionCount,
            url: document.location.href,
            referrer: document.referrer,
            timestamp: (new Date()).getTime(),
            // Start time of the previous session (0 if this is the first one)
            lastSessionTimestamp: visitor.lastSessionTimestamp
        };
        StorageEngine.set('session', session);

        // Remember when this session started so the next one can report it
        visitor.lastSessionTimestamp = session.timestamp;
        StorageEngine.set('visitor', visitor);
    } else {
        // Session already exists, but since we have activity, expiry should be extended by 30m
        StorageEngine.extendExpiry('session');
    }

    // Session variables can now be used in amp-analytics pixels
    return session;
}
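
The pseudocode leaves `StorageEngine` abstract. A minimal in-memory sketch of the three calls it relies on (`get`, `set`, `extendExpiry`) might look like the following. Everything here is an assumption for illustration: the factory name `createStorageEngine`, the extra `ttlMs` argument on `set`, and the injectable clock. A real engine would wrap cookies or local storage.

```javascript
// Hypothetical in-memory stand-in for the StorageEngine assumed above.
// Entries expire after ttlMs of inactivity; extendExpiry() slides the deadline.
function createStorageEngine(now = () => Date.now()) {
  const entries = new Map(); // key -> { value, ttlMs, expiresAt }
  return {
    get(key) {
      const entry = entries.get(key);
      if (!entry) {
        return undefined;
      }
      if (now() >= entry.expiresAt) {
        entries.delete(key); // expired, behave as if it was never stored
        return undefined;
      }
      return entry.value;
    },
    // ttlMs is an assumed third parameter; the pseudocode's set() takes two.
    set(key, value, ttlMs) {
      entries.set(key, { value, ttlMs, expiresAt: now() + ttlMs });
    },
    extendExpiry(key) {
      const entry = entries.get(key);
      if (entry && now() < entry.expiresAt) {
        entry.expiresAt = now() + entry.ttlMs;
      }
    },
  };
}
```

With this shape, the 'session' entry would be stored with a 30 minute TTL and the 'visitor' entry with a TTL of a year or more.
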
cramforce commented 8 years ago

@msukmanowsky @dvoytenko is currently working on the baseline primitive (the StorageEngine), but we cannot promise when this will be available, and we will initially make it available only for very low-entropy values for privacy reasons (e.g. a few booleans per domain). The embedded nature of AMP requires additional thought when using features such as storage. Whether we can provide a storage mechanism that can store user data (such as URLs and timestamps) is unclear right now.

I'd advise planning without such a mechanism in AMP for now.

msukmanowsky commented 8 years ago

No worries @cramforce, that is indeed our current plan. Thanks for the heads up.

querymetrics commented 8 years ago

Now that Storage is implemented, can we continue this discussion?

SOASTA is also interested in having "Session Ids" support in AMP. We would also like to have a way to track "Session Length" (A count of page views on the same domain). This would require the ability to store ints and a way to increment the value.

I understand that there are security concerns around storing arbitrary key/values in cookies. Could we come up with a set of metrics that are safe to share across analytics vendors that would be implemented with local storage and tied to the source url to reduce the risk of exposing data across publishers?

dvoytenko commented 8 years ago

@cramforce a restricted version of Storage is indeed available. Is it applicable here though?

cramforce commented 8 years ago

We primarily want to avoid too much storage per domain and would like to keep entropy low. For ints how big would they need to be?

querymetrics commented 8 years ago

@cramforce, for session length, an 8-bit int (or 3 digits in JSON) would be sufficient.

The Session Ids we use on standard pages are a UUID style string. If we were limited to storing small ints in AMP, we could use the ${ClientId} and append a small session int to it to create a unique session id.
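
A small sketch of that idea, assuming the client ID has already been resolved to a string. The helper name `makeSessionId`, the `.` separator, and the default upper bound of 255 are illustrative assumptions, not anything AMP or SOASTA defines.

```javascript
// Sketch: derive a session ID from a resolved client ID plus a small counter.
// maxCounter defaults to 255 (an assumed 8-bit limit) to keep entropy low.
function makeSessionId(clientId, sessionCounter, maxCounter = 255) {
  if (!Number.isInteger(sessionCounter) ||
      sessionCounter < 0 ||
      sessionCounter > maxCounter) {
    throw new RangeError('sessionCounter must be an integer in 0..' + maxCounter);
  }
  return clientId + '.' + sessionCounter;
}
```
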

cramforce commented 8 years ago

There is also PAGE_VIEW_ID if all you want is to group requests from the same page view.

querymetrics commented 8 years ago

PAGE_VIEW_ID has a range of 0-9999.

How about storing something like this?

"amp-analytics:session":{"v":[<session start timestamp>,<session id>,<session length>],"t":<latest storage save timestamp>}

Session id and session length could be in the 0-9999 range. Saving it in an array avoids extra keys and only uses one t timestamp. Doing the timeout check on the Storage.get avoids storing a timeout value in local storage.
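
To make the proposed record concrete, here is a sketch of reading and updating it on each page view. A plain object stands in for AMP's Storage, and the function name `touchSession`, the incrementing session ID, and the wrap-around at 10000 are assumptions for illustration; the comment above only fixes the record shape and the 0-9999 range.

```javascript
// Sketch: maintain {"v": [start, id, length], "t": lastSave} under one key.
// `store` is a plain object standing in for the real storage backend.
const SESSION_KEY = 'amp-analytics:session';
const SESSION_TIMEOUT_MS = 30 * 60 * 1000;

function touchSession(store, now, timeoutMs = SESSION_TIMEOUT_MS) {
  const entry = store[SESSION_KEY];
  // The timeout check happens on read, so no timeout value is stored.
  const expired = !entry || now - entry.t > timeoutMs;
  let v;
  if (expired) {
    const previousId = entry ? entry.v[1] : -1;
    v = [now, (previousId + 1) % 10000, 1]; // new session, first page view
  } else {
    const [start, id, length] = entry.v;
    v = [start, id, Math.min(length + 1, 9999)]; // same session, one more view
  }
  store[SESSION_KEY] = { v: v, t: now };
  return v;
}
```
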

I created a prototype implementation: https://github.com/ampproject/amphtml/issues/3048

cramforce commented 8 years ago

Before diving too deep into the pull request, I'd like to discuss things here. Unfortunately the PR will be relatively tough to land as changing the storage interface requires changing all its implementations, which can live in proprietary viewer code. Not impossible, but we need to make sure we are doing the right thing first.

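Taking a step back, if all you got was a session id that

- stays the same for a given site and timeout, or
- stays the same for a given site and what the viewer defines as a session (viewers are often single page apps, so they don't necessarily need to persist sessions outside of RAM),

would that work for you? This mechanism would still require calculating session length on the server.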

mattwelke commented 5 years ago

I see this issue was referenced from https://github.com/ampproject/amphtml/issues/12674 but I don't think grouping events from a particular page view helps with this. I work at an analytics vendor and we also currently use client side session stitching using cookies with a timeout set by a JS tracker. Because this isn't supported by AMP, we proceeded with adding AMP support by switching to group AMP beacons into sessions on the server. Having this instead though would make it much easier for us to implement.

In 2016, this was said:

Taking a step back, if all you got was a session id that

- stays the same for a given site and timeout, or
- stays the same for a given site and what the viewer defines as a session (viewers are often single page apps, so they don't necessarily need to persist sessions outside of RAM),

would that work for you?

Yes, this would work for our use case. It would need to be a macro like CLIENT_ID(<fallback_cookie_name>) where it would be something like SESSION_ID(<fallback_cookie_name>, <fallback_cookie_timeout>). It would need to be compatible with AMP Linker the same way CLIENT_ID is compatible with AMP Linker right now, so that sessions could be tracked across AMP to non-AMP journeys.

As for the format of the Session ID identifier, I see that in the original feature request from Parse.ly, they describe their format:

Unique identifier for the session. An auto incrementing integer (0-indexed) that also tracks the total count of sessions.

For us, at GroupBy, we just need an ID with enough entropy to be unique across all page views. We use UUID generating libraries to implement this. I feel like our style would be simpler to implement, but I understand that the AMP Project must take into consideration how the internet at large wants to operate, so if it would be better to slowly implement this in a way that enables as many session tracking use cases as possible while respecting user privacy, I wouldn't mind being patient and helping to consult with the new feature.

mattwelke commented 5 years ago

@cramforce Tagging for visibility.

cramforce commented 5 years ago

@zhouyx Do you want to reconsider this?

zhouyx commented 5 years ago

From my understanding, we need a solution to recognize requests from the same session. Right now we have PAGE_VIEW_ID and the high-entropy PAGE_VIEW_ID_64 that's under active development. One should be able to group requests on the server side, which in my opinion is preferred because doing it server side improves client-side performance.

The only thing that I don't see here is the customization to the session definition. As @cramforce mentioned, session today is what the viewer defines as a session. There is no way to customize to 30 mins or so.

Would like to get more feedback on how important customizing the session time is for your use case. Thanks all.

mattwelke commented 5 years ago

@zhouyx So in our case, session plays a role that may be different compared to most analytics vendors.

We group received beacons into sessions before storing them to perform analysis on them. This is where grouping them server side is simple. We would just maintain a database of CLIENT_IDs seen so far, where we generate a session ID for each CLIENT_ID, expiring in 30 minutes or whichever time we feel is good in the future.
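
A minimal sketch of that server-side grouping, using an in-memory map in place of the database. The name `createSessionizer`, the sliding 30 minute expiry, and the injectable ID generator are assumptions for illustration, not GroupBy's actual implementation.

```javascript
// Sketch: map each CLIENT_ID to a session ID that expires after a period of
// inactivity. A Map stands in for the database mentioned above.
const { randomUUID } = require('crypto');

function createSessionizer(timeoutMs = 30 * 60 * 1000, newId = randomUUID) {
  const byClient = new Map(); // CLIENT_ID -> { sessionId, lastSeen }
  return function sessionFor(clientId, now = Date.now()) {
    const entry = byClient.get(clientId);
    if (entry && now - entry.lastSeen <= timeoutMs) {
      entry.lastSeen = now; // activity keeps the session alive
      return entry.sessionId;
    }
    const sessionId = newId(); // first beacon, or previous session expired
    byClient.set(clientId, { sessionId: sessionId, lastSeen: now });
    return sessionId;
  };
}
```

Each incoming beacon would call `sessionFor(clientId)` and be stored under the returned session ID.
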

Where this gets difficult is the 2nd thing we do with session. We also provide a search service via HTTP API. We want to add personalization features to this search, first at the CLIENT_ID level and eventually at the session level, where the beacons collected influence the search results. This means the client would have to provide the session ID when they call our search API. Just grouping beacons into sessions on the server won't work for this, because the client needs to be aware of the session their beacons are going to be grouped into, so that their requests to the search API have that same session ID.

A workaround we thought of for doing this with AMP was to switch to setting cookies on the server instead of on the client. The beacons from amp-analytics and the requests to our search API all go to the same domain, so we could have every front end HTTP service sync with one database and set a session ID cookie. That cookie would be included implicitly in the future beacon requests and search API requests. Everything would be synced up.

But, since multiple front end services would potentially set the session ID cookie, we'd need to synchronize the operation of fetching or setting the session ID. We'd use something fast like Redis, but we still have scalability concerns, and it would require changing our entire architecture to enable collecting AMP beacons. This is why client side sessionization, for us, is a simpler option.

As @cramforce mentioned, session today is what the viewer defines as a session. There is no way to customize to 30 mins or so.

I think I have a gap in my AMP knowledge here. Can you point me at some materials that could explain this "viewer" concept to me? It might help me understand how setting cookies is done in the AMP runtime and what constraints exist.

zhouyx commented 5 years ago

I see. Thank you @welkie for the explanation.

Just grouping beacons into sessions on the server won't work for this, because the client needs to be aware of the session their beacons are going to be grouped into, so that their requests to the search API have that same session ID.

I'm a bit curious about the implementation here. Specifically how long is each session. Also how to figure out the session their beacons are going to be grouped into?

Can you point me at some materials that could explain this "viewer" concept to me? It might help me understand how setting cookies is done in the AMP runtime and what constraints exist.

You can think of AMP Viewer as a container to display one or many AMP documents. The viewer determines what's a session for the embedded AMP documents. Here is a good doc about the AMP Viewer integration with AMP docs.

Setting cookies can be a bit tricky with embedded AMP documents due to 3rd party cookie restrictions. When embedded in a viewer, AMP docs are served in a cross-domain iframe, which introduces 3rd party cookie restrictions.

Because of the cookie restrictions, the only option I see here is to use localStorage.

mattwelke commented 5 years ago

I'm a bit curious about the implementation here. Specifically how long is each session. Also how to figure out the session their beacons are going to be grouped into?

How we implement this right now for non-AMP sites is we have a JavaScript tracker client which sets cookies when its functions are called by the page's scripts (cookies set on that site's domain, not ours). The functions relate to sending beacons, like sendViewProduct() and sendAddToCart(). One cookie is our "Visitor ID", which tracks who the visitor is; that cookie expires in 5 years. The other cookie is our "Session ID", which tracks the actions they perform on the site, with the cookie expiry set to 30 minutes and pushed forward by another 30 minutes for every additional action on the website (e.g. viewing a product).
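
A sketch of the two cookies described above, written as cookie strings that a tracker could assign to `document.cookie` on every action. The cookie names `visitor_id` and `session_id` and the helper names are hypothetical; only the 5 year and sliding 30 minute lifetimes come from the description.

```javascript
// Sketch: build first-party cookie strings with the lifetimes described above.
const FIVE_YEARS_S = 5 * 365 * 24 * 60 * 60;
const THIRTY_MINUTES_S = 30 * 60;

function buildCookie(name, value, maxAgeSeconds) {
  return encodeURIComponent(name) + '=' + encodeURIComponent(value) +
      '; Max-Age=' + maxAgeSeconds + '; Path=/';
}

// Re-issuing the session cookie on every action pushes its expiry forward by
// another 30 minutes, which gives the sliding window.
function cookiesForAction(visitorId, sessionId) {
  return [
    buildCookie('visitor_id', visitorId, FIVE_YEARS_S),
    buildCookie('session_id', sessionId, THIRTY_MINUTES_S),
  ];
}
```

In a browser, each returned string would be assigned to `document.cookie` in turn; the same JS can then read the session ID back before calling other APIs.
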

So because JS on the site owner's domain is controlling the setting of the cookie value, that JS can retrieve the cookie value before it interacts with our other public APIs like our Search service. The JS running on the site is the source of truth for the session's ID, and it can provide that value when beacons are sent (included automatically) or when it calls our public APIs (by pulling it from the cookie and including it as a parameter). It's a scalable, one way flow of data where our server doesn't need to stitch the beacons together and doesn't need to perform any synchronization.

Setting cookies can be a bit tricky with embedded AMP documents due to 3rd party cookie restrictions. When embedded in a viewer, AMP docs are served in a cross-domain iframe, which introduces 3rd party cookie restrictions.

Because of the cookie restrictions, the only option I see here is to use localStorage.

This clears things up a bit for me. I'm not an expert on AMP (just discovered it a few months ago), but it sounds to me like implementing AMP Linker may have been a challenge. You guys would have had to take into account the restrictions imposed upon you when you implemented storing the CLIENT_ID value in a cookie.

I think to make this issue easier to deal with, we should separate out the contract and the implementation. The contract is just "I need the AMP runtime to generate a value that is stored somewhere it can be retrieved later, and transmitted via AMP Linker, with me being able to set an expiry time for the value". The implementations could be:

Just to reiterate, I understand that this contract only deals with the requirements my company has. I'm open to collaborating to come up with a standard for client side sessionization that addresses everyone's needs.

zhouyx commented 5 years ago

Thanks for the detailed explanation! cc @jeffjose. Jeff let's sync on this requirement next week.

So because JS on the site owner's domain is controlling the setting of the cookie value, that JS can retrieve the cookie value before it interacts with our other public APIs like our Search service.

If I get it correctly, here you are talking about non AMP pages. The session id is set by an AMP doc and passed to the NON AMP page via Linker param? Custom JS is only allowed with many restrictions on AMP. Thanks.

mattwelke commented 5 years ago

If I get it correctly, here you are talking about non AMP pages. The session id is set by an AMP doc and passed to the NON AMP page via Linker param? Custom JS is only allowed with many restrictions on AMP. Thanks.

Yep, this is how we do it on non-AMP pages. First party cookie is the source of truth, and it's mutated and retrieved with first party JS.

For our new system, if we had a new "Session ID" feature work like CLIENT_ID does now:

This assumes my understanding of the types of journeys users can take is correct, and how the cookies and AMP runtime/AMP linker interact is correct.

jeffjose commented 5 years ago

@welkie and others - thanks for providing context and rationale. @zhouyx and I chatted about this a bit offline and we think it'll be a good feature to add.

We'll add this request to the list of outstanding requests. At this moment, we're looking at working on this in Q4 2019.

mattwelke commented 4 years ago

@jeffjose Was wondering whether you guys have started to work on this in your roadmap. If you haven't had a chance to start on it yet, there's a chance that I'd be able to have dedicated dev time to help out. We're planning to simplify parts of our system in Q1 or Q2 2020 and being able to stop stitching sessions together server side by having AMP support a client side auto-expiring session ID would help with that.

amedina commented 4 years ago

@jeffjose is there an update on this request?

zhouyx commented 4 years ago

Synced with @jeffjose. We decided to pick up the work. Please review the I2I at #29324. Thanks.

adamsilverstein commented 3 years ago

Is work on this issue still on your roadmap?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.