WebMemex / webmemex-extension

📇 Your digital memory extension, as a browser extension
https://webmemex.org

Location changes by web-apps (through history API) are not logged as multiple pages/visits #50

Closed rusuo closed 6 years ago

rusuo commented 7 years ago
  1. Go to YouTube (one entry will be created in the history)
  2. Select various songs, or leave the playlist to continue to the next song on its own

Actual results: There is only one entry created (the initial one), no other entries under the same domain appear. Note: If I open a new tab then it will be created as a new entry.

Expected results: A new entry is created for every different page visited

blackforestboi commented 7 years ago

Also worth mentioning: this only happens on a few sites, for example YouTube, GitHub, or Twitter.

edit: I have a hunch that it may have something to do with web apps, because pages like https://shuttleworthfoundation.org do not have the problem. This also happens with the WorldBrain extension.

Treora commented 7 years ago

Thanks for reporting! This is easily explainable, as the activity-logger currently listens to the browser.webNavigation.onCommitted event. We should also start listening to onHistoryStateUpdated events to catch these location changes that web-apps perform (as Oliver rightly suspected). Possibly just three lines of code for now, though it needs to be tested, and we will probably have to handle such location changes somewhat differently later on (e.g. when deduplicating pages).
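For illustration, a minimal sketch of what listening to both events could look like. This is not the extension's actual code: `listenForNavigations` and `logVisit` are hypothetical names, and skipping subframes is an assumption added for the example. The `webNavigation` object is passed in so the function can be exercised outside a browser.

```javascript
// Sketch only: `listenForNavigations` and `logVisit` are hypothetical names,
// not part of the extension's codebase.
function listenForNavigations(webNavigation, logVisit) {
  const handler = (details) => {
    // Assumption: only top-level navigations matter (frameId 0 is the main frame).
    if (details.frameId !== 0) return;
    logVisit({ tabId: details.tabId, url: details.url });
  };
  // Regular page loads.
  webNavigation.onCommitted.addListener(handler);
  // Location changes made by web apps through the history API.
  webNavigation.onHistoryStateUpdated.addListener(handler);
  return handler;
}

// In a background script this might be wired up as:
// listenForNavigations(browser.webNavigation, (visit) => console.log(visit));
```

As the rest of this thread shows, logging every such event unfiltered tends to produce duplicates, so some filtering would likely be needed on top of this.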

blackforestboi commented 7 years ago

@Treora is this something that can be marked as a newcomer issue?

gastonche commented 7 years ago

I'll get to work on this right away. If I understand the task correctly, I'll have to listen for onHistoryStateUpdated in the browser and use it to add a new page to the extension.

mukeshkharita commented 7 years ago

Hello, I was trying to solve this issue. I added this code in activity-logger/background/index.js: https://paste.ubuntu.com/25170020/ but it stores the link twice. How can I resolve this problem?

Treora commented 7 years ago

Hi Mukesh; cool that you took the initiative to fix this. My original idea of solving this with 'three lines of code' (roughly like your solution) may have been a bit optimistic, however. Applications may update their history state many times, and not all of those updates are worth logging (exact duplicates like the ones you report are one example). You may like to see the approach Gaston explored (#59, see line 43), which checks if the URL+tab combination had already been stored.
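As a rough illustration of the URL+tab idea (not the code from #59, which is not reproduced here), one could remember every combination already stored and skip repeats. `createVisitFilter` and `isNewVisit` are hypothetical names.

```javascript
// Sketch only: hypothetical helper, not taken from #59.
function createVisitFilter() {
  const seen = new Set();
  return function isNewVisit({ tabId, url }) {
    // JSON.stringify gives an unambiguous key for the (tabId, url) pair.
    const key = JSON.stringify([tabId, url]);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  };
}

// Possible use inside an event listener:
// const isNewVisit = createVisitFilter();
// if (isNewVisit(details)) { /* store the page */ }
```

One consequence of this simple form is that returning to an earlier URL in the same tab is also treated as a duplicate, which may or may not be desirable.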

I now realise there may be an obvious criterion to tell whether a history state update is worth storing: if it used [pushState](https://developer.mozilla.org/en-US/docs/Web/API/History_API#The_pushState()_method) to create a new entry in the history, it is worth storing, and if it used [replaceState](https://developer.mozilla.org/en-US/docs/Web/API/History_API#The_replaceState()_method) it is not. The question is how to know which one was used, as it does not seem to be passed in the event details. Perhaps we could somehow find this out from the history API, or in the worst case we might have to use a content script that spies on the pushState/replaceState functions as they are called.
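To make the 'spying' option concrete, here is a hedged sketch of wrapping the two functions so that a callback learns which one was used. `spyOnHistory` and `onChange` are hypothetical names. Note that content scripts normally run in a context isolated from the page's own scripts, so a wrapper like this would presumably have to be injected into the page itself to observe the web app's calls; that part is not shown.

```javascript
// Sketch only: hypothetical helper. It wraps pushState and replaceState on
// the given history object and reports each call to `onChange`.
function spyOnHistory(history, onChange) {
  for (const method of ["pushState", "replaceState"]) {
    const original = history[method];
    history[method] = function (state, title, url) {
      // Call the original first so the page behaves exactly as before.
      const result = original.apply(this, arguments);
      onChange({ method, url });
      return result;
    };
  }
}

// In the page's context this might be used as:
// spyOnHistory(window.history, ({ method, url }) => {
//   if (method === "pushState") { /* a new history entry: worth storing */ }
// });
```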

To me, neither of these solutions seems worth the complexity at the moment, although I'd be happy to see your results if you explore either of them; it would be interesting research.

It may be worth noting that logging every visited page has been demoted to a secondary feature of this extension. It would only be a pleasant feature if it were made a bit smarter, e.g. by ignoring page refreshes and revisits, redirection pages, 404s, etcetera; and even then, the user demand for storing every visited page seems small. It seemed more tactical to me to first focus on manually storing pages.

By the way, there is still a separate problem with history state updates: there is no way to know when an update has completed. It would be pleasant if web-apps emitted some event when they are done loading their page, but as far as I know, there is no standard way for them to report this. We could try to guess when the application is ready by using some heuristics, for example by listening to network activity and DOM changes, but I don't see a general solution.
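One such heuristic, sketched under stated assumptions: treat the page as ready once no activity (DOM mutation, network request) has been observed for some quiet period. `createSettleDetector`, `onSettled`, and the one-second default are all made up for this example, and the guess can easily be wrong for pages that keep updating.

```javascript
// Sketch only: hypothetical "quiet period" detector. Call the returned
// function on every observed activity; `onSettled` fires once no activity
// has been reported for `quietMs` milliseconds.
function createSettleDetector(
  onSettled,
  { quietMs = 1000, setTimer = setTimeout, clearTimer = clearTimeout } = {}
) {
  let pending = null;
  return function activity() {
    // Any new activity restarts the countdown.
    if (pending !== null) clearTimer(pending);
    pending = setTimer(() => {
      pending = null;
      onSettled();
    }, quietMs);
  };
}

// In a content script this might be fed by DOM changes:
// const activity = createSettleDetector(() => console.log("probably ready"));
// new MutationObserver(activity).observe(document, { childList: true, subtree: true });
// activity(); // start the countdown even if nothing changes
```

The timer functions are injectable only so the logic can be tested without waiting; in a real page the defaults would be used.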

Treora commented 6 years ago

Closing this issue; automatically logging visited pages is off the roadmap now, so this problem no longer exists.